Canonicalization for cell arrays #552
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,131 @@ | ||
# RFC: Canonicalization for H3 cell sets | ||
|
||
* **Authors**: AJ Friend | ||
* **Date**: 2021-12-122 | ||
* **Status**: Draft | ||
|
||
## Abstract | ||
|
||
We propose a canonical form for **sets** of H3 cells based on | ||
the "lower 52 bit" ordering. We also introduce fast "spatial join" operations | ||
on the cell sets (like cell-in-set, intersection, union, etc.) that exploit | ||
the canonical structure for speed gains. | ||
|
||
|
||
## Motivation | ||
|
||
A canonical form for cell sets is useful when testing if two sets are equal. | ||
That is, we'd like to be able to tell if two H3 cell arrays represent | ||
the same mathematical set of cells, ignoring ordering or duplicated cells. | ||
|
||
If we have a function to canonicalize an H3 cell array, then we would | ||
consider two arrays to be equivalent (as sets) if they each canonicalize | ||
to the exact same (canonical) cell array. | ||
|
||
A canonical form is also useful if a user wanted to deterministically | ||
hash H3 cell sets, and wanted the hash to be independent of ordering | ||
or duplicates. | ||
|
||
The canonical form we'll propose also has the added benefit of allowing | ||
for fast "spatial join" operations on canonicalized sets. For example, | ||
we'll be able to do a fast binary search to see if a cell is a member | ||
of a set, and an **even faster** binary search if the set is both | ||
compacted and canonicalized. | ||
|
||
We'll get the same benefits computing the intersection of sets, or | ||
simply testing for intersection. | ||
|
||
The canonical form also suggests a new "in-memory" cell compaction algorithm, | ||
which avoids any dynamic memory allocation. This new compact algorithm | ||
has the added benefit of returning cell arrays already in canonical form. | ||
|
||
## Terminology | ||
|
||
We propose a canonical form based on the "lower 52 bit" ordering, that is, | ||
the ordering you would get if you only considered the lower 52 bits of the | ||
H3 cell indexes. The lower 52 bits of an H3 index consist of 7 bits for the | ||
base cell, and 3 bits for each of the 15 resolution digits. That sums up | ||
to `7 + 3*15 = 52`. | ||
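
As a rough illustration of that layout, the sketch below pulls the base cell and the individual resolution digits out of the lower 52 bits. The helper names (`low52`, `baseCellOf`, `digitOf`) are hypothetical and not part of the H3 API, and the sketch assumes that `H3Index` is a 64-bit unsigned integer and that digit 1 occupies the most significant 3-bit slot below the base cell.

```c
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: H3 indexes are 64-bit unsigned integers

// Lower 52 bits: 7 base-cell bits followed by 15 three-bit resolution digits.
static uint64_t low52(H3Index h) {
    return h & ((UINT64_C(1) << 52) - 1);
}

// Base cell number, stored in the top 7 of the lower 52 bits (bits 45..51).
static int baseCellOf(H3Index h) {
    return (int)((h >> 45) & 0x7F);
}

// Digit for resolution r (1..15); digit 1 is assumed to be the most significant.
static int digitOf(H3Index h, int r) {
    return (int)((h >> (3 * (15 - r))) & 0x7);
}
```

Digits beyond a cell's own resolution read back as 7 under this layout.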
|
||
We'll only define this ordering for H3 **cells**. We're not considering | ||
vertices or edges in this RFC. | ||
|
||
The lower 52 bit ordering can be implemented, for example, by | ||
the `cmpLow52` comparison function given below. | ||
|
||
|
||
```c | ||
int cmpLow52(H3Index a, H3Index b) { | ||
    // Shift out the upper 12 bits so that only the lower 52 bits are compared. | ||
    a <<= 12; | ||
    b <<= 12; | ||
|
||
    if (a < b) return -1; | ||
    if (a > b) return +1; | ||
    return 0; | ||
} | ||
``` | ||
|
||
This ordering has the property that child cells are always less than | ||
their parent cells, because a parent's unused resolution digits are all set to 7, | ||
the largest 3-bit value. In a sorted array with cells of multiple resolutions, | ||
child cells are always to the left of their parents. | ||
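
A small sketch of the child-before-parent property follows. `makeCell` is a hypothetical helper (not part of the H3 API) that assembles an index from a resolution, a base cell, and a digit list, assuming the standard layout with mode bits at 59..62, resolution at bits 52..55, and unused digits set to 7.

```c
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: 64-bit unsigned H3 index

int cmpLow52(H3Index a, H3Index b) {
    a <<= 12;  // drop the upper 12 bits
    b <<= 12;
    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}

// Hypothetical helper: build a cell index from its base cell and digits.
static H3Index makeCell(int res, int baseCell, const int *digits) {
    H3Index h = (UINT64_C(1) << 59)          // mode 1 = cell (assumed layout)
              | ((uint64_t)res << 52)        // resolution field
              | ((uint64_t)baseCell << 45)   // base cell
              | ((UINT64_C(1) << 45) - 1);   // all 15 digits start as 7
    for (int r = 1; r <= res; r++) {
        int shift = 3 * (15 - r);
        h &= ~(UINT64_C(7) << shift);            // clear digit r
        h |= (uint64_t)digits[r - 1] << shift;   // write digit r
    }
    return h;
}
```

Comparing a resolution-2 cell built this way against its resolution-1 parent with `cmpLow52` puts the child first, since the child's second digit is smaller than the parent's placeholder 7.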
|
||
We can also get slightly richer ordering information with a comparison function | ||
with a declaration like | ||
|
||
```c | ||
int cmpCanon(H3Index a, H3Index b); | ||
``` | ||
|
||
defined so that: | ||
|
||
- `cmpCanon(a, b) == 0` if `a` and `b` are the same cell | ||
- `cmpCanon(a, b) == -1` if `a` is a child (or further descendant) of `b` | ||
- `cmpCanon(a, b) == +1` if `b` ... `a` | ||
- `cmpCanon(a, b) == -2` if `a` < `b` in the low52 ordering, but they are not related | ||
- `cmpCanon(a, b) == +2` if `b` < `a` ... | ||
Review comment: What's the use case for this?
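
One possible way to realize the `cmpCanon` declaration is sketched below. It is not taken from the RFC: the helpers `resOf` and `low52AtRes` are hypothetical, and the sketch assumes valid cell indexes with the resolution stored in bits 52..55 and unused digits set to 7.

```c
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: 64-bit unsigned H3 index

#define LOW52_MASK ((UINT64_C(1) << 52) - 1)

// Resolution field (assumed to live in bits 52..55).
static int resOf(H3Index h) { return (int)((h >> 52) & 0xF); }

// Lower 52 bits of h with every digit past `res` forced to 7,
// i.e. the lower 52 bits of h's ancestor at resolution `res`.
static uint64_t low52AtRes(H3Index h, int res) {
    return (h & LOW52_MASK) | ((UINT64_C(1) << (3 * (15 - res))) - 1);
}

int cmpCanon(H3Index a, H3Index b) {
    uint64_t la = a & LOW52_MASK;
    uint64_t lb = b & LOW52_MASK;
    int ra = resOf(a);
    int rb = resOf(b);

    if (la == lb) return 0;                                  // same cell
    if (ra > rb && low52AtRes(a, rb) == lb) return -1;       // a descends from b
    if (rb > ra && low52AtRes(b, ra) == la) return +1;       // b descends from a
    return (la < lb) ? -2 : +2;                              // unrelated cells
}
```

The sign of the result always matches `cmpLow52`; the magnitude only adds whether the two cells are related.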
||
|
||
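Note that these two functions always agree in sign, so they produce the same | ||
ordering when used as the comparator for the C standard library's `qsort` | ||
(through a wrapper that takes `const void *` arguments). | ||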
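
Because `qsort` expects a comparator that takes `const void *` arguments, a thin adapter is needed. The sketch below shows one way to do it; the names `cmpLow52Void` and `sortLow52` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t H3Index;  // assumption: 64-bit unsigned H3 index

int cmpLow52(H3Index a, H3Index b) {
    a <<= 12;  // drop the upper 12 bits
    b <<= 12;
    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}

// Adapter with the signature that qsort requires.
static int cmpLow52Void(const void *pa, const void *pb) {
    return cmpLow52(*(const H3Index *)pa, *(const H3Index *)pb);
}

// Sort an array of cells in place into the lower 52 bit ordering.
void sortLow52(H3Index *cells, size_t n) {
    qsort(cells, n, sizeof(H3Index), cmpLow52Void);
}
```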
|
||
### Array classifications | ||
|
||
Given these comparison functions, we can define 3 increasingly strict properties | ||
on arrays of H3 cells: | ||
|
||
1. "lower 52" ordered | ||
2. canonical | ||
3. compacted and canonical | ||
|
||
#### Low52 ordered | ||
Review comment: Nit: Not keen on this name. Maybe "cell digit ordered" or similar?

Review comment: I wanted to avoid just using "ordered", since there are potentially multiple different orderings someone might do, like the standard

What don't you like about lower 52? :) I liked that it was a very distinct name, so it was immediately clear to a reader that you're talking about a specific concept. It's also easier to code search. I'm worried that "cell digit ordered" is a bit too generic of a name.

Review comment: I think I'd prefer the actual ordering to be opaque to the end user - the idea is:

Beyond that, the user shouldn't care. Calling this "Low52" (very concrete), "Canonical" (completely opaque), and "Compacted Canonical" (partly concrete, partly opaque) just seems to invite confusion for the user about what they should use - the names here are about the implementation, not the end use. Treating all of the formats as opaque helps to ensure that they are used only with appropriate functions.

Review comment: I think I'm convinced to keep things opaque. And now I'm considering getting rid of "L1", as I don't see any use cases. L1 was more of a by-product of me figuring out how to get things working.

As far as the algorithms are concerned, they won't care if a set is compacted or not as long as it is canonical. Because of that, I might also avoid a separate type for "L3" and just leave it to the user to keep track of whether a set is compacted. (I also don't think there's an obvious/easy test for if a set is compacted; you basically just have to run through the compact logic again and check that there are no changes.)

With that in mind, what would you think of something like this (modulo names):

```c
typedef struct {
    H3Index *cells;
    int64_t numCells;
} CellArray;

bool isCanonicalSet(CellArray A);
H3Error toCanonicalSet(CellArray *A);

// in-place compact algo; no dynamic memory allocation needed.
// result comes out canonical as a nice by-product.
H3Error canonicalCompact(CellArray *A);

// functions below work on canonical; are faster on canonical compacted.
bool setContains(CellArray A, H3Index h);
bool doSetsIntersect(CellArray A, CellArray B);
bool isSubset(CellArray A, CellArray B);
H3Error setIntersection(CellArray A, CellArray B, CellArray *C);
H3Error setUnion(CellArray A, CellArray B, CellArray *C);
```

Review comment: Those look good to me! For clarity, the set functions only work if the set is canonical, right? I'd love some way to enforce this at a type level, e.g. instead of

Review comment: So you'd have:

```c
typedef struct {
    H3Index *cells;
    int64_t numCells;
} CanonicalSet;

bool isCanonicalSet(H3Index *cells, int64_t numCells);
H3Error toCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

// in-place compact algo; no dynamic memory allocation needed.
// result comes out canonical as a nice by-product.
H3Error toCompactCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

// functions below work on canonical; are faster on canonical compacted.
bool setContains(CanonicalSet A, H3Index h);
bool doSetsIntersect(CanonicalSet A, CanonicalSet B);
bool isSubset(CanonicalSet A, CanonicalSet B);
H3Error setIntersection(CanonicalSet A, CanonicalSet B, CanonicalSet *out);
H3Error setUnion(CanonicalSet A, CanonicalSet B, CanonicalSet *out);
```

A couple of questions here:

Review comment:

Yes. We could write up versions that work on sorted but not canonical sets, but I don't think it is worth it. They're mostly the same; the non-canonical but sorted sets just introduce a few extra annoying edge cases you have to consider.

I was thinking about this too. My hesitation (and why I originally wrote it as

A set of cells is different from its compact representation (that compact representation could be uncompacted to multiple different resolutions, for example). And if users want to uniquely identify a set of cells with a hash, I think we still want to provide them with a way to get a canonical representation of any set of cells. I'm imagining situations where uncompacted sets of cells are efficiently stored as a tuple

Agreed. But it is actually the sum of the set sizes in both cases, and I was thinking that was easy enough to remember for now. But you're probably right in that we should provide functions so users don't need to know that. And it's the sum, even for intersection, because things get weird when you start working with compact canonical sets. (Maybe actually, the worst-case bound is the sum of the set sizes minus 2?)

And it might be possible to have a slightly expensive function that computes the exact intersection size so you could allocate the exact amount of space needed, but that would result in basically running the intersection algorithm twice (I think). But maybe that's worth it?
||
|
||
An H3 cell array `a` is "low52 ordered" if every adjacent pair of elements satisfies | ||
|
||
- `cmpLow52(a[i-1], a[i]) <= 0` or, equivalently, | ||
- `cmpCanon(a[i-1], a[i]) <= 0`. | ||
|
||
Note that in this classification, arrays can have duplicated cells. They can also | ||
contain the parents, children, ancestors, or descendants of other cells in | ||
the array. | ||
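
The low52 ordered property can be checked with a single pass over adjacent pairs. The sketch below is a minimal version; the function name `isLow52Ordered` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: 64-bit unsigned H3 index

int cmpLow52(H3Index a, H3Index b) {
    a <<= 12;  // drop the upper 12 bits
    b <<= 12;
    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}

// True if every adjacent pair is in non-decreasing low52 order.
// Duplicates and related cells are allowed at this level.
bool isLow52Ordered(const H3Index *a, int64_t n) {
    for (int64_t i = 1; i < n; i++) {
        if (cmpLow52(a[i - 1], a[i]) > 0) return false;
    }
    return true;
}
```

Empty and single-element arrays pass trivially, since they have no adjacent pairs.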
|
||
#### Canonical | ||
|
||
We'll define a "canonical" H3 cell array to be one that is low52 ordered and | ||
has the additional property that no duplicates, parents, children, ancestors, | ||
or descendants of other cells are in the array. | ||
|
||
We can check this property by ensuring that | ||
|
||
```c | ||
cmpCanon(a[i-1], a[i]) == -2 | ||
``` | ||
|
||
for each adjacent pair of cells in the array. | ||

Review comment: Again, naming - maybe just "cmpOrdered" and "cmpOrderedSet"? Either way, I'd prefer
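
The canonical check described in this section (every adjacent pair compares as `-2`) can be sketched as a single linear pass. The `cmpCanon` body here is only one possible implementation, and `resOf`, `low52AtRes`, and `isCanonicalArray` are hypothetical names. The sketch assumes valid cell indexes with the resolution in bits 52..55 and unused digits set to 7.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: 64-bit unsigned H3 index

#define LOW52_MASK ((UINT64_C(1) << 52) - 1)

static int resOf(H3Index h) { return (int)((h >> 52) & 0xF); }

// Lower 52 bits of h's ancestor at resolution `res` (digits past `res` set to 7).
static uint64_t low52AtRes(H3Index h, int res) {
    return (h & LOW52_MASK) | ((UINT64_C(1) << (3 * (15 - res))) - 1);
}

int cmpCanon(H3Index a, H3Index b) {
    uint64_t la = a & LOW52_MASK, lb = b & LOW52_MASK;
    int ra = resOf(a), rb = resOf(b);
    if (la == lb) return 0;                             // same cell
    if (ra > rb && low52AtRes(a, rb) == lb) return -1;  // a descends from b
    if (rb > ra && low52AtRes(b, ra) == la) return +1;  // b descends from a
    return (la < lb) ? -2 : +2;                         // unrelated
}

// True if the array is sorted and no two adjacent cells are equal or related.
bool isCanonicalArray(const H3Index *a, int64_t n) {
    for (int64_t i = 1; i < n; i++) {
        if (cmpCanon(a[i - 1], a[i]) != -2) return false;
    }
    return true;
}
```

Checking only adjacent pairs is enough in this sketch because, in the low52 ordering, a descendant sorts directly before its ancestor with nothing but other descendants in between.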
|
||
#### Compacted and canonical | ||
|
||
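A compacted and canonical H3 set is a canonical array that is also compacted: | ||
no complete group of sibling cells remains that could be replaced by its parent. | ||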
|
||
Many of the fast spatial join operations will work on canonical sets, but | ||
will be faster on compacted canonical sets. | ||
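
To illustrate how the canonical structure supports a fast membership test, here is a sketch of a binary search over a canonical array. It is an assumption-laden illustration, not the RFC's algorithm: the name `canonicalContains` is hypothetical, and it treats a cell as contained only when the array holds that cell itself or one of its ancestors.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: 64-bit unsigned H3 index

#define LOW52_MASK ((UINT64_C(1) << 52) - 1)

static int resOf(H3Index h) { return (int)((h >> 52) & 0xF); }

// Lower 52 bits of h's ancestor at resolution `res` (digits past `res` set to 7).
static uint64_t low52AtRes(H3Index h, int res) {
    return (h & LOW52_MASK) | ((UINT64_C(1) << (3 * (15 - res))) - 1);
}

int cmpCanon(H3Index a, H3Index b) {
    uint64_t la = a & LOW52_MASK, lb = b & LOW52_MASK;
    int ra = resOf(a), rb = resOf(b);
    if (la == lb) return 0;
    if (ra > rb && low52AtRes(a, rb) == lb) return -1;
    if (rb > ra && low52AtRes(b, ra) == la) return +1;
    return (la < lb) ? -2 : +2;
}

// Binary search in a canonical array `a` of length n.
// Assumed semantics: h is contained if `a` holds h or an ancestor of h.
bool canonicalContains(const H3Index *a, int64_t n, H3Index h) {
    int64_t lo = 0, hi = n;  // search the half-open range [lo, hi)
    while (lo < hi) {
        int64_t mid = lo + (hi - lo) / 2;
        int c = cmpCanon(h, a[mid]);
        if (c == 0 || c == -1) return true;  // h equals or descends from a[mid]
        if (c == +1) return false;           // a[mid] descends from h: only partial cover
        if (c == -2) hi = mid;               // h sorts before a[mid]
        else lo = mid + 1;                   // h sorts after a[mid]
    }
    return false;
}
```

On a compacted canonical array the same search runs over fewer elements, which is where the extra speed would come from.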
|
||
## Proposal | ||
|
Review comment: nit: closing `}`