Canonicalization for cell arrays #552

Draft
wants to merge 35 commits into
base: master
a2a61f8
function declarations
ajfriend Dec 22, 2021
7ce1842
rfc
ajfriend Dec 22, 2021
ddeaf42
types of arrays
ajfriend Dec 22, 2021
1c741bd
merp
ajfriend Dec 22, 2021
f1a6e9a
implementations
ajfriend Dec 23, 2021
d372101
intersection test
ajfriend Dec 23, 2021
b9d5af2
better with bool
ajfriend Dec 23, 2021
f87297b
wayLessThan
ajfriend Dec 23, 2021
8c90200
ensureASmaller
ajfriend Dec 23, 2021
d7bf43e
ternary
ajfriend Dec 23, 2021
cbf69b3
merp
ajfriend Dec 23, 2021
8b07372
notes
ajfriend Dec 23, 2021
81d7e2f
intersectTheyDo_slow
ajfriend Dec 23, 2021
b55c886
formatting
ajfriend Dec 23, 2021
6368edf
add some tests
ajfriend Dec 25, 2021
34e6bf2
trying some stuff
ajfriend Dec 25, 2021
04c853d
clean up tests
ajfriend Dec 25, 2021
0355ce7
ring_intersect
ajfriend Dec 25, 2021
39cb4e0
overlapping disks
ajfriend Dec 25, 2021
fc2da34
h3api.h
ajfriend Dec 25, 2021
bec6d6b
oops, wrong one
ajfriend Dec 25, 2021
8076474
try again
ajfriend Dec 25, 2021
0e810f3
H3_EXPORT might be the trick
ajfriend Dec 25, 2021
5a6c581
H3_EXPORT all the things
ajfriend Dec 25, 2021
948d2e8
one last straggler
ajfriend Dec 25, 2021
201c6c1
some clean up
ajfriend Dec 25, 2021
24f6e2b
cleaner
ajfriend Dec 25, 2021
26d02dc
trying out t_isLow52 and t_isCanon
ajfriend Dec 25, 2021
a9e1777
t_intersect
ajfriend Dec 25, 2021
3eed379
clean up helper functions
ajfriend Dec 25, 2021
d624f97
sets use capital letters
ajfriend Dec 25, 2021
5475831
t_intersects
ajfriend Dec 26, 2021
ea81e27
do some simpler flipping between left and right side of A
ajfriend Dec 26, 2021
6e3fcaf
simplify input for disjointInsertionPoint
ajfriend Dec 26, 2021
428854c
tricky ring tests
ajfriend Dec 26, 2021
5 changes: 4 additions & 1 deletion CMakeLists.txt
@@ -137,7 +137,8 @@ set(LIB_SOURCE_FILES
src/h3lib/lib/iterators.c
src/h3lib/lib/vertexGraph.c
src/h3lib/lib/faceijk.c
- src/h3lib/lib/baseCells.c)
+ src/h3lib/lib/baseCells.c
+ src/h3lib/lib/low52.c)
set(APP_SOURCE_FILES
src/apps/applib/include/kml.h
src/apps/applib/include/benchmark.h
@@ -210,6 +211,7 @@ set(OTHER_SOURCE_FILES
src/apps/testapps/testCoordIjk.c
src/apps/testapps/testH3Memory.c
src/apps/testapps/testH3Iterators.c
+ src/apps/testapps/testLow52.c
src/apps/miscapps/cellToBoundaryHier.c
src/apps/miscapps/cellToLatLngHier.c
src/apps/miscapps/generateBaseCellNeighbors.c
@@ -607,6 +609,7 @@ if(H3_IS_ROOT_PROJECT AND BUILD_TESTING)
add_h3_test(testBaseCells src/apps/testapps/testBaseCells.c)
add_h3_test(testPentagonIndexes src/apps/testapps/testPentagonIndexes.c)
add_h3_test(testH3Iterators src/apps/testapps/testH3Iterators.c)
+ add_h3_test(testLow52 src/apps/testapps/testLow52.c)

add_h3_test_with_arg(testH3NeighborRotations src/apps/testapps/testH3NeighborRotations.c 0)
add_h3_test_with_arg(testH3NeighborRotations src/apps/testapps/testH3NeighborRotations.c 1)
131 changes: 131 additions & 0 deletions dev-docs/RFCs/canonicalization.md
@@ -0,0 +1,131 @@
# RFC: Canonicalization for H3 cell sets

* **Authors**: AJ Friend
* **Date**: 2021-12-22
* **Status**: Draft

## Abstract

We propose a canonical form for **sets** of H3 cells based on
the "lower 52 bit" ordering. We also introduce fast "spatial join" operations
on the cell sets (like cell-in-set, intersection, union, etc.) that exploit
the canonical structure for speed gains.


## Motivation

A canonical form for cell sets is useful when testing if two sets are equal.
That is, we'd like to be able to tell if two H3 cell arrays represent
the same mathematical set of cells, ignoring ordering or duplicated cells.

If we have a function to canonicalize an H3 cell array, then we would
consider two arrays to be equivalent (as sets) if they each canonicalize
to the exact same (canonical) cell array.

A canonical form is also useful if a user wanted to deterministically
hash H3 cell sets, and wanted the hash to be independent of ordering
or duplicates.

The canonical form we'll propose also has the added benefit of allowing
for fast "spatial join" operations on canonicalized sets. For example,
we'll be able to do a fast binary search to see if a cell is a member
of a set, and an **even faster** binary search if the set is both
compacted and canonicalized.

We'll get the same benefits computing the intersection of sets, or
simply testing for intersection.

The canonical form also suggests a new "in-memory" cell compaction algorithm,
which avoids any dynamic memory allocation. This new compact algorithm
has the added benefit of returning cell arrays already in canonical form.

## Terminology

We propose a canonical form based on the "lower 52 bit" ordering, that is,
the ordering you would get if you only considered the lower 52 bits of the
H3 cell indexes. The lower 52 bits of an H3 index consist of 7 bits for the
base cell, and 3 bits for each of the 15 resolution digits. That sums up
to `7 + 3*15 = 52`.

We'll only define this ordering for H3 **cells**. We're not considering
vertices or edges in this RFC.

The lower 52 bit ordering can be implemented, for example, by
the `cmpLow52` comparison function given below.


```c
int cmpLow52(H3Index a, H3Index b) {
    // Shift out the upper 12 bits so only the lower 52 bits are compared.
    a <<= 12;
    b <<= 12;

    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}
```

Collaborator

nit: closing }


This ordering has the property that child cells are always less than
their parent cells, because a parent's unused resolution digits are all set to 7.
In a low52-ordered array with cells of multiple resolutions,
child cells are always to the left of their parents.
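
As a rough illustration of this ordering, the sketch below sorts a small array with `qsort` using `cmpLow52`. The adapter `cmpLow52Qsort` and the wrapper `sortLow52` are hypothetical names that are not part of this PR, and `H3Index` is redefined locally as a `uint64_t` so the snippet compiles without the H3 headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t H3Index;  // assumption: mirrors the typedef in h3api.h

// Compare two cells using only their lower 52 bits
// (7 base cell bits followed by 15 three-bit resolution digits).
int cmpLow52(H3Index a, H3Index b) {
    a <<= 12;
    b <<= 12;

    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}

// Hypothetical adapter matching the comparator signature qsort expects.
int cmpLow52Qsort(const void *pa, const void *pb) {
    return cmpLow52(*(const H3Index *)pa, *(const H3Index *)pb);
}

// Hypothetical wrapper: sort an array of cells into low52 order, in place.
void sortLow52(H3Index *cells, size_t numCells) {
    qsort(cells, numCells, sizeof(H3Index), cmpLow52Qsort);
}
```

For instance, with the resolution 9 cell `0x8928308280fffff` and its resolution 8 parent `0x8828308281fffff`, `cmpLow52(child, parent)` is `-1`, so the child sorts first, even though the child is the larger value under a plain `uint64_t` comparison.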

We can also get slightly richer ordering information with a comparison function
with a declaration like

```c
int cmpCanon(H3Index a, H3Index b);
```

defined so that:

- `cmpCanon(a, b) == 0` if `a` and `b` are the same cell
- `cmpCanon(a, b) == -1` if `a` is a child (or further descendant) of `b`
- `cmpCanon(a, b) == +1` if `b` ... `a`
- `cmpCanon(a, b) == -2` if `a` < `b` in the low52 ordering, but they are not related
- `cmpCanon(a, b) == +2` if `b` < `a` ...

Collaborator

What's the use case for this?


Note that these two functions produce the same ordering when given to
the C standard library's `qsort`.
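
One possible way to implement `cmpCanon` is sketched below. It is not taken from the PR: the helpers `cellRes` and `isDescendantOf` are hypothetical, and the sketch assumes the standard H3 bit layout (resolution in bits 52 to 55, then the base cell and digits in the lower 52 bits) and valid cell indexes as input.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: mirrors the typedef in h3api.h

int cmpLow52(H3Index a, H3Index b) {
    a <<= 12;
    b <<= 12;
    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}

// Hypothetical helper: the resolution is stored in bits 52-55.
int cellRes(H3Index h) { return (int)((h >> 52) & 0xF); }

// Hypothetical helper: true if `a` is a strict descendant of `b`, i.e. `a`
// is finer than `b` and agrees with it on the base cell and on every digit
// up to b's resolution.
bool isDescendantOf(H3Index a, H3Index b) {
    int resB = cellRes(b);
    if (cellRes(a) <= resB) return false;

    // Drop the upper 12 bits, then drop the digits below b's resolution.
    int unusedBits = 3 * (15 - resB);
    return ((a << 12) >> (12 + unusedBits)) ==
           ((b << 12) >> (12 + unusedBits));
}

int cmpCanon(H3Index a, H3Index b) {
    int c = cmpLow52(a, b);
    if (c == 0) return 0;                   // same cell
    if (isDescendantOf(a, b)) return -1;    // a is inside b
    if (isDescendantOf(b, a)) return +1;    // b is inside a
    return 2 * c;                           // unrelated: -2 or +2
}
```

Because a descendant is always less than its ancestor in the low52 ordering, the sign of `cmpCanon` agrees with the sign of `cmpLow52` in every case, which is why both give the same result under `qsort`.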

### Array classifications

Given these comparison functions, we can define 3 increasingly strict properties
on arrays of H3 cells:

1. "lower 52" ordered
2. canonical
3. compacted and canonical

#### Low52 ordered

Collaborator

Nit: Not keen on this name. Maybe "cell digit ordered" or similar?

Contributor Author

I wanted to avoid just using "ordered", since there are potentially multiple different orderings someone might do, like the standard uint64_t ordering, for example.

What don't you like about lower 52? :) I liked that it was a very distinct name, so it was immediately clear to a reader that you're talking about a specific concept. It's also easier to code search. I'm worried that "cell digit ordered" is a bit too generic of a name.

Collaborator

I think I'd prefer the actual ordering to be opaque to the end user - the idea is:

  • We have a special, somewhat opaque format for sets of H3 indexes, and if you use it, you get access to these set functions.
  • We have 3 levels of canonicalization, L1, L2, L3. Each one is more expensive to apply than the last, but the subsequent runtime of the set functions is faster.

Beyond that, the user shouldn't care. Calling this "Low52" (very concrete), "Canonical" (completely opaque), and "Compacted Canonical" (partly concrete, partly opaque) just seems to invite confusion for the user about what they should use - the names here are about the implementation, not the end use. Treating all of the formats as opaque helps to ensure that they are used only with appropriate functions.

Contributor Author
@ajfriend, Jan 5, 2022

I think I'm convinced to keep things opaque. And now I'm considering getting rid of "L1", as I don't see any use cases. L1 was more of a by-product of me figuring out how to get things working.

As far as the algorithms are concerned, they won't care if a set is compacted or not as long as it is canonical. Because of that, I might also avoid a separate type for "L3" and just leave it to the user to keep track of whether a set is compacted. (I also don't think there's an obvious/easy test for if a set is compacted; you basically just have to run through the compact logic again and check that there are no changes.)

With that in mind, what would you think of something like this (modulo names):

typedef struct {
    H3Index *cells;
    int64_t numCells;
} CellArray;

bool isCanonicalSet(CellArray A);
H3Error toCanonicalSet(CellArray *A);

// in-place compact algo; no dynamic memory allocation needed.
// result comes out canonical as a nice by-product.
H3Error canonicalCompact(CellArray *A);

// functions below work on canonical; are faster on canonical compacted.
bool setContains(CellArray A, H3Index h);
bool doSetsIntersect(CellArray A, CellArray B);
bool isSubset(CellArray A, CellArray B);
H3Error setIntersection(CellArray A, CellArray B, CellArray *C);
H3Error setUnion(CellArray A, CellArray B, CellArray *C);

Collaborator

Those look good to me! For clarity, the set functions only work if the set is canonical, right? I'd love some way to enforce this at a type level, e.g. instead of CellArray call it CanonicalSet, then take the args H3Index *cells, int64_t numCells for the functions that don't have this requirement (isCanonicalSet, toCanonicalSet, canonicalCompact -- which BTW I'd call toCompactCanonicalSet). That way the user has to either pass their array through toCanonicalSet or toCompactCanonicalSet in order to use the set functions, or at least they need to explicitly create a CanonicalSet themselves, affirming that their input is canonical.

Collaborator

So you'd have:

typedef struct {
    H3Index *cells;
    int64_t numCells;
} CanonicalSet;

bool isCanonicalSet(H3Index *cells, int64_t numCells);
H3Error toCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

// in-place compact algo; no dynamic memory allocation needed.
// result comes out canonical as a nice by-product.
H3Error toCompactCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

// functions below work on canonical; are faster on canonical compacted.
bool setContains(CanonicalSet A, H3Index h);
bool doSetsIntersect(CanonicalSet A, CanonicalSet B);
bool isSubset(CanonicalSet A, CanonicalSet B);
H3Error setIntersection(CanonicalSet A, CanonicalSet B, CanonicalSet *out);
H3Error setUnion(CanonicalSet A, CanonicalSet B, CanonicalSet *out);

A couple of questions here:

  • If we're already making this tradeoff between pre-processing and fast operations, do we need the non-compact version? I guess the benefit is that you can get the original cells out, as long as they were unique.
  • In the intersection and union, how do we manage the memory for the output? I'm thinking we might want to offer helpers here like maxSetIntersectionSize (size of the larger set) and maxSetUnionSize (sum of the set sizes) to help callers allocate memory for the out set

Contributor Author

> Those look good to me! For clarity, the set functions only work if the set is canonical, right?

Yes. We could write up versions that work on sorted but not canonical sets, but I don't think it is worth it. They're mostly the same; the non-canonical but sorted sets just introduce a few extra annoying edge cases you have to consider.

> H3Error toCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

I was thinking about this too. My hesitation (and why I originally wrote it as H3Error toCanonicalSet(CellArray *A)) was that out would point to the same memory as cells (since the operation is in-place). Maybe not a big deal, but I worry about issues that come up with multiple references to the same memory, like double calls to free.

> • If we're already making this tradeoff between pre-processing and fast operations, do we need the non-compact version? I guess the benefit is that you can get the original cells out, as long as they were unique.

A set of cells is different from its compact representation (that compact representation could be uncompacted to multiple different resolutions, for example). And if users want to uniquely identify a set of cells with a hash, I think we still want to provide them with a way to get a canonical representation of any set of cells.

I'm imagining situations where uncompacted sets of cells are efficiently stored as a tuple (compacted set id, uncompact resolution), and we'd want a hash that distinguishes the compacted and uncompacted sets.

> • In the intersection and union, how do we manage the memory for the output? I'm thinking we might want to offer helpers here like maxSetIntersectionSize (size of the larger set) and maxSetUnionSize (sum of the set sizes) to help callers allocate memory for the out set

Agreed. But it is actually the sum of the set sizes in both cases, and I was thinking that was easy enough to remember for now. But you're probably right in that we should provide functions so users don't need to know that.

And it's the sum, even for intersection, because things get weird when you start working with compact canonical sets.
For example, the intersection of the first two sets (in the sense we're talking about) here is the third:

[Images: three screenshots (2022-01-05) showing the two input sets and their intersection]

(Maybe actually, the worst-case bound is the sum of the set sizes minus 2?)

And it might be possible to have a slightly expensive function that computes the exact intersection size so you could allocate the exact amount of space needed, but that would result in basically running the intersection algorithm twice (I think). But maybe that's worth it?


An H3 cell array `a` is "low52 ordered" if its elements are such that

- `cmpLow52(a[i-1], a[i]) <= 0` or, equivalently,
- `cmpCanon(a[i-1], a[i]) <= 0`.

Note that in this classification, arrays can have duplicated cells. We can also
have the parents, children, ancestors, or descendants of other cells in
the array.
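
A minimal sketch of this check, assuming the `cmpLow52` definition from above, might look like the following. The function name `isLow52Ordered` is hypothetical and `H3Index` is redefined locally so the snippet stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: mirrors the typedef in h3api.h

int cmpLow52(H3Index a, H3Index b) {
    a <<= 12;
    b <<= 12;
    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}

// Hypothetical helper: true if every adjacent pair is in non-decreasing
// low52 order. Duplicates and related cells are allowed at this level.
bool isLow52Ordered(const H3Index *cells, int64_t numCells) {
    for (int64_t i = 1; i < numCells; i++) {
        if (cmpLow52(cells[i - 1], cells[i]) > 0) return false;
    }
    return true;
}
```

Under this definition an array such as `{child, child, parent}` is low52 ordered, since both duplicates and ancestors are permitted as long as they appear in non-decreasing order.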

#### Canonical

We'll define a "canonical" H3 cell array to be one that is low52 ordered and
has the additional property that no duplicates, parents, children, ancestors,
or descendants of other cells are in the array.

We can check this property by ensuring that

```c
cmpCanon(a[i-1], a[i]) == -2
```

for each adjacent pair of cells in the array.

Collaborator

Again, naming - maybe just "cmpOrdered" and "cmpOrderedSet"?

Either way, I'd prefer Canonical to Canon (I think of a "canon" in this context as being a set of things, e.g "the Shakespeare canon" is the set of recognized works, each of which is canonical)
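
The adjacent-pair test for canonical arrays could be sketched as a single pass over the array. This is only an illustration: `isCanonicalArray` is a hypothetical name, and the `cmpCanon` body here is one possible implementation that assumes the standard H3 bit layout and valid cell indexes.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: mirrors the typedef in h3api.h

int cmpLow52(H3Index a, H3Index b) {
    a <<= 12;
    b <<= 12;
    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}

// Hypothetical helper: true if `a` is a strict descendant of `b`.
bool isDescendantOf(H3Index a, H3Index b) {
    int resA = (int)((a >> 52) & 0xF);
    int resB = (int)((b >> 52) & 0xF);
    if (resA <= resB) return false;
    int unusedBits = 3 * (15 - resB);  // digits below b's resolution
    return ((a << 12) >> (12 + unusedBits)) ==
           ((b << 12) >> (12 + unusedBits));
}

// One possible cmpCanon: 0 same, -1/+1 related, -2/+2 unrelated.
int cmpCanon(H3Index a, H3Index b) {
    int c = cmpLow52(a, b);
    if (c == 0) return 0;
    if (isDescendantOf(a, b)) return -1;
    if (isDescendantOf(b, a)) return +1;
    return 2 * c;
}

// Hypothetical helper: an array is canonical when every adjacent pair is
// strictly increasing in low52 order and the two cells are unrelated.
bool isCanonicalArray(const H3Index *cells, int64_t numCells) {
    for (int64_t i = 1; i < numCells; i++) {
        if (cmpCanon(cells[i - 1], cells[i]) != -2) return false;
    }
    return true;
}
```

Checking only adjacent pairs is enough here because the descendants of a cell occupy a contiguous range that ends at the cell itself in the low52 ordering, so any related pair in a sorted array forces some adjacent pair to be related or equal.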

#### Compacted and canonical

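A compacted and canonical H3 cell array is a canonical array that is also
compacted: wherever all children of a cell would appear, the parent cell
appears instead.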

Many of the fast spatial join operations will work on canonical sets, but
will be faster on compacted canonical sets.
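
To make the idea concrete, here is a hedged sketch of a cell-in-set test as a binary search over a canonical array. It is not the PR's implementation: the name `canonicalContains` and its raw-pointer signature are placeholders, the `cmpCanon` body is one possible implementation, and the sketch assumes that a cell counts as a member when it equals an element or is a descendant of one.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t H3Index;  // assumption: mirrors the typedef in h3api.h

int cmpLow52(H3Index a, H3Index b) {
    a <<= 12;
    b <<= 12;
    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}

// Hypothetical helper: true if `a` is a strict descendant of `b`.
bool isDescendantOf(H3Index a, H3Index b) {
    int resA = (int)((a >> 52) & 0xF);
    int resB = (int)((b >> 52) & 0xF);
    if (resA <= resB) return false;
    int unusedBits = 3 * (15 - resB);  // digits below b's resolution
    return ((a << 12) >> (12 + unusedBits)) ==
           ((b << 12) >> (12 + unusedBits));
}

// One possible cmpCanon: 0 same, -1/+1 related, -2/+2 unrelated.
int cmpCanon(H3Index a, H3Index b) {
    int c = cmpLow52(a, b);
    if (c == 0) return 0;
    if (isDescendantOf(a, b)) return -1;
    if (isDescendantOf(b, a)) return +1;
    return 2 * c;
}

// Placeholder name: binary search for `h` in a canonical array.
bool canonicalContains(const H3Index *cells, int64_t numCells, H3Index h) {
    int64_t lo = 0;
    int64_t hi = numCells;  // search window is [lo, hi)
    while (lo < hi) {
        int64_t mid = lo + (hi - lo) / 2;
        int c = cmpCanon(h, cells[mid]);
        if (c == 0 || c == -1) return true;  // h is cells[mid] or inside it
        if (c == +1) return false;  // an element lies inside h, so neither h
                                    // nor an ancestor of h can be an element
        if (c < 0) {
            hi = mid;       // h is unrelated and to the left
        } else {
            lo = mid + 1;   // h is unrelated and to the right
        }
    }
    return false;
}
```

A compacted set holds the same region in fewer elements, so the same search simply runs over a shorter array, which is one way to read the "faster on compacted canonical sets" claim.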

## Proposal
