
Spatial hashing.

Say that you have data that is identified with points in 2D or 3D space. The standard way that CS students learn in school to represent this is via a quadtree or octree. However, these tree data structures tend to have a lot of "fluff" (needless allocations and pointer chasing) and need a lot of work to be made efficient.

Spatial hashing is the stupidly simple solution of just rounding coordinates to a grid and storing nearby objects together in a hash table. As long as your data is reasonably well behaved then you get many of the advantages of a quadtree, and it's completely trivial to implement.
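A minimal sketch in Python (cell size, names, and the neighbour scan are illustrative, not a canonical implementation):

    from collections import defaultdict

    class SpatialHash:
        def __init__(self, cell_size=1.0):
            self.cell_size = cell_size
            self.cells = defaultdict(list)  # cell coords -> objects

        def _cell(self, x, y):
            # Round coordinates down to the containing grid cell.
            return (int(x // self.cell_size), int(y // self.cell_size))

        def insert(self, x, y, obj):
            self.cells[self._cell(x, y)].append((x, y, obj))

        def query_near(self, x, y):
            # Check the point's cell and its 8 neighbours; candidates
            # still need an exact distance test afterwards.
            cx, cy = self._cell(x, y)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    yield from self.cells.get((cx + dx, cy + dy), ())

Nearby-object queries scan the point's cell plus its eight neighbours, then do an exact test on the few candidates that come back.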



TBH even quadkeys are a fun answer to the OP's question; many people aren't aware of them.

Simple explanation:

If you have data with x y coordinates and you know the bounds.

To compute the quad key for a point:

1. The key starts as the empty string. (I've also seen it start with "Z" to handle points outside the bounds)

2. Divide the space into 4 quadrants

3. Determine which quadrant the point falls in, then append a letter (A-D, depending on the quadrant) to the key

4. Repeat step 3 using that quadrant's bounds (i.e. a recursively smaller area) until you have the desired accuracy

This can then be used to efficiently find points within rectangular bounds.
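A rough Python sketch of the steps above (the depth and letter assignment are arbitrary choices):

    def quad_key(x, y, bounds, depth=8):
        # bounds = (min_x, min_y, max_x, max_y); assumes the point is inside.
        min_x, min_y, max_x, max_y = bounds
        key = ""
        for _ in range(depth):
            mid_x = (min_x + max_x) / 2
            mid_y = (min_y + max_y) / 2
            if y < mid_y:
                if x < mid_x:
                    key += "A"; max_x, max_y = mid_x, mid_y
                else:
                    key += "B"; min_x, max_y = mid_x, mid_y
            else:
                if x < mid_x:
                    key += "C"; max_x, min_y = mid_x, mid_y
                else:
                    key += "D"; min_x, min_y = mid_x, mid_y
        return key

Because nearby points share key prefixes, a rectangular range query reduces to a handful of prefix scans over sorted keys.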


A lot of people don't know about the related segment and interval trees. Both let you efficiently find all stored intervals that overlap a given point, and the idea generalizes to higher-dimensional spaces.

For example if you have a set of calendar invites with a start and stop time (say tens of millions of them) and you want to find all invites which span 12pm (that is start before 12 and end after 12) you want an interval tree.

https://en.wikipedia.org/wiki/Interval_tree
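For a sense of how the stabbing query works, here is a minimal centered-interval-tree sketch in Python (illustrative only, not a production structure):

    class IntervalNode:
        def __init__(self, intervals):
            # intervals: list of (start, end) pairs with start <= end
            starts = sorted(s for s, e in intervals)
            self.center = starts[len(starts) // 2]
            crossing = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
            left = [iv for iv in intervals if iv[1] < self.center]
            right = [iv for iv in intervals if iv[0] > self.center]
            # Crossing intervals kept twice: sorted by start and by end.
            self.by_start = sorted(crossing)
            self.by_end = sorted(crossing, key=lambda iv: iv[1])
            self.left = IntervalNode(left) if left else None
            self.right = IntervalNode(right) if right else None

        def stab(self, point):
            # Yield all intervals containing `point`.
            if point < self.center:
                for s, e in self.by_start:
                    if s > point:
                        break
                    yield (s, e)
                if self.left:
                    yield from self.left.stab(point)
            else:
                for s, e in reversed(self.by_end):
                    if e < point:
                        break
                    yield (s, e)
                if self.right:
                    yield from self.right.stab(point)

    # Example: which invites span 12?
    tree = IntervalNode([(9, 11), (10, 14), (11.5, 13)])
    print(list(tree.stab(12)))  # [(10, 14), (11.5, 13)]

Each node stores the intervals crossing its center twice, sorted by start and by end, so a stab only scans entries that actually match.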


That's how MongoDB geospatial indexes work, IIRC.


Was about to mention this. If I recall correctly, the 2d-sphere index rounds geospatial coordinates to 5 decimals. Very occasionally, I found it would distort polygon geometries just enough to cause them to become invalid (e.g. overlapping geometries), which causes the index build to fail.

In my recent experience working with collections containing millions of documents, each with a GeoJSON-style polygon/multipolygon representing a property (i.e. a block of land), I found invalid geometries occurred in about 1 document in 1 million. For a while I suspected the data vendor was the cause, but it became more puzzling when other geospatial software confirmed the geometries were valid. Eventually we traced the issue to the 2d-sphere index.

A very clever workaround was suggested by a colleague of mine, inspired by [1]. It preserved the original geometries. In each document, we added a new field containing the geometry's extent. A 2d-sphere index was then built on the extent field instead of the original geometry field. Invalid geometries were no longer an issue since we were dealing with much simpler geometries that were substantially larger than the max precision of the index.

When running geoIntersects queries on our collection of millions of documents, we did so in 2 steps (aggregation queries):

1. GeoIntersects on the extent field (uses the index).

2. On the result set from the last step, perform geoIntersects on the original geometry field (this operates on a much smaller set of records than querying the collection directly).

[1] https://www.mongodb.com/docs/manual/tutorial/create-queries-...
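In pymongo terms the two-step query looks roughly like this (the collection and the field names "extent" and "geometry" are hypothetical; $geoIntersects works inside $match without an index, which is what makes step 2 cheap to express):

    # db: a pymongo Database handle (hypothetical).
    # GeoJSON polygon for the query region (illustrative coordinates).
    query_geom = {
        "type": "Polygon",
        "coordinates": [[[151.20, -33.87], [151.21, -33.87],
                         [151.21, -33.86], [151.20, -33.86],
                         [151.20, -33.87]]],
    }

    results = db.properties.aggregate([
        # Step 1: coarse filter on the indexed extent field.
        {"$match": {"extent": {"$geoIntersects": {"$geometry": query_geom}}}},
        # Step 2: exact test on the original geometry, over the much
        # smaller candidate set from step 1.
        {"$match": {"geometry": {"$geoIntersects": {"$geometry": query_geom}}}},
    ])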


Seems exactly like broad phase and narrow phase in a game physics engine.


The same things get invented over and over again and named different things depending on the field. Sometimes it's not immediately clear that they are the same things mathematically.


It's tried and tested for sure. I first encountered them in '94, but I assume they're much older.

A little sleuthing shows that (TIL) they are an application of Z-order curves, which date back to 1906


Hmm... the quad keys end up looking very much like coordinates. In the regular coordinates, the most significant bit divides the space in half, then the next bit divides those spaces in halves etc...

You could get a single "key" by just having a sequence of bits: the most significant bit of the x coordinate, the most significant bit of the y coordinate, the second most significant bit of x, the second most significant bit of y, and so on.
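That interleaving is exactly a Morton code; a naive bit-by-bit sketch:

    def interleave(x, y, bits=16):
        # Build x15 y15 x14 y14 ... x0 y0, most significant pair first.
        key = 0
        for i in reversed(range(bits)):
            key = (key << 1) | ((x >> i) & 1)
            key = (key << 1) | ((y >> i) & 1)
        return key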


Coming from basic binary search, this is very intuitive. Your explanation was well written.


It should be noted that this solution completely fails when the data is anything like real-world location data, which tends to be clustered in cities, stadiums, etc. The strength of quadtrees is precisely that the "Alaska" quadrant can have a single child for that crazy outdoorsman using your gadget, while New York can have a million descendants, down to every house on the street.


I’ve always found Morton encoding to be a cool approach. There’s an old but good blog post about this: https://www.forceflow.be/2013/10/07/morton-encodingdecoding-...

This looks like a more modern implementation: https://crates.io/crates/lindel
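If memory serves, those posts compare a loop like the one upthread against branchless versions; the "magic bits" variant spreads each coordinate's bits apart with masks and shifts, roughly like this (sketch for 16-bit inputs):

    def part1by1(n):
        # Spread the low 16 bits of n to the even bit positions.
        n &= 0x0000FFFF
        n = (n | (n << 8)) & 0x00FF00FF
        n = (n | (n << 4)) & 0x0F0F0F0F
        n = (n | (n << 2)) & 0x33333333
        n = (n | (n << 1)) & 0x55555555
        return n

    def morton2d(x, y):
        # Interleave: x bits in even positions, y bits in odd ones.
        return part1by1(x) | (part1by1(y) << 1)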


I didn't think that would be an obscure data structure, but Locality-sensitive hashing [1] helps in so many cases. Like nearest neighbour search, collision detection, etc.

[1]: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
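One classic family is random-hyperplane hashing for cosine similarity; a tiny sketch (names and parameters made up):

    import random

    def make_hyperplane_hash(dim, n_bits, seed=0):
        rng = random.Random(seed)
        planes = [[rng.gauss(0, 1) for _ in range(dim)]
                  for _ in range(n_bits)]

        def h(v):
            # One bit per hyperplane: which side does v fall on?
            return tuple(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0
                         for p in planes)

        return h

    # Similar vectors agree on most bits, so they tend to share buckets.
    h = make_hyperplane_hash(dim=3, n_bits=8)
    buckets = {}
    for v in [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0)]:
        buckets.setdefault(h(v), []).append(v)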


A neat use of these is the "fast multipole method" for simulating galaxies.

If you want to compute the gravitational force on a given star exactly, you need to figure out the forces from all other stars and add them up. This is O(n^2) and slow. But it turns out you can get a good approximation by dividing up space with a kd-tree and using the average force from each cell to represent the stars underneath it in the tree. This gets you to O(n log n).

https://en.m.wikipedia.org/wiki/Fast_multipole_method
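What the comment describes is closer to Barnes-Hut (full FMM adds multipole expansions), but the cell-averaging idea looks roughly like this 2D sketch (all names and constants made up):

    import math

    THETA = 0.5  # opening angle: smaller = more accurate, slower

    class Cell:
        def __init__(self, cx, cy, half, bodies):
            # bodies: list of (mass, x, y)
            self.half = half
            self.mass = sum(m for m, x, y in bodies)
            self.comx = sum(m * x for m, x, y in bodies) / self.mass
            self.comy = sum(m * y for m, x, y in bodies) / self.mass
            self.children = []
            if len(bodies) > 1 and half > 1e-9:  # guard: coincident points
                for sx in (-1, 1):
                    for sy in (-1, 1):
                        sub = [b for b in bodies
                               if (b[1] >= cx) == (sx > 0)
                               and (b[2] >= cy) == (sy > 0)]
                        if sub:
                            self.children.append(Cell(
                                cx + sx * half / 2, cy + sy * half / 2,
                                half / 2, sub))

        def force_on(self, x, y):
            dx, dy = self.comx - x, self.comy - y
            d = math.hypot(dx, dy)
            if d == 0:
                return (0.0, 0.0)  # query point is this cell's centre of mass
            # Far enough away (or a leaf): treat the whole cell as a
            # single point mass at its centre of mass.
            if not self.children or (2 * self.half) / d < THETA:
                f = self.mass / (d * d)  # G = 1
                return (f * dx / d, f * dy / d)
            fx = fy = 0.0
            for c in self.children:
                cfx, cfy = c.force_on(x, y)
                fx, fy = fx + cfx, fy + cfy
            return (fx, fy)

    bodies = [(1.0, 0.1, 0.2), (2.0, 0.8, 0.9), (1.5, 0.4, 0.6)]
    root = Cell(0.5, 0.5, 0.5, bodies)
    print(root.force_on(0.1, 0.2))  # net force at the first body's position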


Ever heard of geospatial hierarchical indices?

E.g. Uber's H3 index or Google's S2 index.


I have a very optimized octree, but one thing I cannot use it for is points in space. Mine is built with the assumption that every object has a finite size, which is used to determine what level of the tree it lives in. Points fail because they have zero size.

I'm still looking for a good spatial index for points. I'll think about yours.


Are these similar to space-filling curves like z-order indices? https://en.wikipedia.org/wiki/Z-order

Some more such curves: https://web.archive.org/web/20220120151929/https://citeseerx...


A variation of this is the dynamic spatially-hashed voxel grid, where the outer grid is spatially hashed and the grid elements are stored (as needed) as dense voxel grids. The advantage is that you get voxel-grid-like behavior over essentially unbounded space, as long as the data is sparse (otherwise you would need to allocate a significant number of grid elements as dense voxel grids, and the memory savings go away).

These data structures have been used in 3D reconstruction methods like CHISEL where they can often outperform octrees due to better memory access behavior.
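A sketch of the two-layer layout (block size, dtype, and API invented for illustration):

    import numpy as np

    class HashedVoxelGrid:
        BLOCK = 16  # each allocated block is a dense 16^3 voxel array

        def __init__(self, voxel_size=0.05):
            self.voxel_size = voxel_size
            self.blocks = {}  # spatial hash: block coords -> dense array

        def _locate(self, x, y, z):
            v = np.floor(np.array([x, y, z]) / self.voxel_size).astype(int)
            return tuple(v // self.BLOCK), tuple(v % self.BLOCK)

        def set(self, x, y, z, value):
            block, local = self._locate(x, y, z)
            if block not in self.blocks:
                # Dense blocks are allocated only where data lands, so
                # the grid stays cheap over unbounded, sparse space.
                self.blocks[block] = np.zeros((self.BLOCK,) * 3,
                                              dtype=np.float32)
            self.blocks[block][local] = value

        def get(self, x, y, z, default=0.0):
            block, local = self._locate(x, y, z)
            b = self.blocks.get(block)
            return default if b is None else float(b[local])

Within a block, reads and writes are plain dense-array indexing, which is where the memory-access advantage over pointer-heavy octrees comes from.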


Couple some spatial hashing with Morton codes and fast parallel Radix sorting and some binary search derivative and you can do all sorts of fancy queries for collision detection and graphics stuff.


Yeah, just reinvented it a few weeks ago and was like “whoa, that’s it?” It worked perfectly for my fast clustering use case.


> reasonably well behaved

Curious what this means in this context. No hot spots? Not sparse?


Performance degrades when you get many hash collisions, which will happen if you have a lot of points clumped together in space.

Sparsity is fine.

For points you expect to be unevenly distributed the more advanced spatial partitioning data structures (kd tree, BVH) perform better.



