
Spatial hashing.

Say that you have data that is identified with points in 2D or 3D space. The standard way that CS students learn in school to represent this is via a quadtree or octree. However, these tree data structures tend to have a lot of "fluff" (needless allocations and pointer chasing) and need a lot of work to be made efficient.

Spatial hashing is the stupidly simple solution of just rounding coordinates to a grid and storing nearby objects together in a hash table. As long as your data is reasonably well behaved then you get many of the advantages of a quadtree, and it's completely trivial to implement.
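A minimal sketch in Python (cell size, names, and the neighbour scan are illustrative, not a canonical implementation):

    from collections import defaultdict

    class SpatialHash:
        def __init__(self, cell_size=1.0):
            self.cell_size = cell_size
            self.cells = defaultdict(list)  # cell coords -> objects

        def _cell(self, x, y):
            # Round coordinates down to the containing grid cell.
            return (int(x // self.cell_size), int(y // self.cell_size))

        def insert(self, x, y, obj):
            self.cells[self._cell(x, y)].append((x, y, obj))

        def query_near(self, x, y):
            # Check the point's cell and its 8 neighbours; candidates
            # still need an exact distance test afterwards.
            cx, cy = self._cell(x, y)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    yield from self.cells.get((cx + dx, cy + dy), ())

Nearby-object queries scan the point's cell plus its eight neighbours, then do an exact test on the few candidates that come back.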



TBH even quadkeys are a fun answer to the OP's question; many people aren't aware of them.

Simple explanation:

If you have data with x y coordinates and you know the bounds.

To compute the quad key for a point:

1. The key starts as the empty string. (I've also seen it start with "Z" to handle points outside the bounds)

2. Divide the space into 4 quadrants

3. Determine which quadrant the point falls in, then append a letter (A-D, depending on the quadrant) to the key

4. Repeat step 3 using that quadrant's bounds (i.e. a recursively smaller area) until you have the desired accuracy

This can then be used to efficiently find points within rectangular bounds.
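A rough Python sketch of the steps above (the depth and letter assignment are arbitrary choices):

    def quad_key(x, y, bounds, depth=8):
        # bounds = (min_x, min_y, max_x, max_y); assumes the point is inside.
        min_x, min_y, max_x, max_y = bounds
        key = ""
        for _ in range(depth):
            mid_x = (min_x + max_x) / 2
            mid_y = (min_y + max_y) / 2
            if y < mid_y:
                if x < mid_x:
                    key += "A"; max_x, max_y = mid_x, mid_y
                else:
                    key += "B"; min_x, max_y = mid_x, mid_y
            else:
                if x < mid_x:
                    key += "C"; max_x, min_y = mid_x, mid_y
                else:
                    key += "D"; min_x, min_y = mid_x, mid_y
        return key

Because nearby points share key prefixes, a rectangular range query reduces to a handful of prefix scans over sorted keys.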


A lot of people don't know about the related segment and interval trees. Both let you efficiently find all stored intervals that overlap a given point, and the idea generalizes to higher-dimensional spaces.

For example if you have a set of calendar invites with a start and stop time (say tens of millions of them) and you want to find all invites which span 12pm (that is start before 12 and end after 12) you want an interval tree.

https://en.wikipedia.org/wiki/Interval_tree
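For a sense of how the stabbing query works, here is a minimal centered-interval-tree sketch in Python (illustrative only, not a production structure):

    class IntervalNode:
        def __init__(self, intervals):
            # intervals: list of (start, end) pairs with start <= end
            starts = sorted(s for s, e in intervals)
            self.center = starts[len(starts) // 2]
            crossing = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
            left = [iv for iv in intervals if iv[1] < self.center]
            right = [iv for iv in intervals if iv[0] > self.center]
            # Crossing intervals kept twice: sorted by start and by end.
            self.by_start = sorted(crossing)
            self.by_end = sorted(crossing, key=lambda iv: iv[1])
            self.left = IntervalNode(left) if left else None
            self.right = IntervalNode(right) if right else None

        def stab(self, point):
            # Yield all intervals containing `point`.
            if point < self.center:
                for s, e in self.by_start:
                    if s > point:
                        break
                    yield (s, e)
                if self.left:
                    yield from self.left.stab(point)
            else:
                for s, e in reversed(self.by_end):
                    if e < point:
                        break
                    yield (s, e)
                if self.right:
                    yield from self.right.stab(point)

    # Example: which invites span 12?
    tree = IntervalNode([(9, 11), (10, 14), (11.5, 13)])
    print(list(tree.stab(12)))  # [(10, 14), (11.5, 13)]

Each node stores the intervals crossing its center twice, sorted by start and by end, so a stab only scans entries that actually match.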


That's how MongoDB geospatial indexes work, IIRC.


Was about to mention this. If I recall correctly, the 2d-sphere index rounds geospatial coordinates to 5 decimals. Very occasionally, I found it would distort polygon geometries just enough to cause them to become invalid (e.g. overlapping geometries), which causes the index build to fail.

In my recent experience working with collections containing millions of documents, each with a GeoJSON-style polygon/multipolygon representing a property (i.e. a block of land), I found invalid geometries occurred in about 1 document in 1 million. For a while I suspected the data vendor was the cause, but it became more puzzling when other geospatial software confirmed the geometries were valid. Eventually we traced the issue to the 2d-sphere index.

A very clever workaround was suggested by a colleague of mine, inspired by [1]. It preserved the original geometries. In each document, we added a new field containing the geometry's extent. A 2d-sphere index was then built on the extent field instead of the original geometry field. Invalid geometries were no longer an issue since we were dealing with much simpler geometries that were substantially larger than the max precision of the index.

When running geoIntersects queries on our collection of millions of documents, we did so in 2 steps (aggregation queries):

1. GeoIntersects on the extent field (uses the index).

2. On the result set from the last step, perform geoIntersects on the original geometry field (this operates on a much smaller set of records than querying the collection directly).

[1] https://www.mongodb.com/docs/manual/tutorial/create-queries-...
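In pymongo terms the two-step query looks roughly like this (the collection and the field names "extent" and "geometry" are hypothetical; $geoIntersects works inside $match without an index, which is what makes step 2 cheap to express):

    # db: a pymongo Database handle (hypothetical).
    # GeoJSON polygon for the query region (illustrative coordinates).
    query_geom = {
        "type": "Polygon",
        "coordinates": [[[151.20, -33.87], [151.21, -33.87],
                         [151.21, -33.86], [151.20, -33.86],
                         [151.20, -33.87]]],
    }

    results = db.properties.aggregate([
        # Step 1: coarse filter on the indexed extent field.
        {"$match": {"extent": {"$geoIntersects": {"$geometry": query_geom}}}},
        # Step 2: exact test on the original geometry, over the much
        # smaller candidate set from step 1.
        {"$match": {"geometry": {"$geoIntersects": {"$geometry": query_geom}}}},
    ])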


Seems exactly like broad phase and narrow phase in a game physics engine.


The same things get invented over and over again and named different things depending on the field. Sometimes it's not immediately clear that they are the same things mathematically.


It's tried and tested for sure. I first encountered them in '94, but I assume they're much older.

A little sleuthing shows that (TIL) they are an application of Z-order curves, which date back to 1906


Hmm... the quad keys end up looking very much like coordinates. In the regular coordinates, the most significant bit divides the space in half, then the next bit divides those spaces in halves etc...

You could get a single "key" by just having a sequence of bits: the most significant bit of the x coordinate, the most significant bit of the y coordinate, the second most significant bit of x, the second most significant bit of y, and so on.
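That interleaving is exactly a Morton code; a naive bit-by-bit sketch:

    def interleave(x, y, bits=16):
        # Build x15 y15 x14 y14 ... x0 y0, most significant pair first.
        key = 0
        for i in reversed(range(bits)):
            key = (key << 1) | ((x >> i) & 1)
            key = (key << 1) | ((y >> i) & 1)
        return key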


Coming from basic binary search, this is very intuitive. Your explanation was well written.


It should be noted that this solution completely fails when the data is anything like real-world location data, which tends to be clustered in cities, stadiums, etc. The strength of quadtrees is precisely that the "Alaska" quadrant can have a single child for that crazy outdoorsman using your gadget, while New York can have a million descendants, down to every house on the street.


I’ve always found Morton encoding to be a cool approach. There’s an old but good blog post about this: https://www.forceflow.be/2013/10/07/morton-encodingdecoding-...

This looks like a more modern implementation: https://crates.io/crates/lindel
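If memory serves, those posts compare a loop like the one upthread against branchless versions; the "magic bits" variant spreads each coordinate's bits apart with masks and shifts, roughly like this (sketch for 16-bit inputs):

    def part1by1(n):
        # Spread the low 16 bits of n to the even bit positions.
        n &= 0x0000FFFF
        n = (n | (n << 8)) & 0x00FF00FF
        n = (n | (n << 4)) & 0x0F0F0F0F
        n = (n | (n << 2)) & 0x33333333
        n = (n | (n << 1)) & 0x55555555
        return n

    def morton2d(x, y):
        # Interleave: x bits in even positions, y bits in odd ones.
        return part1by1(x) | (part1by1(y) << 1)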


I didn't think that would be an obscure data structure, but Locality-sensitive hashing [1] helps in so many cases. Like nearest neighbour search, collision detection, etc.

[1]: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
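One classic family is random-hyperplane hashing for cosine similarity; a tiny sketch (names and parameters made up):

    import random

    def make_hyperplane_hash(dim, n_bits, seed=0):
        rng = random.Random(seed)
        planes = [[rng.gauss(0, 1) for _ in range(dim)]
                  for _ in range(n_bits)]

        def h(v):
            # One bit per hyperplane: which side does v fall on?
            return tuple(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0
                         for p in planes)

        return h

    # Similar vectors agree on most bits, so they tend to share buckets.
    h = make_hyperplane_hash(dim=3, n_bits=8)
    buckets = {}
    for v in [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0)]:
        buckets.setdefault(h(v), []).append(v)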


A neat use of these is the "fast multipole method" for simulating galaxies.

If you want to compute the gravitational force on a given star exactly, you need to figure out the forces from all other stars and add them up. This is O(n^2) and slow. But it turns out you can get a good approximation by dividing up space with a kd-tree and using the average force from each cell to represent the stars underneath it in the tree. This gets you to O(n log n).

https://en.m.wikipedia.org/wiki/Fast_multipole_method
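What the comment describes is closer to Barnes-Hut (full FMM adds multipole expansions), but the cell-averaging idea looks roughly like this 2D sketch (all names and constants made up):

    import math

    THETA = 0.5  # opening angle: smaller = more accurate, slower

    class Cell:
        def __init__(self, cx, cy, half, bodies):
            # bodies: list of (mass, x, y)
            self.half = half
            self.mass = sum(m for m, x, y in bodies)
            self.comx = sum(m * x for m, x, y in bodies) / self.mass
            self.comy = sum(m * y for m, x, y in bodies) / self.mass
            self.children = []
            if len(bodies) > 1 and half > 1e-9:  # guard: coincident points
                for sx in (-1, 1):
                    for sy in (-1, 1):
                        sub = [b for b in bodies
                               if (b[1] >= cx) == (sx > 0)
                               and (b[2] >= cy) == (sy > 0)]
                        if sub:
                            self.children.append(Cell(
                                cx + sx * half / 2, cy + sy * half / 2,
                                half / 2, sub))

        def force_on(self, x, y):
            dx, dy = self.comx - x, self.comy - y
            d = math.hypot(dx, dy)
            if d == 0:
                return (0.0, 0.0)  # query point is this cell's centre of mass
            # Far enough away (or a leaf): treat the whole cell as a
            # single point mass at its centre of mass.
            if not self.children or (2 * self.half) / d < THETA:
                f = self.mass / (d * d)  # G = 1
                return (f * dx / d, f * dy / d)
            fx = fy = 0.0
            for c in self.children:
                cfx, cfy = c.force_on(x, y)
                fx, fy = fx + cfx, fy + cfy
            return (fx, fy)

    bodies = [(1.0, 0.1, 0.2), (2.0, 0.8, 0.9), (1.5, 0.4, 0.6)]
    root = Cell(0.5, 0.5, 0.5, bodies)
    print(root.force_on(0.1, 0.2))  # net force at the first body's position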


Ever heard of geospatial hierarchical indices?

E.g. Uber's H3 index or Google's S2 index.


I have a very optimized octree, but one thing I cannot use it for is points in space. Mine is built with the assumption that every object has a finite size, which is used to determine what level of the tree it lives in. Points fail because they have zero size.

I'm still looking for a good spatial index for points. I'll think about yours.


Are these similar to space-filling curves like z-order indices? https://en.wikipedia.org/wiki/Z-order

Some more such curves: https://web.archive.org/web/20220120151929/https://citeseerx...


A variation of this is the dynamic spatially-hashed voxel grid, where the outer grid is spatially hashed and the grid elements are stored (as needed) as dense voxel grids. The advantage is that you get voxel-grid-like behavior over essentially unbounded space, as long as the data is sparse (otherwise you would need to allocate a significant number of grid elements as dense voxel grids, and the memory savings go away).

These data structures have been used in 3D reconstruction methods like CHISEL where they can often outperform octrees due to better memory access behavior.
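A sketch of the two-layer layout (block size, dtype, and API invented for illustration):

    import numpy as np

    class HashedVoxelGrid:
        BLOCK = 16  # each allocated block is a dense 16^3 voxel array

        def __init__(self, voxel_size=0.05):
            self.voxel_size = voxel_size
            self.blocks = {}  # spatial hash: block coords -> dense array

        def _locate(self, x, y, z):
            v = np.floor(np.array([x, y, z]) / self.voxel_size).astype(int)
            return tuple(v // self.BLOCK), tuple(v % self.BLOCK)

        def set(self, x, y, z, value):
            block, local = self._locate(x, y, z)
            if block not in self.blocks:
                # Dense blocks are allocated only where data lands, so
                # the grid stays cheap over unbounded, sparse space.
                self.blocks[block] = np.zeros((self.BLOCK,) * 3,
                                              dtype=np.float32)
            self.blocks[block][local] = value

        def get(self, x, y, z, default=0.0):
            block, local = self._locate(x, y, z)
            b = self.blocks.get(block)
            return default if b is None else float(b[local])

Within a block, reads and writes are plain dense-array indexing, which is where the memory-access advantage over pointer-heavy octrees comes from.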


Couple some spatial hashing with Morton codes and fast parallel Radix sorting and some binary search derivative and you can do all sorts of fancy queries for collision detection and graphics stuff.


Yeah, just reinvented it a few weeks ago and was like “whoa, that’s it?” It worked perfectly for my fast clustering use case.


> reasonably well behaved

Curious what this means in this context. No hot spots? Not sparse?


Performance degrades when you get many hash collisions, which will happen if you have a lot of points clumped together in space.

Sparsity is fine.

For points you expect to be unevenly distributed the more advanced spatial partitioning data structures (kd tree, BVH) perform better.



