Say that you have data that is identified with points in 2D or 3D space. The standard way that CS students learn in school to represent this is via a quadtree or octree. However, these tree data structures tend to have a lot of "fluff" (needless allocations and pointer chasing) and need a lot of work to be made efficient.
Spatial hashing is the stupidly simple solution of just rounding coordinates to a grid and storing nearby objects together in a hash table. As long as your data is reasonably well behaved then you get many of the advantages of a quadtree, and it's completely trivial to implement.
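Roughly the whole thing in code; the cell size and the 3x3 neighbourhood query below are arbitrary choices:

```python
from collections import defaultdict

CELL = 10.0  # assumed grid cell size; roughly the radius of a typical query

def cell_of(x, y):
    """Snap a point to the integer coordinates of its grid cell."""
    return (int(x // CELL), int(y // CELL))

def build_grid(points):
    """Hash table mapping each occupied cell to the points inside it."""
    grid = defaultdict(list)
    for x, y in points:
        grid[cell_of(x, y)].append((x, y))
    return grid

def nearby(grid, x, y):
    """Everything in the query point's cell and the 8 surrounding cells."""
    cx, cy = cell_of(x, y)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            found.extend(grid.get((cx + dx, cy + dy), ()))
    return found
```

Candidates returned by the neighbourhood lookup still need an exact distance check, but the hash lookup has already thrown away almost everything else.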
A lot of people don't know about the related segment and interval trees. Both let you efficiently find all stored ranges that overlap a given point, and the idea generalizes to higher-dimensional spaces.
For example, if you have a set of calendar invites with a start and stop time (say, tens of millions of them) and you want to find all invites which span 12pm (that is, start before 12 and end after 12), you want an interval tree.
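A minimal sketch of a centered interval tree answering exactly that stabbing query; the node layout and the example times are just illustrative:

```python
import bisect

class IntervalNode:
    def __init__(self, intervals):
        # intervals are (start, end) pairs with start <= end
        points = sorted(p for iv in intervals for p in iv)
        self.center = points[len(points) // 2]
        here  = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        left  = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.by_start = sorted(here)                        # ascending start
        self.by_end   = sorted(here, key=lambda iv: iv[1])  # ascending end
        self.left  = IntervalNode(left)  if left  else None
        self.right = IntervalNode(right) if right else None

    def stab(self, q, out):
        """Collect all intervals containing the point q."""
        if q < self.center:
            # intervals stored here all reach the center, so they contain q
            # exactly when they start at or before q
            i = bisect.bisect_right(self.by_start, (q, float("inf")))
            out.extend(self.by_start[:i])
            if self.left:
                self.left.stab(q, out)
        elif q > self.center:
            # symmetric: they contain q exactly when they end at or after q
            i = bisect.bisect_left([iv[1] for iv in self.by_end], q)
            out.extend(self.by_end[i:])
            if self.right:
                self.right.stab(q, out)
        else:
            out.extend(self.by_start)
        return out

invites = [(9.0, 13.0), (11.5, 12.5), (13.5, 14.0), (8.0, 9.5)]
tree = IntervalNode(invites)
print(tree.stab(12.0, []))   # -> [(9.0, 13.0), (11.5, 12.5)]
```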
Was about to mention this. If I recall correctly, MongoDB's 2d-sphere index rounds geospatial coordinates to 5 decimals. Very occasionally, I found it would distort polygon geometries just enough to cause them to become invalid (e.g. overlapping geometries), which causes the index build to fail.
In my recent experience working with collections containing millions of documents, each containing a GeoJSON-style polygon/multipolygon representing a property (i.e. a block of land), I found invalid geometries occurred for about 1 document in 1 million. For a while I suspected the data vendor was the cause; however, it became more puzzling when other geospatial software confirmed the geometries were valid. Eventually we traced the issue to the 2d-sphere index.
A very clever workaround, inspired by [1], was suggested by a colleague of mine, and it preserved the original geometries. In each document we added a new field containing the geometry's extent (its bounding box). A 2d-sphere index was then built on the extent field instead of the original geometry field. Invalid geometries were no longer an issue, since we were now dealing with much simpler geometries that were substantially larger than the max precision of the index.
When running geoIntersects queries on our collection of millions of documents, we did so in 2 steps (aggregation queries):
1. geoIntersects on the extent field (uses the index).
2. On the result set from the previous step, perform geoIntersects on the original geometry field (this operates on a much smaller set of records compared to querying the collection directly). A rough pymongo sketch is below.
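Roughly what that two-stage pipeline looks like from pymongo. All of the database, collection, and field names below are made up, and as far as I know $geoIntersects (unlike $near) is permitted inside $match, but check against your server version:

```python
from pymongo import MongoClient

# Hypothetical names throughout: database "land", collection "parcels",
# "extent" is the simplified bounding geometry carrying the 2dsphere index,
# "geometry" is the original (un-indexed) polygon/multipolygon.
parcels = MongoClient()["land"]["parcels"]

query_poly = {   # placeholder query polygon (GeoJSON, [lng, lat] order)
    "type": "Polygon",
    "coordinates": [[[150.0, -34.0], [151.0, -34.0], [151.0, -33.0],
                     [150.0, -33.0], [150.0, -34.0]]],
}

cursor = parcels.aggregate([
    # Step 1: coarse filter on the extent field, which can use the 2dsphere index.
    {"$match": {"extent": {"$geoIntersects": {"$geometry": query_poly}}}},
    # Step 2: exact filter on the original geometry, now over far fewer documents.
    {"$match": {"geometry": {"$geoIntersects": {"$geometry": query_poly}}}},
])
```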
The same things get invented over and over again and named different things depending on the field. Sometimes it's not immediately clear that they are the same things mathematically.
Hmm... the quad keys end up looking very much like coordinates. In regular coordinates, the most significant bit divides the space in half, then the next bit divides those halves in half again, etc...
You could get a single "key" by just having a sequence of bits: most significant bit of the x coordinate, most significant bit of the y coordinate, second most significant bit of the x coordinate, second most significant bit of the y coordinate, third....
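That bit interleaving is a Morton (Z-order) key. A naive, non-bit-twiddled sketch, assuming unsigned integer coordinates of a fixed width:

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y, most significant first,
    so the key recursively splits space into quadrants."""
    key = 0
    for i in range(bits - 1, -1, -1):
        key = (key << 1) | ((x >> i) & 1)   # x bit
        key = (key << 1) | ((y >> i) & 1)   # y bit
    return key

print(bin(morton2d(0b0011, 0b0101, bits=4)))  # -> 0b11011, i.e. bit pairs 00 01 10 11
```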
It should be noted that this solution completely fails when the data is anything like real-world location data, which tends to be clustered in cities, stadiums, etc. The strength of quadtrees is precisely the fact that the "Alaska" quadrant can have a single child for that crazy outdoorsman using your gadget, while New York can have a million child subtrees, down to every house on the street.
I didn't think that would be an obscure data structure, but locality-sensitive hashing [1] helps in so many cases, like nearest-neighbour search, collision detection, etc.
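For instance, a random-hyperplane (SimHash-style) scheme for approximate nearest neighbours under cosine similarity; the number of planes and the single-table bucketing here are arbitrary choices, and real setups use several tables to boost recall:

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
DIM, PLANES = 128, 16
planes = rng.normal(size=(PLANES, DIM))  # one random hyperplane per signature bit

def lsh_key(v):
    """16-bit signature: which side of each hyperplane v falls on.
    Vectors with high cosine similarity usually share the key."""
    bits = (planes @ v) > 0
    return sum(1 << i for i, b in enumerate(bits) if b)

# Bucket a dataset, then do exact search only inside the query's bucket.
data = rng.normal(size=(10_000, DIM))
buckets = defaultdict(list)
for idx, v in enumerate(data):
    buckets[lsh_key(v)].append(idx)

query = data[42] + 0.05 * rng.normal(size=DIM)   # a slightly perturbed copy
candidates = buckets[lsh_key(query)]             # very likely contains index 42
```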
A neat use of these is the "fast multipole method" for simulating galaxies.
If you want to compute the gravitational force on a given star exactly, you need to figure out the forces from all other stars and add them up. This is O(n^2) and slow. But it turns out you can get a good approximation by dividing up space with a k-d tree and letting each cell's total mass, placed at its centre of mass, stand in for all the stars underneath it in the tree. This gets you to O(n log n).
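What's being described is essentially the Barnes-Hut flavour of this idea. A toy 2D sketch with a quadtree, per-cell mass aggregation, and an opening-angle test; THETA, G, and the node layout are all arbitrary, and a real code would add softening and skip a star's own contribution:

```python
import math

THETA = 0.5   # opening angle: bigger = faster, less accurate
G = 1.0       # arbitrary units

class Cell:
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half   # square cell: centre + half-width
        self.mass = 0.0
        self.mx = self.my = 0.0                      # mass-weighted coordinate sums
        self.body = None                             # single (x, y, m) if leaf
        self.kids = None

    def insert(self, x, y, m):
        if self.mass == 0.0 and self.kids is None:
            self.body = (x, y, m)                    # empty leaf: just keep the body
        else:
            if self.kids is None:                    # occupied leaf: split it
                # (coincident points would recurse forever; a real code caps the depth)
                self.kids = {}
                bx, by, bm = self.body
                self.body = None
                self._child(bx, by).insert(bx, by, bm)
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y

    def _child(self, x, y):
        q = (x >= self.cx, y >= self.cy)
        if q not in self.kids:
            h = self.half / 2
            self.kids[q] = Cell(self.cx + (h if q[0] else -h),
                                self.cy + (h if q[1] else -h), h)
        return self.kids[q]

    def accel(self, x, y):
        """Approximate gravitational acceleration at (x, y)."""
        dx, dy = self.mx / self.mass - x, self.my / self.mass - y
        r = math.hypot(dx, dy) + 1e-12
        # Cell far enough away (or a single body): treat all its stars as one
        # point mass at the cell's centre of mass.
        if self.kids is None or (2 * self.half) / r < THETA:
            a = G * self.mass / (r * r)
            return a * dx / r, a * dy / r
        ax = ay = 0.0
        for kid in self.kids.values():
            kax, kay = kid.accel(x, y)
            ax, ay = ax + kax, ay + kay
        return ax, ay

# Build over a bounding square, then query:
root = Cell(0.0, 0.0, half=100.0)
for x, y, m in [(1.0, 2.0, 5.0), (-30.0, 40.0, 1.0), (55.0, -20.0, 2.0)]:
    root.insert(x, y, m)
print(root.accel(10.0, 10.0))
```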
I have a very optimized octree, but one thing I cannot use it for is points in space. Mine is built with the assumption that every object has a finite size, which is used to determine what level of the tree it lives in. Points fail because they have zero size.
I'm still looking for a good spatial index for points. I'll think about yours.
A variation of this is the dynamic spatially-hashed voxel grid, where the outer grid is spatially hashed and the grid elements are stored (as needed) as dense voxel grids. The advantage is that you get voxel-grid-like behavior over an essentially unbounded extent, so long as the data is sparse (otherwise you would need to allocate a significant number of grid elements as dense voxel grids and the memory savings go away).
These data structures have been used in 3D reconstruction methods like CHISEL, where they can often outperform octrees due to better memory access behavior.
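A minimal sketch of the two-level layout: an outer hash map keyed by coarse block coordinates, with dense blocks allocated only when first written. The block size, voxel resolution, and stored value type are arbitrary here:

```python
import numpy as np

BLOCK = 16     # voxels per block edge (assumed)
VOXEL = 0.05   # metres per voxel (assumed)

class HashedVoxelGrid:
    """Outer level: hash map keyed by block coordinates.
    Inner level: dense BLOCK^3 arrays, allocated lazily."""
    def __init__(self):
        self.blocks = {}

    def _locate(self, x, y, z):
        v = np.floor(np.array([x, y, z]) / VOXEL).astype(int)
        block = tuple(v // BLOCK)   # which dense block
        local = tuple(v % BLOCK)    # index inside that block
        return block, local

    def set(self, x, y, z, value):
        block, local = self._locate(x, y, z)
        if block not in self.blocks:
            self.blocks[block] = np.zeros((BLOCK, BLOCK, BLOCK), dtype=np.float32)
        self.blocks[block][local] = value

    def get(self, x, y, z, default=0.0):
        block, local = self._locate(x, y, z)
        b = self.blocks.get(block)
        return default if b is None else float(b[local])

grid = HashedVoxelGrid()
grid.set(1.23, -4.56, 0.78, value=0.9)   # e.g. a TSDF or occupancy value
print(grid.get(1.23, -4.56, 0.78), len(grid.blocks))
```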
Couple spatial hashing with Morton codes, fast parallel radix sorting, and some binary-search derivative, and you can do all sorts of fancy queries for collision detection and graphics stuff.
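For example, once objects are sorted by a Morton cell key, every cell's contents sit in one contiguous run you can find with two binary searches. Cell size and key width below are arbitrary, and non-negative cell coordinates are assumed:

```python
import bisect

CELL = 1.0  # assumed cell size

def cell_key(x, y, bits=16):
    """Morton key of the grid cell containing (x, y)."""
    cx, cy = int(x // CELL), int(y // CELL)
    key = 0
    for i in range(bits - 1, -1, -1):
        key = (key << 2) | (((cx >> i) & 1) << 1) | ((cy >> i) & 1)
    return key

def build(items):
    """items: list of ((x, y), payload). Sorting by key is the step the fancy
    versions hand to a parallel radix sort on the GPU."""
    items = sorted(items, key=lambda it: cell_key(*it[0]))
    keys = [cell_key(*it[0]) for it in items]
    return keys, items

def in_cell(keys, items, x, y):
    """All payloads whose position falls in the same cell as (x, y)."""
    k = cell_key(x, y)
    lo, hi = bisect.bisect_left(keys, k), bisect.bisect_right(keys, k)
    return [payload for _, payload in items[lo:hi]]
```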