
If you care about performance, it's important to know the underlying representation of your dictionary data structure.

For instance, std::unordered_map in C++ is based around linked lists of key/value buckets. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast. Retrieving a key is O(log n), but a very slow O(log n), because of the cache misses.

Another approach is to keep a sorted vector of keys and a coindexed vector of values. Loki::AssocVector and boost::container::flat_map do this, for instance. Now insertion is slow, but iteration and retrieval are very fast (a fast O(log n) via binary search, with few cache misses). It's also memory efficient, since there are no pointers between elements.

If you have one big dictionary you would use throughout an application, and know the data up front, a good strategy is to reserve the necessary memory in an array, fill it with keys/values, sort it once by key, and apply the same permutation to the value array so the two stay coindexed. Now you have a memory efficient and extremely fast dictionary data structure.
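A minimal sketch of that build-once strategy, assuming string keys and int values (the struct and names are illustrative, not from any library):

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Fill the two coindexed arrays, then sort a permutation of indices by
// key and apply it to both arrays. After build(), lookups use binary
// search on the contiguous key array.
struct FrozenDict {
    std::vector<std::string> keys;   // sorted after build()
    std::vector<int> values;         // coindexed with keys

    void build() {
        std::vector<size_t> order(keys.size());
        std::iota(order.begin(), order.end(), size_t{0});
        std::sort(order.begin(), order.end(),
                  [&](size_t a, size_t b) { return keys[a] < keys[b]; });
        std::vector<std::string> k2;
        std::vector<int> v2;
        k2.reserve(keys.size());
        v2.reserve(values.size());
        for (size_t i : order) {
            k2.push_back(std::move(keys[i]));
            v2.push_back(values[i]);
        }
        keys = std::move(k2);
        values = std::move(v2);
    }

    // Binary search over the sorted key array; the value shares the index.
    const int* find(const std::string& key) const {
        auto it = std::lower_bound(keys.begin(), keys.end(), key);
        if (it == keys.end() || *it != key) return nullptr;
        return &values[it - keys.begin()];
    }
};
```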

One other strategy any intermediate coder can implement is to keep two unsorted coindexed arrays. You don't even need a hash function for this. Now iteration and insertion are extremely fast, and it is memory efficient, but finding a key is a (fast) O(n) linear scan. So this is good for smaller tables. In C++ you could implement it as a std::pair<vector<key>, vector<value>>. If you need a quick small map in a function, this is often the fastest data structure you can implement without too many headaches.
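The two-unsorted-arrays idea fits in a few lines; a sketch under the same assumptions (names are illustrative):

```cpp
#include <string>
#include <utility>
#include <vector>

// Two coindexed arrays: push_back to insert, linear scan to find.
// No hashing, no sorting -- fine for small tables.
template <typename K, typename V>
struct TinyMap {
    std::vector<K> keys;
    std::vector<V> values;

    void insert(K k, V v) {            // O(1) amortized
        keys.push_back(std::move(k));
        values.push_back(std::move(v));
    }

    V* find(const K& k) {              // O(n), but cache-friendly
        for (size_t i = 0; i < keys.size(); ++i)
            if (keys[i] == k) return &values[i];
        return nullptr;
    }
};
```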



Chandler Carruth talks about unordered_map performance[1] in his CppCon 2014 talk "Efficiency with Algorithms, Performance with Data Structures". He endorses most of the other alternatives you mentioned.

[1] https://youtu.be/fHNmRkzxHWs?t=46m39s


> sorted vector of keys . . . reserve necessary memory and sort it . . . unsorted coindexed arrays . . .

Most of the things you mentioned are not hash tables, but members of a parent concept, dictionaries. Hash tables all by definition involve some sort of hashing of the key. The two main categories of hash table are chained hash tables (std::unordered_map does this, at least in the implementations I'm aware of) and open addressed hash tables, which use probing instead of secondary data structures to resolve conflicts.
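For contrast with the chained layout, open addressing can be sketched as follows. This is a toy fixed-capacity table with linear probing, not a production design; all names are illustrative:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Open addressing: a collision walks to the next slot in the same array
// instead of following a per-bucket linked list. No resizing here, so
// the capacity must stay well above the element count.
struct ProbedMap {
    struct Slot { std::string key; int value = 0; bool used = false; };
    std::vector<Slot> slots = std::vector<Slot>(64);

    size_t probe(const std::string& key) const {
        size_t i = std::hash<std::string>{}(key) % slots.size();
        while (slots[i].used && slots[i].key != key)
            i = (i + 1) % slots.size();  // linear probing
        return i;
    }

    void insert(const std::string& key, int value) {
        Slot& s = slots[probe(key)];
        s.key = key; s.value = value; s.used = true;
    }

    std::optional<int> find(const std::string& key) const {
        const Slot& s = slots[probe(key)];
        if (s.used) return s.value;
        return std::nullopt;
    }
};
```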


You can implement the sorted vector of keys with a hash function instead of a coindexed vector of values. It still keeps most of the properties we like, especially if doing the coindex-sort operation is too expensive for some reason.


> You can implement the sorted vector of keys with a hash function instead of a coindexed vector of values

So you're saying you're going to hash the keys, then sort them according to the hash, with tie breaking on the key itself? I'm not aware of any sorted table that does this, but I'm sure some exist. I suppose you'd get something of a win if N was large, and the keys had long common prefixes, and you didn't care about the ordering property.

But in that case you'd probably use an actual hash table, not the algorithm you just described. Unless there's something I'm missing.


Sorry, I misspoke. Lookup is O(1) on average in a hash table. But this could be a "weak dict", that is, you don't actually store the keys; as long as you can reference a key through its hash, you can look it up.

In the proper "data structure", we usually store the key and the value -- iteration through keys and/or values and/or pairs are probably supported operations.

Finding a key in the structure (either with binary tree search, or binary search on a sorted array, or linear lookup on an array) varies. So does iteration and most other operations.


It's an interesting fiction that we tell everyone that hash tables are O(1), when in reality hash operations are typically O(n) in the key size, which is O(log n) in the size of the key space. It's just much cheaper to have your O(log n) operations be reading sequential bits, rather than following pointers, reading whole keys, etc.

Probably not something people usually run into, but it does show that constants matter.
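The point that the "O(1)" hash itself reads every byte of the key can be seen in any byte-at-a-time hash; here is FNV-1a as one concrete example:

```cpp
#include <cstdint>
#include <string>

// FNV-1a does O(len) work per key: one xor and one multiply per byte.
// It is "cheap" not because it skips the key, but because the work is
// sequential and branch-free rather than pointer-chasing.
uint64_t fnv1a(const std::string& key) {
    uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ull;             // FNV prime
    }
    return h;
}
```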


> a good strategy is to reserve the necessary memory in an array, fill it with keys/values, sort it once over the keys and coindex the value array

This explanation confuses me. Is there one or two arrays? What does "coindex" mean?


Equal indices refer to the same entry in all places. I.e. two arrays, one for keys, one for values: the key at index n in the key array has its value at the same index n in the values array.


You mean insertion into a bucket when there's a collision? And your n is the number of entries in the list? But n should always be very small relative to the total number of items in the hash map such that overall insertion and lookups are O(1).


I mean insertion into the dictionary in general.


I think you might be a little confused. Even in hash tables with chaining, one does not tend to spend much time traversing linked lists, because the typical bucket will have only one member. This depends on the load factor of the table, but most practical implementations will eventually rebucket and rehash if their load factor grows too high. "Getting the key loaded" -- i.e. finding the memory location that contains the key -- is O(1) on average in all practical hash tables. It does not typically require any traversal at all.

You keep on talking about ordered tables like red black trees, as in this comment, which is another sign that makes me wonder if you might be confused.


The early versions of String.hashCode in Java only looked at the first 16 characters of the string - when used for things like URLs this led to rather poor performance!

https://en.wikipedia.org/wiki/Java_hashCode()#The_java.lang....
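The failure mode is easy to reproduce in C++ with a deliberately bad, prefix-only hash (this is an illustration of the pitfall, not the actual Java code):

```cpp
#include <algorithm>
#include <functional>
#include <string>

// A hash that only reads a fixed-length prefix maps every key sharing
// that prefix to the same value -- e.g. URLs under one domain all
// collide, degrading lookups to a linear scan of one bucket's chain.
size_t prefix_hash(const std::string& s) {
    return std::hash<std::string>{}(
        s.substr(0, std::min<size_t>(16, s.size())));
}
```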


Even if the typical bucket has 100 members, as long as this number is constant and does not go up with the size of the hash table, the performance is still O(1). And the same applies for cache misses. All these things don't really matter except if you are using hash tables with not very many elements in them.


> Even if the typical bucket has 100 members, as long as this number is constant and does not go up with the size of the hash table, the performance is still O(1).

If you mean in a chained hash table, where you chase up to 100 pointers to get the value, the performance is atrocious.

Friendly reminder: traversal of a linked list and a contiguous array are both O(n). In the real world one is two orders of magnitude faster than the other.

> All these things don't really matter except if you are using hash tables with not very many elements in them.

The lookup of a value given a key is probably the least affected operation. If all you care about is a "weak dictionary" (you only really need to store values and lookup from keys), all of this is mostly jabber. If you store the keys to iterate through, or access for some reason, all those things start to matter a whole lot.


> For instance, std::unordered_map in C++ is based around linked lists of key/value buckets. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast.

Has this changed recently? Last time I used unordered_map, insertion was also slow because it had to allocate a linked list entry for every item in the hash table.


No, it's effectively required by the standard: the pointer/reference stability guarantees force a node-based (chained) implementation, so standard library implementers can't do it differently.

Here are a few alternatives:

MCT's closed_hash_map

Google Sparsehash's sparse_hash_map and dense_hash_map

Loki::AssocVector

boost::container::flat_map



