
If you care about performance, it's important to know the underlying representation of your dictionary data structure.

For instance, std::unordered_map in C++ is based around linked lists of key/value buckets. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast. Retrieving a key is O(log n), but a very slow O(log n), because of the cache misses.

Another approach is to keep a sorted vector of keys and a coindexed vector of values. Loki::AssocVector and boost::container::flat_map do this, for instance. Now insertion is slow, but iteration and retrieval are very fast (a fast O(log n) via binary search, with few cache misses). It's also memory efficient, since there are no pointers between elements.

If you have one big dictionary you would use throughout an application, and know the data up front, a good strategy is to reserve the necessary memory in an array, fill it with keys/values, sort it once by key, and apply the same permutation to the value array so the two stay coindexed. Now you have a memory efficient and extremely fast dictionary data structure.
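A minimal sketch of that build-once strategy, assuming string keys and int values (the struct and names are illustrative, not from any library):

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Fill the two coindexed arrays, then sort a permutation of indices by
// key and apply it to both arrays. After build(), lookups use binary
// search on the contiguous key array.
struct FrozenDict {
    std::vector<std::string> keys;   // sorted after build()
    std::vector<int> values;         // coindexed with keys

    void build() {
        std::vector<size_t> order(keys.size());
        std::iota(order.begin(), order.end(), size_t{0});
        std::sort(order.begin(), order.end(),
                  [&](size_t a, size_t b) { return keys[a] < keys[b]; });
        std::vector<std::string> k2;
        std::vector<int> v2;
        k2.reserve(keys.size());
        v2.reserve(values.size());
        for (size_t i : order) {
            k2.push_back(std::move(keys[i]));
            v2.push_back(values[i]);
        }
        keys = std::move(k2);
        values = std::move(v2);
    }

    // Binary search over the sorted key array; the value shares the index.
    const int* find(const std::string& key) const {
        auto it = std::lower_bound(keys.begin(), keys.end(), key);
        if (it == keys.end() || *it != key) return nullptr;
        return &values[it - keys.begin()];
    }
};
```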

One other strategy any intermediate coder can implement is to keep two unsorted coindexed arrays. You don't even need a hash function for this. Now iteration and insertion are extremely fast, and it is memory efficient, but finding a key is a (fast) O(n) linear scan. So this is good for smaller tables. In C++ you could implement it as a std::pair<vector<key>, vector<value>>. If you need a quick small map in a function, this is often the fastest data structure you can implement without too many headaches.
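The two-unsorted-arrays idea fits in a few lines; a sketch under the same assumptions (names are illustrative):

```cpp
#include <string>
#include <utility>
#include <vector>

// Two coindexed arrays: push_back to insert, linear scan to find.
// No hashing, no sorting -- fine for small tables.
template <typename K, typename V>
struct TinyMap {
    std::vector<K> keys;
    std::vector<V> values;

    void insert(K k, V v) {            // O(1) amortized
        keys.push_back(std::move(k));
        values.push_back(std::move(v));
    }

    V* find(const K& k) {              // O(n), but cache-friendly
        for (size_t i = 0; i < keys.size(); ++i)
            if (keys[i] == k) return &values[i];
        return nullptr;
    }
};
```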



Chandler Carruth talks about unordered_map performance[1] in his CppCon 2014 talk "Efficiency with Algorithms, Performance with Data Structures". He endorses most of the other alternatives you mentioned.

[1] https://youtu.be/fHNmRkzxHWs?t=46m39s


> sorted vector of keys . . . reserve necessary memory and sort it . . . unsorted coindexed arrays . . .

Most of the things you mentioned are not hash tables, but members of a parent concept, dictionaries. Hash tables all by definition involve some sort of hashing of the key. The two main categories of hash table are chained hash tables (std::unordered_map does this, at least in the implementations I'm aware of) and open addressed hash tables, which use probing instead of secondary data structures to resolve conflicts.
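For contrast with the chained layout, open addressing can be sketched as follows. This is a toy fixed-capacity table with linear probing, not a production design; all names are illustrative:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Open addressing: a collision walks to the next slot in the same array
// instead of following a per-bucket linked list. No resizing here, so
// the capacity must stay well above the element count.
struct ProbedMap {
    struct Slot { std::string key; int value = 0; bool used = false; };
    std::vector<Slot> slots = std::vector<Slot>(64);

    size_t probe(const std::string& key) const {
        size_t i = std::hash<std::string>{}(key) % slots.size();
        while (slots[i].used && slots[i].key != key)
            i = (i + 1) % slots.size();  // linear probing
        return i;
    }

    void insert(const std::string& key, int value) {
        Slot& s = slots[probe(key)];
        s.key = key; s.value = value; s.used = true;
    }

    std::optional<int> find(const std::string& key) const {
        const Slot& s = slots[probe(key)];
        if (s.used) return s.value;
        return std::nullopt;
    }
};
```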


You can implement the sorted vector of keys with a hash function instead of a coindexed vector of values. It still keeps most of the properties we like, especially if doing the coindex-sort operation is too expensive for some reason.


> You can implement the sorted vector of keys with a hash function instead of a coindexed vector of values

So you're saying you're going to hash the keys, then sort them according to the hash, with tie breaking on the key itself? I'm not aware of any sorted table that does this, but I'm sure some exist. I suppose you'd get something of a win if N was large, and the keys had long common prefixes, and you didn't care about the ordering property.

But in that case you'd probably use an actual hash table, not the algorithm you just described. Unless there's something I'm missing.


Sorry, I misspoke. Lookup is O(1) on average in a hash table. But this could be a "weak dict", that is, you don't actually store the keys; as long as you can reference a key through its hash, you can look it up.

In the proper "data structure", we usually store the key and the value -- iteration through keys and/or values and/or pairs are probably supported operations.

Finding a key in the structure (either with binary tree search, or binary search on a sorted array, or linear lookup on an array) varies. So does iteration and most other operations.


It's an interesting fiction that we tell everyone that hash tables are O(1), when in reality hash operations are typically O(n) in the key size, which is O(log n) in the size of the key space. It's just much cheaper to have your O(log n) operations be reading sequential bits, rather than following pointers, reading whole keys, etc.

Probably not something people usually run into, but it does show that constants matter.
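The point that the "O(1)" hash itself reads every byte of the key can be seen in any byte-at-a-time hash; here is FNV-1a as one concrete example:

```cpp
#include <cstdint>
#include <string>

// FNV-1a does O(len) work per key: one xor and one multiply per byte.
// It is "cheap" not because it skips the key, but because the work is
// sequential and branch-free rather than pointer-chasing.
uint64_t fnv1a(const std::string& key) {
    uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ull;             // FNV prime
    }
    return h;
}
```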


> a good strategy is to reserve the necessary memory in an array, fill it with keys/values, sort it once over the keys and coindex the value array

This explanation confuses me. Is there one or two arrays? What does "coindex" mean?


Equal indices refer to the same entry in all places. I.e. two arrays, one for keys, one for values: the key at index n in the key array has its value at the same index n in the values array.


You mean insertion into a bucket when there's a collision? And your n is the number of entries in the list? But n should always be very small relative to the total number of items in the hash map such that overall insertion and lookups are O(1).


I mean insertion into the dictionary in general.


I think you might be a little confused. Even in hash tables with chaining, one does not tend to spend much time traversing linked lists, because the typical bucket will have only one member. This depends on the load factor of the table, but most practical implementations will eventually rebucket and rehash if their load factor grows too high. "Getting the key loaded" -- i.e. finding the memory location that contains the key -- is O(1) on average in all practical hash tables. It does not typically require any traversal at all.

You keep on talking about ordered tables like red black trees, as in this comment, which is another sign that makes me wonder if you might be confused.


The early versions of String.hashCode in Java only looked at the first 16 characters of the string - when used for things like URLs this led to rather poor performance!

https://en.wikipedia.org/wiki/Java_hashCode()#The_java.lang....
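The failure mode is easy to reproduce in C++ with a deliberately bad, prefix-only hash (this is an illustration of the pitfall, not the actual Java code):

```cpp
#include <algorithm>
#include <functional>
#include <string>

// A hash that only reads a fixed-length prefix maps every key sharing
// that prefix to the same value -- e.g. URLs under one domain all
// collide, degrading lookups to a linear scan of one bucket's chain.
size_t prefix_hash(const std::string& s) {
    return std::hash<std::string>{}(
        s.substr(0, std::min<size_t>(16, s.size())));
}
```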


Even if the typical bucket has 100 members, as long as this number is constant and does not go up with the size of the hash table, the performance is still O(1). And the same applies for cache misses. All these things don't really matter except if you are using hash tables with not very many elements in them.


> Even if the typical bucket has 100 members, as long as this number is constant and does not go up with the size of the hash table, the performance is still O(1).

If you mean in a chained hash table, where you chase up to 100 pointers to get the value, the performance is atrocious.

Friendly reminder: traversal of a linked list and a contiguous array are both O(n). In the real world one is two orders of magnitude faster than the other.

> All these things don't really matter except if you are using hash tables with not very many elements in them.

The lookup of a value given a key is probably the least affected operation. If all you care about is a "weak dictionary" (you only really need to store values and lookup from keys), all of this is mostly jabber. If you store the keys to iterate through, or access for some reason, all those things start to matter a whole lot.


> For instance, std::unordered_map in C++ is based around linked lists of key/value buckets. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast.

Has this changed recently? Last time I used unordered_map, insertion was also slow because it had to allocate a linked list entry for every item in the hash table.


No, it's effectively required by the standard: the pointer/reference stability guarantees force a node-based (chained) implementation, so standard library implementers can't do it differently.

Here are a few alternatives:

MCT's closed_hash_map

Google Sparsehash's sparse_hash_map and dense_hash_map

Loki::AssocVector

boost::container::flat_map



