Another thing that matters enormously is branch misprediction. For example, super scalar sample sort beats GNU's (stupefyingly fast) std::sort by 35-50% on average. I benchmarked skewed quicksort vs std::sort vs super scalar sample sort in a totally artificial setting (sorting a permutation of 0 to n-1) for a lecture: https://github.com/lorenzhs/quicksort-pivot-imbalance
(Repository does not contain the super scalar sample sort code, I don't have the rights to publish that unfortunately. I used the authors' old Pentium 4 optimized code from the paper linked at the bottom (incl PDF).)
Edit: super scalar sample sort also has large data locality benefits over quicksort, the "wavy" line for ssssort shows the caches' impact. I haven't studied that aspect in detail yet.
Very cool. In my experience, using a skewed pivot doesn't actually help in practice. Outside of the case you mention (where the input is a random permutation), sampling must be used to choose a skewed pivot, which has a fairly significant overhead.
But using a skewed pivot is not my point. It would not be particularly clever to respond to the problem of branch mispredictions by doing suboptimal things. The solution is to use algorithms that avoid hard-to-predict branches and exploit data locality -- hence super scalar sample sort, which is significantly faster than skewed quicksort even in the case where quicksort gets the pivot for free.
Regarding the sampling overhead -- super scalar sample sort samples sqrt(n) elements, sorts them, and picks a fixed number of pivots from them (e.g. 256 in the implementation I used). That is worth it, because finding the "bucket" (between which two pivots) an element belongs in then requires no branches, using predicated instructions. So the overhead is not the problem: as you can see in the other plots (in /plots), ssssort executes more instructions, yet it is still significantly faster than the other algorithms because it exploits the properties of the CPU.
Sure. I didn't actually mean ssssort. Sometimes it's worth distinguishing between comparison-based sorts and those that make use of key structure, assume a fixed-size key universe, etc. For that reason, I think the sampling overhead is worth considering.