This article overlooks a major factor in bit-twiddling performance on modern CPUs: saturation of the execution ports in a CPU core.
An Intel i7 core has six execution ports, three of which are ALUs of various types. Depending on the specific instruction and the dependencies between instructions, the CPU can execute up to 3 simple integer operations every clock cycle mixed with operations like loads and stores at the same time. For most algorithms, particularly those that are not carefully designed, multiple execution ports may be sitting idle for a given clock cycle. (Hyper-threads work by opportunistically using these unused execution ports.)
Consequently, algorithms with a few extra operations but more operation parallelism will frequently be faster than an equivalent algorithm where the operations are necessarily serialized in the CPU.
Furthermore, the compiler and CPU may have a difficult time discerning when instructions in some algorithms can be executed in parallel across execution ports. Seemingly null changes to the implementation of such algorithms, such as splitting the work across two accumulator variables and combining them at the end (where any normal programmer would just use one variable), can have a large impact on performance. I once doubled the performance of a bit-twiddling algorithm simply by using three accumulator variables instead of one. The algorithm was identical, but the use of three registers exposed the available parallelism to the CPU.
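As a hypothetical illustration of that accumulator-splitting idea (the workload here, summing popcounts over a buffer, is my own example, not the algorithm from the anecdote):

```c
#include <stdint.h>
#include <stddef.h>

/* Single accumulator: every addition depends on the previous one,
 * so the loop is one serial dependency chain. */
uint64_t popcount_sum_1(const uint64_t *v, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += __builtin_popcountll(v[i]);
    return acc;
}

/* Three accumulators: three independent chains the CPU can schedule
 * across its ALU ports in parallel, combined only at the end. */
uint64_t popcount_sum_3(const uint64_t *v, size_t n) {
    uint64_t a = 0, b = 0, c = 0;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        a += __builtin_popcountll(v[i]);
        b += __builtin_popcountll(v[i + 1]);
        c += __builtin_popcountll(v[i + 2]);
    }
    for (; i < n; i++)              /* leftover elements */
        a += __builtin_popcountll(v[i]);
    return a + b + c;
}
```

(`__builtin_popcountll` is a GCC/Clang builtin; substitute your own popcount on other compilers.)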
This is an excellent point; however, there are a few things to keep in mind: first, compilers can (and do) perform this optimization for you (ignoring details about re-associating floating point, since we're talking about bit twiddling).
Second, bit-reversal never exists in a vacuum. There are other operations taking place around it, which will fill in unused execution resources, thanks to out-of-order execution. (And as you note, hyper threading will take advantage of them too).
Third, even though there are six ports (actually, 8 ports and 4 ALUs in Haswell![1]), that i7 can still only retire 4 fused uops per cycle, so in practice one thread cannot saturate all of the execution ports, no matter how cleverly it is optimized.
All of this combines to mean that the fastest bit-reversal in isolation may not be the fastest bit-reversal in situ, which is much more important. Actually evaluating that is much more complex, but it does tend to tip things away from chasing too much ILP slightly more than isolated timing does.
Both GCC and Clang are surprisingly mediocre at this kind of optimization. I write a lot of extreme performance integer algorithms and those compilers only seem to find "obvious" parallel instruction schedules about half the time even in isolated contexts.
Fortunately, it is pretty simple to induce the desired optimization from the C code without resorting to much cleverness. The compilers miss these optimizations often enough that I frequently double check if I care. Still, it requires fairly detailed knowledge of the microarchitecture.
I do not do microarchitecture optimization work very often. The last time I did, it was to design a faster, better hash function to replace Google's CityHash (and the result was faster and stronger). For most codes, memory behaviors dominate with respect to performance.
> Both GCC and Clang are surprisingly mediocre at this kind of optimization.
Clang, at least, has gotten significantly better in the past year (I haven’t used GCC for a while), but there’s certainly room for improvement. Please report bugs for cases that your compiler misses if you can spare a few minutes.
Not yet but will soon. It is not just one function but an entire family of functions with some interesting aspects beyond just the algorithms.
The hash functions were algorithmically optimized around a scaffolding I designed that guarantees certain performance characteristics and easy analyzability. It has literally produced many, many thousands of high-quality hash functions. Tens of thousands of CPU hours have been burned on the optimization process, which is still running. The functions put into use so far were early samples pulled out of that process, but the ones still in the pipeline are statistically more robust than those earlier versions. Much easier than trying to design them the old-fashioned way. At some point soon, since the optimization is converging, I will evaluate the most promising parts of the phase space, select the strongest and most aesthetic functions, and publish those into the public domain.
P.S. Scribd stinks. The important numbers are on the hidden second and third pages. Do people know that Scribd is doing this to their documents? That document is CC-BY-NC -- charging money to read pages 2 and 3 or download the original is not NC.
Non-trivial vectorization is an entirely different sort of optimization than re-association to extract ILP (especially in the integer domain where the vector instructions often have no exact non-vector analogue).
Yes, there are people who have the knowledge and experience to regularly beat the compiler. I do it professionally. My point is not "don't bother optimizing, let the compiler do it". My point is "this particular micro-optimization is less valuable in the real world than focused benchmarks suggest, and good compilers often do it (this one specific optimization) for you, anyway."
Was there a followup to that article? I'm pretty sure the NDA period should be up by now and I'm wondering if the author posted a more in-depth explanation and overview.
Note that micro-fusion can allow you to push the number of uops retired per cycle to 5-6 (SNB would be 3 ALU, 2 loads, Haswell could be 4 ALU, 2 loads), as the issue/retire rates are on the fused domain.
It's a far cry from a RISC-like load/store machine.
All that being said, micro-fusion only works with an ALU op and a load from the same instruction, and you obviously have to be careful to ensure that the load applies to the appropriate operand (tricky given the non-orthogonal, frequently two-address form of the instructions).
Very very good points!
Relatedly: for any performance-sensitive code, reading the relevant version of the Intel Optimization Manual plus a book like Hacker's Delight will lead to a lot of good understanding of these tricks.
(admission: i'm spending a lot of my time staring at ways to make it really really easy to write fast numerical codes, so thinking about the ports on modern CPUs is very very helpful)
Could you recommend a good reading (a book preferably) for learning about CPU architecture. I'm aware of Intel Software Developer's Manual. What other resources are worth reading?
Computer Architecture: A Quantitative Approach by Hennessy and Patterson is the definitive text, and it is good (at least the 3rd edition I read back in '02).
One way to put the ideas you learn into practice is to write an optimized large NxN matrix multiplication routine. Start in C, then convert the kernel to x86 assembly. Also disassemble the compiler's output for the C kernel to see what it is doing. See how close you can get to the theoretical peak CPU performance.
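A minimal starting point for that exercise might look like this (the i-k-j loop order is one of the first optimizations worth trying, since it keeps the inner loop unit-stride; blocking and SIMD come after):

```c
#include <stddef.h>

/* Naive NxN matrix multiply, C = A * B, row-major. The i-k-j loop
 * order makes both B and C accesses sequential in the inner loop,
 * which is already a large win over the textbook i-j-k order. */
void matmul(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a = A[i * n + k];          /* hoisted: constant in inner loop */
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```

Comparing the disassembly of this against a blocked and vectorized version is an instructive way to see which transformations the compiler finds on its own.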
It is fun stuff. While this kind of optimization is rarely needed, doing so (like learning lisp) will make you a better programmer.
The multiplication trick reminds me of this StackOverflow answer[1] where an SMT solver (z3) is used to derive mask and multiplier to extract chosen bits from a byte.
Thinking a little further about this, I believe using PSHUFB is the way to go, at least for when the count is large. This is because we can do 2 iterations in essentially one go (haven't tested the code, it's mostly a sketch):
    vmovdqa xmm0, [0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6]
    vmovdqa xmm15, [0x0f, 0x0f, ..., 0x0f]
    vmovdqu xmm7, [rdi]
_loop_body:
    vpand   xmm8, xmm15, [rdi]
    vpsrlw  xmm9, xmm7, 4
    vpand   xmm9, xmm9, xmm15
    vpshufb xmm8, xmm0, xmm8
    vpshufb xmm9, xmm0, xmm9
    vpaddb  xmm8, xmm8, xmm9
    vpshufb xmm7, xmm8, xmm8  ; since sum <= 12, we already have the next sum in the vector!
                              ; xmm7[0] = xmm8[xmm8[0]]
    vpaddb  xmm8, xmm8, xmm7  ; add it
    vpextrb eax, xmm8, 0
    vmovdqu xmm7, [rdi + rax]
    add     rdi, rax
    sub     esi, 2
    jnz     _loop_body
This is likely extendable to 32-byte vectors with AVX2; have not thought much about that case.
This is exciting, but I think I may have tricked you on a couple of details. The actual distance to the next 'key' is 'sum + 5': 00 represents a 1 byte encoding, not zero, so the minimum offset to the next 'key' is 5. Thus the maximum offset is actually 17, which means one can't depend on having two keys within a single 16 byte vector. I'm trying to figure out if there is some way to compensate for this without halving the performance.
without too much of a performance penalty. The adjustments to offsets then can be put into the shuffle tables, so there should be no further significant performance loss.
Yes, I think that should work to guarantee two per vector. I hadn't previously considered trying to do that, and I appreciate the suggestion and the sketch. I think I have a slightly faster (7-cycle) approach doing one at a time, using a 64-bit register as a lookup for the sum of the middle two fields, but this has good promise. Especially if we can get out one farther ahead, so that instead of having the vector reload on the critical path, the unused portion of the current vector and a preload can be 'slid' into place. Do you know of a good way to simulate a PALIGNR with a non-immediate operand? This might get down to 9-10 cycles for two keys.
On second thought, we can possibly do something similar on Intel using 2 pshufb and a blend.
I tried for a bit, but haven't figured out how to make that work. PSHUFB needs a different XMM operand for each 'rotate'. Loading this operand would take 6 cycles, and I haven't thought of a clever way of generating it in less.
I do greatly appreciate your help, though. Thanks!
Thanks for looking at this! This is wonderful and succinct (presuming it works; I'm still staring at it), but the 20+ cycle latency of DIV makes it impractical. I think I can do it with multiply, AND, multiply, shift, but even the 3-cycle latency for the multiplies is too much.
The author seems to misunderstand the idea of asymptotic complexity. All of the reversal operations are O(1) because the number of bits being flipped is a constant. If he were concerned with flipping the bits in an arbitrary precision number, then his different solutions might deserve "Big-O" classifications.
Second point: the reason the original solution is slow is that a mod operation by a number that is not a power of two involves an integer divide, or several multiply-shift operations at extended precision. Either of those is slower than any of the other methods.
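For reference, the mod-based version under discussion is presumably the well-known 3-operation trick (64-bit multiply, mask, modulus); it is compact, but the modulus by 1023 is exactly the slow non-power-of-two operation described above:

```c
#include <stdint.h>

/* Reverse the bits of one byte with a 64-bit multiply, a mask, and a
 * modulus. Compact, but '% 1023' is not a power of two, so it compiles
 * to a division or a multiply-shift sequence, which is why this loses
 * to the table and shuffle methods in practice. */
uint8_t reverse_byte_mod(uint8_t b) {
    return (uint8_t)((b * 0x0202020202ULL & 0x010884422010ULL) % 1023);
}
```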
Interestingly, while x86-64 does not seem to have a single opcode for reversing the bits in a byte, it does have an instruction to arbitrarily shuffle the 16 bytes in a 128-bit SSE register [PSHUFB]. It just blows my mind how much data those SIMD instructions process or move around in relatively few clock cycles.
It’s actually shocking how long it took Intel to add PSHUFB to SSE. Altivec (PPC) had the even-more-powerful vperm (arbitrary shuffle mapping 32B to 16B) way back in 1999.
Like my sibling posted, the crazy CISCy instructions aren’t comparable because in general they were no faster than an equivalent sequence of simpler instructions. That’s not the case for permute; there are no “simpler” instructions that let you build an efficient permute. It’s one the fundamental building blocks for efficient vector code -- that’s why it’s shocking that it was added to SSE so late.
The point is that bit twiddling can be much more efficient to implement in hardware because all you're doing is placing wires somewhere. The RBIT instruction in the article significantly speeds up an operation at very low hardware cost.
Polynomial evaluation does not fit into this pattern, because you need actual arithmetic operations to do it, and so a hardware polynomial evaluation instruction has no significant benefit over the corresponding sequence of explicit multiplications and additions.
In Intel x86/64, the fastest way I know of is to use SIMD instructions, and break the 64-bit word into 16 nibbles (4-bit pieces), and use PSHUFB to perform a parallel lookup against another 128-bit xmm register. Then you aggregate the nibbles in reverse order, using inclusive or and variants of the shuffle instruction.
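A scalar model of that approach, for reference (PSHUFB performs 16 of these table lookups in a single instruction; this sketch does them serially):

```c
#include <stdint.h>

/* 16-entry table: rev4[n] is n with its 4 bits reversed. This is the
 * same table a PSHUFB implementation would hold in an xmm register. */
static const uint8_t rev4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse all 64 bits: look up each nibble's reversal and emit the
 * nibbles in reverse order. The first (lowest) nibble processed ends
 * up in the most significant position. */
uint64_t reverse_bits64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (r << 4) | rev4[x & 0xF];
        x >>= 4;
    }
    return r;
}
```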
Yep. I thought this would be a huge issue when using this, but first, it's really necessary for the instruction set, and second, a lot of hardware uses 18-bit, including FPGAs (often packed with 18x18 multipliers and 18-bit SRAMs, in order to support 8b/10b SERDES) and 72-bit DDR3.
The article makes the point that the lookup table version is fast because the table fits in D$, and that if the table were evicted it would be slower. This is true, but the more interesting point is that by loading this table into D$, you're potentially slowing down other operations.
It's an important conundrum of optimization that if you had 20 similarly complex functions in a critical path, implementing and benchmarking each individually with a lookup table could show excellent performance while globally performance is terrible. And worse, it's uniformly terrible, with no particular function seeming to be consuming an inordinate amount of the runtime or, for that matter, D$ misses.
If you had 20 similar functions, the tables would occupy 5k in total, using only 1/6th of the L1 D$ on a typical "big" CPU. In actuality, temporal locality is such that you don't often stride through all table entries uniformly, so the actual cache pressure is even less.
The point that you're going after is a good one, but its important to keep in mind how enormous modern memory hierarchies are. It often is very reasonable to trade memory and cache pressure for speed.
I'm glad they noted that the lookup table's speed relies on it being in cache, which most "benchmark magic bit-fiddling operations" posts ignore. (Although it's temporal locality, not cache coherence, that's important for this.)
Along the same vein, Andrew Dalke wrote up an interesting series of blog posts benchmarking different implementations of population count (counting the number of set bits in a word):
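For context, the branch-free SWAR implementation those popcount comparisons usually include looks like this (a standard sketch, not code from those posts):

```c
#include <stdint.h>

/* Classic SWAR population count: sum adjacent bits into 2-bit fields,
 * then 4-bit fields, then bytes, and fold the byte sums together with
 * one multiply. Branch-free and table-free, so it has no cache
 * footprint at all, unlike a lookup-table popcount. */
int popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                       /* 2-bit sums */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                       /* byte sums */
    return (int)((x * 0x0101010101010101ULL) >> 56);                  /* total in top byte */
}
```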
"Intel x86/x64 processors don’t have this instruction, so this is definitely not a portable solution."
This stuck out to me. I know that RISC vs CISC is basically a meaningless distinction nowadays, but I still naively expected that x86 would be more-or-less a strict superset of ARM.
Strictly speaking, AMD's XOP extension does have an instruction that is close enough: VPPERM. It allows you not only to shuffle bytes, like the already-mentioned PSHUFB, but also to reverse the bits within each byte. Therefore, a single VPPERM instruction can reverse up to 128 bits at a time.
Modern ARM has lots of instructions that don’t have direct x86 equivalents. Most are in the vector domain, but there are plenty of non-vector examples too: BFI, BFC, BIC, ORN, RSB, saturating arithmetic, numerous multiply-add variants, etc.
Is that true? I would agree that the time is bounded by a constant, but Big O only makes sense at all as the size of the input increases without bound.
So every terminating function is O(1) since your computer has only a finite number of possible states!
The real question is whether the input is big enough that the cost is dominated by the asymptotic behavior and not the constant coefficient. The O(N) "obvious" algorithm was faster than the O(1) "3 ops 64-bit algorithm," so I think the answer is no, it is not big enough. N=8 is sufficiently small that the asymptotic complexity is irrelevant.
I think the logical way to look at algorithmic efficiency starts by picking a reasonable N. "Number of bits in a byte" is something that very rarely changes, and never reaches high values, so it makes a bad N. The flip side of this is that something like "number of bits in main memory" is very flexible and reaches extremely large numbers, so it shouldn't be a constant.
If I wasn't making a point about different methods to flip bits, and I was just naively classifying these byte flippers, I would probably call all of these O(1). Or perhaps O(N) where N is the size of input in bytes.
Great analysis, although I'm curious how the idea of doing a bunch of 64-bit ops to accomplish byte arithmetic came about to begin with - was the function in question not written by a firmware guy?
I think this one goes back to PDP days and wasn't necessarily written to be the fastest possible implementation. The 36-bit PDPs could do a 36x36 multiply into 72 bits. Not sure how the modulo operation performed, but there was a DIV instruction.
Or the programmer was a firmware hacker, and she knew that RBIT is ARMv6T2, which IIRC wasn't available until the iPhone 3GS. (Not 100% certain, and I don't have my manuals handy)
... in which case she would have used a lookup table, unless she was a really old-school firmware hacker and still believed that you couldn’t justify 256B for the table. (FWIW, you’re right about ARMv6T2).
I am. 256 bytes pain me :) (I've worked on systems with that much RAM total)
Kidding aside, I'd probably not go for the lookup table unless the whole thing was necessary in an inner loop - the cache miss cost is high. And since it's in a function, it better not be in an inner loop :)
Surely there is an argument that if it's called infrequently enough that cache misses on the LUT are a problem, then it's also called so infrequently that its performance is irrelevant.
256B is quite large for sure, but what about going for 2 nibble lookups in a 16 byte table? Or a 2 bit swap and a 6 bit lookup in a 64B table? (current x86-64 CPUs typically have 64B L1 cache lines?)
8 words to an L1 line on the original iPhone ARM, so yes, you'd fit it into a smaller table. You'll still face the memory latency issue if you're using this outside a tight loop. (Cache is only 4-way set associative. But at least I & D cache are separate)
But really, if you spend that much time thinking about the performance, you really shouldn't have that abstracted into a function. Calling that costs cycles and blows out one entry in your return stack - which makes a more costly mispredict later on likely.
Why yes, yes I do miss fiddling around with low level details. :)
If you're talking to a dev with a background in firmware, I'd say you'd be hard pressed to find anybody who'll put a single asm instruction into a separate function if it's speed-critical.
Sure, the compiler might (and probably will) inline, but it means any change in your tool chain or a whim of the inline heuristic can cause serious performance regressions that are completely avoidable.
When it comes to performance, the embedded mindset will always be "belt and suspenders" ;)
Expanding slightly, there’s a permutation that needs to happen in order to efficiently perform a DFT in-place (and the same approach is often used even when the transform is out-of-place). For power of two sizes (one of the most common cases), that permutation is precisely the same as a bit reversal of the indices.
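Concretely, for a power-of-two transform length N = 2^k, that permutation sends index i to the value of i's low k bits reversed; a minimal sketch:

```c
/* Bit-reversal permutation index for an FFT of length 2^k: reverse
 * the low k bits of i. Element i of the input belongs at position
 * bitrev(i, k) of the reordered array; the permutation is its own
 * inverse, so the same function maps in both directions. */
unsigned bitrev(unsigned i, unsigned k) {
    unsigned r = 0;
    for (unsigned b = 0; b < k; b++) {
        r = (r << 1) | (i & 1);   /* shift next low bit into the high end */
        i >>= 1;
    }
    return r;
}
```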
Besides the FFT addressing, I have encountered SPI devices that wanted to be addressed in LSb-first manner. Some micros let you choose which end goes out first, some don't. In the latter case you find yourself swapping bits in software.
Passwords for the RFB protocol (VNC) are used to DES-encrypt a challenge token sent by the server. However, each byte in the key is reversed before it is used for encryption.
Of course, this happens only once per connection so there's generally not much of a need for it to be particularly fast.
Endian-ness would be reversing bytes, not bits within bytes, like 0x1234 -> 0x3412. What we're talking about here would be more along the lines of: 0b0010001 -> 0b1000100
The most obvious application I can think of for reversing bits within a byte would be for image processing applications, such as mirroring an image horizontally, or making kaleidoscopes. There are probably signal processing applications, too...
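For instance, mirroring a row of a 1-bit-per-pixel image is byte-order reversal plus per-byte bit reversal; a hypothetical sketch (the function names are mine, and it assumes MSB-first pixel order and a row width that is a whole number of bytes):

```c
#include <stdint.h>
#include <stddef.h>

/* Simple shift-loop byte reversal; any of the reversal methods from
 * the article could be dropped in here instead. */
static uint8_t rev8(uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1));
        b >>= 1;
    }
    return r;
}

/* Mirror one scanline of a 1-bpp image in place: swap the bytes
 * end-for-end and bit-reverse each one. */
void mirror_row_1bpp(uint8_t *row, size_t nbytes) {
    if (nbytes == 0)
        return;
    for (size_t i = 0, j = nbytes - 1; i <= j && j < nbytes; i++, j--) {
        uint8_t a = rev8(row[i]);
        uint8_t b = rev8(row[j]);
        row[i] = b;
        row[j] = a;
    }
}
```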
It was at one time standard to store each bitplane separately. If you had a 4-color 320x200 image, for example, you'd have one page of VRAM that stored a 320x200 1-bit image, holding all the bit 0s, and another, exactly the same, holding all the bit 1s. And so on up to as many bit planes as required. (That was one usual arrangement, but there are other options - e.g., interleaved bitplanes and/or no video RAM as such.)
Separate bitplanes were very annoying in many respects, and the demise of the approach was probably regretted by only a few. But storing each bitplane separately does have one major advantage: you can just write all your algorithms to operate on 1-bit images, and they automatically run at any bit depth. Just run the routine once for each bitplane you're interested in processing.
(That may also mean the hardware is simpler to implement - one plausible excuse for its ubiquity. Not my field of expertise though...)
If you've ever dealt with graphics file manipulation code, chances are you've suffered the pain of changing the endian-ness of an image file. I never understood why some of these operations are not implemented as machine instructions that can run in one instruction cycle flat. There's nothing to them; I've done exactly that on FPGAs. Yes, they can be a little resource/routing intensive, but not that bad.
gcc has a __builtin_bswap16, __builtin_bswap32, and __builtin_bswap64 which will presumably take advantage of these built-in instructions on x86 and any other gcc-supported architectures where similar instructions exist (and fall back to a reasonably fast and well-tested multi-instruction implementation where they don't).
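A minimal sketch of those builtins in use (GCC/Clang extensions, not standard C; on x86 each compiles down to a single BSWAP, or a rotate for the 16-bit case):

```c
#include <stdint.h>

/* Thin wrappers around the GCC/Clang byte-swap builtins. */
uint16_t swap16(uint16_t x) { return __builtin_bswap16(x); }
uint32_t swap32(uint32_t x) { return __builtin_bswap32(x); }
uint64_t swap64(uint64_t x) { return __builtin_bswap64(x); }
```

Note these reverse bytes within a word, not bits within a byte, which is the distinction drawn in the sibling comment.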
You should really RTFM every couple years, just to know what your processor [1] and compiler [2] can do.
Note that this post is about changing the order of bits within a byte (eg. x86-64 XOP instruction VPPERM with bit reversal option). Whereas you are talking about changing the order of bytes within a word (x86 instruction BSWAP).
Depending on timing requirements, device type, operating speed and word width you have to add one or more layers of flip-flops to facilitate timing closure and avoid potential metastability issues.
Right, but that's true of all CPU instructions. If you already have an ALU capable of doing things like integer multiplication, would adding what is essentially a bunch of chained flip-flops really add much more complexity or resource usage?
Oh man, this is one of my nits. I've written code like this. For example, in some bit-counting code I have a block comment in front of it all with 57 non-blank lines. I have a copy of Hacker's Delight on my bookshelf, but will the person after me know what that code does and how it works? I really hope there was at least a comment before it pointing to one of the bit-hack web pages.
The difference here between the obvious method and the best method is 55ns. Is there a reason this problem deserves this much attention for so little a difference in time? (I realize it is 6.5x more, but if it isn't at the center of some core loop, the multiplicative factor doesn't really matter.) What use cases are there for this?
I suppose the point was to show that it's pretty bad to resort to copy/pasting clever bit hacks into libraries without taking care of how they work.
The fact that the code isn't necessarily obvious makes me think that whoever used it was hoping for an optimisation of sorts.
Terseness can lead to obfuscation, and that's the wrong sort of optimisation. So we can hope that the developer was going for speed instead, but the results show that was a huge failure.
Maybe this won't affect performance in this particular library, maybe it's called once or twice and it doesn't matter, but if this is part of the innards of a game or a cryptographic function or some low-level network stack, it could have very detrimental consequences on performance.
> That’s one mathematical operation, but a large number of CPU instructions. CPU instructions are what matter here, though, as we see, not as much as cache coherency.
I thought it was also a single CPU instruction, but multiple clock cycles.