This article overlooks a major factor in bit-twiddling performance on modern CPUs: saturation of the execution ports in a CPU core.
An Intel i7 core has six execution ports, three of which are ALUs of various types. Depending on the specific instruction and the dependencies between instructions, the CPU can execute up to 3 simple integer operations every clock cycle mixed with operations like loads and stores at the same time. For most algorithms, particularly those that are not carefully designed, multiple execution ports may be sitting idle for a given clock cycle. (Hyper-threads work by opportunistically using these unused execution ports.)
Consequently, algorithms with a few extra operations but more operation parallelism will frequently be faster than an equivalent algorithm where the operations are necessarily serialized in the CPU.
Furthermore, the compiler and CPU may have a difficult time discerning when instructions in some algorithms can be executed in parallel across execution ports. Seemingly null changes to the implementation of such algorithms, such as splitting the work across two accumulator variables and combining them at the end (where any normal programmer would just use one variable), can have a large impact on performance. I once doubled the performance of a bit-twiddling algorithm simply by using three accumulator variables instead of one. The algorithm was identical, but the use of three registers exposed the available parallelism to the CPU.
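As a hypothetical illustration of that accumulator-splitting idea (the workload here, summing popcounts over a buffer, is my own example, not the algorithm from the anecdote):

```c
#include <stdint.h>
#include <stddef.h>

/* Single accumulator: every addition depends on the previous one,
 * so the loop is one serial dependency chain. */
uint64_t popcount_sum_1(const uint64_t *v, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += __builtin_popcountll(v[i]);
    return acc;
}

/* Three accumulators: three independent chains the CPU can schedule
 * across its ALU ports in parallel, combined only at the end. */
uint64_t popcount_sum_3(const uint64_t *v, size_t n) {
    uint64_t a = 0, b = 0, c = 0;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        a += __builtin_popcountll(v[i]);
        b += __builtin_popcountll(v[i + 1]);
        c += __builtin_popcountll(v[i + 2]);
    }
    for (; i < n; i++)              /* leftover elements */
        a += __builtin_popcountll(v[i]);
    return a + b + c;
}
```

(`__builtin_popcountll` is a GCC/Clang builtin; substitute your own popcount on other compilers.)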
This is an excellent point; however, there are a few things to keep in mind: first, compilers can (and do) perform this optimization for you (ignoring details about re-associating floating point, since we're talking about bit twiddling).
Second, bit-reversal never exists in a vacuum. There are other operations taking place around it, which will fill in unused execution resources, thanks to out-of-order execution. (And as you note, hyper threading will take advantage of them too).
Third, even though there are six ports (actually, 8 ports and 4 ALUs in Haswell![1]), that i7 can still only retire 4 fused uops per cycle, so in practice one thread cannot saturate all of the execution ports, no matter how cleverly it is optimized.
All of this combines to mean that the fastest bit-reversal in isolation may not be the fastest bit-reversal in situ, which is much more important. Actually evaluating that is much more complex, but it does tend to tip things away from chasing too much ILP slightly more than isolated timing does.
Both GCC and Clang are surprisingly mediocre at this kind of optimization. I write a lot of extreme performance integer algorithms and those compilers only seem to find "obvious" parallel instruction schedules about half the time even in isolated contexts.
Fortunately, it is pretty simple to induce the desired optimization from the C code without resorting to much cleverness. The compilers miss these optimizations often enough that I frequently double check if I care. Still, it requires fairly detailed knowledge of the microarchitecture.
I do not do microarchitecture optimization work very often. The last time I did, it was to design a faster, better hash function to replace Google's CityHash (and the result was faster and stronger). For most codes, memory behaviors dominate with respect to performance.
> Both GCC and Clang are surprisingly mediocre at this kind of optimization.
Clang, at least, has gotten significantly better in the past year (I haven’t used GCC for a while), but there’s certainly room for improvement. Please report bugs for cases that your compiler misses if you can spare a few minutes.
Not yet but will soon. It is not just one function but an entire family of functions with some interesting aspects beyond just the algorithms.
The hash functions were algorithmically optimized around a scaffolding I designed that guarantees certain performance characteristics and easy analyzability. It has literally produced many, many thousands of high-quality hash functions. Tens of thousands of CPU hours have been burned on the optimization process, which is still running. The functions put into use so far were early samples pulled out of that process, but the ones still in the pipeline are statistically more robust than those earlier versions. Much easier than trying to design them the old-fashioned way. At some point soon, since the optimization is converging, I will evaluate the most promising parts of the phase space, select the strongest and most aesthetic functions, and publish those into the public domain.
P.S. Scribd stinks. The important numbers are on the hidden second and third pages. Do people know that Scribd is doing this to their documents? That document is CC-BY-NC -- charging money to read pages 2 and 3 or download the original is not NC.
Non-trivial vectorization is an entirely different sort of optimization than re-association to extract ILP (especially in the integer domain where the vector instructions often have no exact non-vector analogue).
Yes, there are people who have the knowledge and experience to regularly beat the compiler. I do it professionally. My point is not "don't bother optimizing, let the compiler do it". My point is "this particular micro-optimization is less valuable in the real world than focused benchmarks suggest, and good compilers often do it (this one specific optimization) for you, anyway."
Was there a followup to that article? I'm pretty sure the NDA period should be up by now and I'm wondering if the author posted a more in-depth explanation and overview.
Note that micro-fusion can allow you to push the number of uops retired per cycle to 5-6 (SNB would be 3 ALU, 2 loads, Haswell could be 4 ALU, 2 loads), as the issue/retire rates are on the fused domain.
It's a far cry from a RISC-like load/store machine.
All that being said, micro-fusion only works with an ALU op and a load from the same instruction, and you obviously have to be careful to ensure that the load applies to the appropriate operand (tricky given the non-orthogonal, frequently two-address form of the instructions).
Very very good points!
Relatedly: for any performance-sensitive code, reading the relevant version of the Intel Optimization Manual plus a book like Hacker's Delight will lead to a lot of good understanding of these tricks.
(admission: i'm spending a lot of my time staring at ways to make it really really easy to write fast numerical codes, so thinking about the ports on modern CPUs is very very helpful)
Could you recommend a good reading (a book preferably) for learning about CPU architecture. I'm aware of Intel Software Developer's Manual. What other resources are worth reading?
Computer Architecture: A Quantitative Approach by Hennessy and Patterson is the definitive text, and it is good (at least the 3rd edition I read back in '02).
One way to put the ideas you learn into practice is to write an optimized large NxN matrix multiplication routine. Start in C, then convert the kernel to x86 assembly. Also disassemble the compiler's output for the C kernel to see what it is doing. See how close you can get to the theoretical peak CPU performance.
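A minimal starting point for that exercise might look like this (the i-k-j loop order is one of the first optimizations worth trying, since it keeps the inner loop unit-stride; blocking and SIMD come after):

```c
#include <stddef.h>

/* Naive NxN matrix multiply, C = A * B, row-major. The i-k-j loop
 * order makes both B and C accesses sequential in the inner loop,
 * which is already a large win over the textbook i-j-k order. */
void matmul(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a = A[i * n + k];          /* hoisted: constant in inner loop */
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```

Comparing the disassembly of this against a blocked and vectorized version is an instructive way to see which transformations the compiler finds on its own.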
It is fun stuff. While this kind of optimization is rarely needed, doing so (like learning lisp) will make you a better programmer.
The multiplication trick reminds me of this StackOverflow answer[1] where an SMT solver (z3) is used to derive mask and multiplier to extract chosen bits from a byte.
Thinking a little further about this, I believe using PSHUFB is the way to go, at least for when the count is large. This is because we can do 2 iterations in essentially one go (haven't tested the code, it's mostly a sketch):
    vmovdqa xmm0, [0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6]
    vmovdqa xmm15, [0x0f, 0x0f, ..., 0x0f]
    vmovdqu xmm7, [rdi]
_loop_body:
    vpand   xmm8, xmm15, [rdi]
    vpsrlw  xmm9, xmm7, 4
    vpand   xmm9, xmm9, xmm15
    vpshufb xmm8, xmm0, xmm8
    vpshufb xmm9, xmm0, xmm9
    vpaddb  xmm8, xmm8, xmm9
    vpshufb xmm7, xmm8, xmm8  ; since sum <= 12, we already have the next sum in the vector!
                              ; xmm7[0] = xmm8[xmm8[0]]
    vpaddb  xmm8, xmm8, xmm7  ; add it
    vpextrb eax, xmm8, 0
    vmovdqu xmm7, [rdi + rax]
    add     rdi, rax
    sub     esi, 2
    jnz     _loop_body
This is likely extendable to 32-byte vectors with AVX2; have not thought much about that case.
This is exciting, but I think I may have tricked you on a couple of details. The actual distance to the next 'key' is 'sum + 5': 00 represents a 1 byte encoding, not zero, so the minimum offset to the next 'key' is 5. Thus the maximum offset is actually 17, which means one can't depend on having two keys within a single 16 byte vector. I'm trying to figure out if there is some way to compensate for this without halving the performance.
without too much of a performance penalty. The adjustments to offsets then can be put into the shuffle tables, so there should be no further significant performance loss.
Yes, I think that should work to guarantee two per vector. I hadn't previously considered trying to do that, and I appreciate the suggestion and the sketch. I think I have a slightly faster (7-cycle) approach doing one at a time, using a 64-bit register as a lookup for the sum of the middle two fields, but this has good promise. Especially if we can get out one farther ahead, so that instead of having the vector reload on the critical path, the unused portion of the current vector and a preload can be 'slid' into place. Do you know of a good way to simulate a PALIGNR with a non-immediate operand? This might get down to 9-10 cycles for two keys.
On second thought, we can possibly do something similar on Intel using 2 pshufb and a blend.
I tried for a bit, but haven't figured out how to make that work. PSHUFB needs a different XMM operand for each 'rotate'. Loading this operand would take 6 cycles, and I haven't thought of a clever way of generating it in less.
I do greatly appreciate your help, though. Thanks!
Thanks for looking at this! This is wonderful and succinct (presuming it works; I'm still staring at it), but the 20+ cycle latency of DIV makes it impractical. I think I can do it with multiply, AND, multiply, shift, but even the 3-cycle latency for the multiplies is too much.
The author seems to misunderstand the idea of asymptotic complexity. All of the reversal operations are O(1) because the number of bits being flipped is a constant. If he were concerned with flipping the bits in an arbitrary precision number, then his different solutions might deserve "Big-O" classifications.
Second point: the reason the original solution is slow is that a mod operation by a number that is not a power of two involves an integer divide, or several multiply-shift operations at extended precision. Either of those is slower than any of the other methods.
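For reference, the mod-based version under discussion is presumably the well-known 3-operation trick (64-bit multiply, mask, modulus); it is compact, but the modulus by 1023 is exactly the slow non-power-of-two operation described above:

```c
#include <stdint.h>

/* Reverse the bits of one byte with a 64-bit multiply, a mask, and a
 * modulus. Compact, but '% 1023' is not a power of two, so it compiles
 * to a division or a multiply-shift sequence, which is why this loses
 * to the table and shuffle methods in practice. */
uint8_t reverse_byte_mod(uint8_t b) {
    return (uint8_t)((b * 0x0202020202ULL & 0x010884422010ULL) % 1023);
}
```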
Interestingly, while x86-64 does not seem to have a single opcode for reversing the bits in a byte, it does have an instruction to arbitrarily shuffle the 16 bytes in a 128-bit SSE register [PSHUFB]. It just blows my mind how much data those SIMD instructions process or move around in relatively few clock cycles.
It’s actually shocking how long it took Intel to add PSHUFB to SSE. Altivec (PPC) had the even-more-powerful vperm (arbitrary shuffle mapping 32B to 16B) way back in 1999.
Like my sibling posted, the crazy CISCy instructions aren’t comparable because in general they were no faster than an equivalent sequence of simpler instructions. That’s not the case for permute; there are no “simpler” instructions that let you build an efficient permute. It’s one the fundamental building blocks for efficient vector code -- that’s why it’s shocking that it was added to SSE so late.
The point is that bit twiddling can be much more efficient to implement in hardware because all you're doing is placing wires somewhere. The RBIT instruction in the article significantly speeds up an operation at very low hardware cost.
Polynomial evaluation does not fit into this pattern, because you need actual arithmetic operations to do it, and so a hardware polynomial evaluation instruction has no significant benefit over the corresponding sequence of explicit multiplications and additions.
In Intel x86/64, the fastest way I know of is to use SIMD instructions, and break the 64-bit word into 16 nibbles (4-bit pieces), and use PSHUFB to perform a parallel lookup against another 128-bit xmm register. Then you aggregate the nibbles in reverse order, using inclusive or and variants of the shuffle instruction.
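A scalar model of that approach, for reference (PSHUFB performs 16 of these table lookups in a single instruction; this sketch does them serially):

```c
#include <stdint.h>

/* 16-entry table: rev4[n] is n with its 4 bits reversed. This is the
 * same table a PSHUFB implementation would hold in an xmm register. */
static const uint8_t rev4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse all 64 bits: look up each nibble's reversal and emit the
 * nibbles in reverse order. The first (lowest) nibble processed ends
 * up in the most significant position. */
uint64_t reverse_bits64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (r << 4) | rev4[x & 0xF];
        x >>= 4;
    }
    return r;
}
```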
Yep. I thought this would be a huge issue when using this, but first, it's really necessary for the instruction set, and second, a lot of hardware uses 18-bit, including FPGAs (often packed with 18x18 multipliers and 18-bit SRAMs, in order to support 8b/10b SERDES) and 72-bit DDR3.
The article makes the point that the lookup table version is fast because the table fits in D$, and that if the table were evicted it would be slower. This is true, but the more interesting point is that by loading this table into D$, you're potentially slowing down other operations.
It's an important conundrum of optimization that if you had 20 similarly complex functions in a critical path, implementing and benchmarking each individually with a lookup table could show excellent performance while globally performance is terrible. And worse, it's uniformly terrible, with no particular function seeming to be consuming an inordinate amount of the runtime or, for that matter, D$ misses.
If you had 20 similar functions, the tables would occupy 5k in total, using only 1/6th of the L1 D$ on a typical "big" CPU. In actuality, temporal locality is such that you don't often stride through all table entries uniformly, so the actual cache pressure is even less.
The point that you're going after is a good one, but its important to keep in mind how enormous modern memory hierarchies are. It often is very reasonable to trade memory and cache pressure for speed.
I'm glad they noted that the lookup table's speed relies on it being in cache, which most "benchmark magic bit-fiddling operations" posts ignore. (Although it's temporal locality, not cache coherence, that's important for this.)
Along the same vein, Andrew Dalke wrote up an interesting series of blog posts benchmarking different implementations of population count (counting the number of set bits in a word):
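For context, the branch-free SWAR implementation those popcount comparisons usually include looks like this (a standard sketch, not code from those posts):

```c
#include <stdint.h>

/* Classic SWAR population count: sum adjacent bits into 2-bit fields,
 * then 4-bit fields, then bytes, and fold the byte sums together with
 * one multiply. Branch-free and table-free, so it has no cache
 * footprint at all, unlike a lookup-table popcount. */
int popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                       /* 2-bit sums */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                       /* byte sums */
    return (int)((x * 0x0101010101010101ULL) >> 56);                  /* total in top byte */
}
```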
"Intel x86/x64 processors don’t have this instruction, so this is definitely not a portable solution."
This stuck out to me. I know that RISC vs CISC is basically a meaningless distinction nowadays, but I still naively expected that x86 would be more-or-less a strict superset of ARM.
Strictly speaking, AMD's XOP extension does have an instruction that is close enough: VPPERM. It allows you not only to shuffle bytes, like the already-mentioned PSHUFB, but also to reverse the bits within each byte. Therefore, a single VPPERM instruction can reverse up to 128 bits at a time.
Modern ARM has lots of instructions that don’t have direct x86 equivalents. Most are in the vector domain, but there are plenty of non-vector examples too: BFI, BFC, BIC, ORN, RSB, saturating arithmetic, numerous multiply-add variants, etc.
Is that true? I would agree that the time is bounded by a constant, but Big O only makes sense at all as the size of the input increases without bound.
So every terminating function is O(1) since your computer has only a finite number of possible states!
The real question is whether the input is big enough that the cost is dominated by the asymptotic behavior and not the constant coefficient. The O(N) "obvious" algorithm was faster than the O(1) "3 ops 64-bit algorithm," so I think the answer is no, it is not big enough. N=8 is sufficiently small that the asymptotic complexity is irrelevant.
I think the logical way to look at algorithmic efficiency starts by picking a reasonable N. "Number of bits in a byte" is something that very rarely changes, and never reaches high values, so it makes a bad N. The flip side of this is that something like "number of bits in main memory" is very flexible and reaches extremely large numbers, so it shouldn't be a constant.
If I wasn't making a point about different methods to flip bits, and I was just naively classifying these byte flippers, I would probably call all of these O(1). Or perhaps O(N) where N is the size of input in bytes.
Great analysis, although I'm curious how the idea of doing a bunch of 64-bit ops to accomplish byte arithmetic came about to begin with - was the function in question not written by a firmware guy?
I think this one goes back to PDP days and wasn't necessarily written to be the fastest possible implementation. The 36-bit PDPs could do a 36x36 multiply into 72 bits. Not sure how the modulo operation performed, but there was a DIV instruction.
Or the programmer was a firmware hacker, and she knew that RBIT is ARMv6T2, which IIRC wasn't available until the iPhone 3GS. (Not 100% certain, and I don't have my manuals handy)
... in which case she would have used a lookup table, unless she was a really old-school firmware hacker and still believed that you couldn’t justify 256B for the table. (FWIW, you’re right about ARMv6T2).
I am. 256 bytes pain me :) (I've worked on systems with that much RAM total)
Kidding aside, I'd probably not go for the lookup table unless the whole thing was necessary in an inner loop - the cache miss cost is high. And since it's in a function, it better not be in an inner loop :)
Surely there is an argument that if it's called infrequently enough that cache misses on the LUT are a problem, then it's also called so infrequently that its performance is irrelevant.
256B is quite large for sure, but what about going for 2 nibble lookups in a 16 byte table? Or a 2 bit swap and a 6 bit lookup in a 64B table? (current x86-64 CPUs typically have 64B L1 cache lines?)
8 words to an L1 line on the original iPhone ARM, so yes, you'd fit it into a smaller table. You'll still face the memory latency issue if you're using this outside a tight loop. (Cache is only 4-way set associative. But at least I & D cache are separate)
But really, if you spend that much time thinking about the performance, you really shouldn't have that abstracted into a function. Calling that costs cycles and blows out one entry in your return stack - which makes a more costly mispredict later on likely.
Why yes, yes I do miss fiddling around with low level details. :)
If you're talking to a dev with a background in firmware, I'd say you'd be hard pressed to find anybody who'll put a single asm instruction into a separate function if it's speed-critical.
Sure, the compiler might (and probably will) inline, but it means any change in your tool chain or a whim of the inline heuristic can cause serious performance regressions that are completely avoidable.
When it comes to performance, the embedded mindset will always be "belt and suspenders" ;)
Expanding slightly, there’s a permutation that needs to happen in order to efficiently perform a DFT in-place (and the same approach is often used even when the transform is out-of-place). For power of two sizes (one of the most common cases), that permutation is precisely the same as a bit reversal of the indices.
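Concretely, for a power-of-two transform length N = 2^k, that permutation sends index i to the value of i's low k bits reversed; a minimal sketch:

```c
/* Bit-reversal permutation index for an FFT of length 2^k: reverse
 * the low k bits of i. Element i of the input belongs at position
 * bitrev(i, k) of the reordered array; the permutation is its own
 * inverse, so the same function maps in both directions. */
unsigned bitrev(unsigned i, unsigned k) {
    unsigned r = 0;
    for (unsigned b = 0; b < k; b++) {
        r = (r << 1) | (i & 1);   /* shift next low bit into the high end */
        i >>= 1;
    }
    return r;
}
```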
Besides the FFT addressing, I have encountered SPI devices that wanted to be addressed in LSb-first manner. Some micros let you choose which end goes out first, some don't. In the latter case you find yourself swapping bits in software.
Passwords for the RFB protocol (VNC) are used to DES-encrypt a challenge token sent by the server. However, each byte in the key is reversed before it is used for encryption.
Of course, this happens only once per connection so there's generally not much of a need for it to be particularly fast.
Endian-ness would be reversing bytes, not bits within bytes, like 0x1234 -> 0x3412. What we're talking about here would be more along the lines of: 0b0010001 -> 0b1000100
The most obvious application I can think of for reversing bits within a byte would be for image processing applications, such as mirroring an image horizontally, or making kaleidoscopes. There are probably signal processing applications, too...
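For instance, mirroring a row of a 1-bit-per-pixel image is byte-order reversal plus per-byte bit reversal; a hypothetical sketch (the function names are mine, and it assumes MSB-first pixel order and a row width that is a whole number of bytes):

```c
#include <stdint.h>
#include <stddef.h>

/* Simple shift-loop byte reversal; any of the reversal methods from
 * the article could be dropped in here instead. */
static uint8_t rev8(uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1));
        b >>= 1;
    }
    return r;
}

/* Mirror one scanline of a 1-bpp image in place: swap the bytes
 * end-for-end and bit-reverse each one. */
void mirror_row_1bpp(uint8_t *row, size_t nbytes) {
    if (nbytes == 0)
        return;
    for (size_t i = 0, j = nbytes - 1; i <= j && j < nbytes; i++, j--) {
        uint8_t a = rev8(row[i]);
        uint8_t b = rev8(row[j]);
        row[i] = b;
        row[j] = a;
    }
}
```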
It was at one time standard to store each bitplane separately. If you had a 4-color 320x200 image, for example, you'd have one page of VRAM that stored a 320x200 1-bit image, holding all the bit 0s, and another, exactly the same, holding all the bit 1s. And so on up to as many bit planes as required. (That was one usual arrangement, but there are other options - e.g., interleaved bitplanes and/or no video RAM as such.)
Separate bitplanes were very annoying in many respects, and the demise of the approach was probably regretted by only a few. But storing each bitplane separately does have one major advantage: you can just write all your algorithms to operate on 1-bit images, and they automatically run at any bit depth. Just run the routine once for each bitplane you're interested in processing.
(That may also mean the hardware is simpler to implement - one plausible excuse for its ubiquity. Not my field of expertise though...)
If you've ever dealt with graphics file manipulation code, chances are you've suffered the pain of changing the endian-ness of an image file. I never understood why some of these operations are not implemented as machine instructions that can run in one instruction cycle flat. There's nothing to them; I've done exactly that on FPGAs. Yes, they can be a little resource/routing intensive, but not that bad.
gcc has a __builtin_bswap16, __builtin_bswap32, and __builtin_bswap64 which will presumably take advantage of these built-in instructions on x86 and any other gcc-supported architectures where similar instructions exist (and fall back to a reasonably fast and well-tested multi-instruction implementation where they don't).
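A minimal sketch of those builtins in use (GCC/Clang extensions, not standard C; on x86 each compiles down to a single BSWAP, or a rotate for the 16-bit case):

```c
#include <stdint.h>

/* Thin wrappers around the GCC/Clang byte-swap builtins. */
uint16_t swap16(uint16_t x) { return __builtin_bswap16(x); }
uint32_t swap32(uint32_t x) { return __builtin_bswap32(x); }
uint64_t swap64(uint64_t x) { return __builtin_bswap64(x); }
```

Note these reverse bytes within a word, not bits within a byte, which is the distinction drawn in the sibling comment.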
You should really RTFM every couple years, just to know what your processor [1] and compiler [2] can do.
Note that this post is about changing the order of bits within a byte (eg. x86-64 XOP instruction VPPERM with bit reversal option). Whereas you are talking about changing the order of bytes within a word (x86 instruction BSWAP).
Depending on timing requirements, device type, operating speed and word width you have to add one or more layers of flip-flops to facilitate timing closure and avoid potential metastability issues.
Right, but that's true of all CPU instructions. If you already have an ALU capable of doing things like integer multiplication, would adding what is essentially a bunch of chained flip-flops really add much more complexity or resource usage?
Oh man, this is one of my nits. I've written code like this. For example, in some bit-counting code I have a block comment in front of it all with 57 non-blank lines. I have a copy of Hacker's Delight on my bookshelf, but will the person after me know what that code does and how it works? I really hope there was at least a comment before it pointing to one of the bit-hack web pages.
The difference here between the obvious method and the best method is 55ns. Is there a reason this problem deserves this much attention for so little a difference in time? (I realize it is 6.5x more, but if it isn't at the center of some core loop, the multiplicative factor doesn't really matter.) What use cases are there for this?
I suppose the point was to show that it's pretty bad to resort to copy/pasting clever bit hacks into libraries without taking care of how they work.
The fact that the code isn't necessarily obvious makes me think that whoever used it was hoping for an optimisation of sorts.
Terseness can lead to obfuscation, and that's the wrong sort of optimisation. So we can hope that the developer was going for speed instead, but the results show that was a huge failure.
Maybe this won't affect performance in this particular library, maybe it's called once or twice and it doesn't matter, but if this is part of the innards of a game or a cryptographic function or some low-level network stack, it could have very detrimental consequences on performance.
> That’s one mathematical operation, but a large number of CPU instructions. CPU instructions are what matter here, though, as we see, not as much as cache coherency.
I thought it was also a single CPU instruction, but multiple clock cycles.