The goals sound similar to Apple's LZFSE (see https://github.com/lzfse/lzfse for...

jdcarter · on Aug 31, 2016

That was my first thought, too. I installed and ran both against a tar'd set of PDF files totaling 435MB in size. My timings:

    lzfse  45 MB/s encode, 229 MB/s decode, 1.12 comp ratio
    zstd  181 MB/s encode, 713 MB/s decode, 1.13 comp ratio

The numbers are so dramatically different that I ran several different tests, but those showed roughly the same results. I used default command-line options for both tools, and both produced very similar compression ratios.

Note that LZFSE has a somewhat different goal, however: it's designed to be the most power-efficient compression algorithm out there, in other words on mobile devices LZFSE optimizes for bytes-per-watt rather than bytes-per-second. Zstandard, on the other hand, runs multiple pipelines and such--it's banking on having a server-class processor to run on.

Edit: hardware is a 2013 MacBook Pro, pretty fast flash storage, and 2 cores/4 threads. I warmed cache before each run and sent output to /dev/null, so the numbers above are best-case.
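
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the timing loop. It is an assumption-laden stand-in, not what I ran: it uses Python's standard-library zlib as the codec (neither the lzfse nor the zstd tool is invoked), and `measure` and the sample payload are hypothetical names made up for illustration. Both throughput figures are computed against the uncompressed size.

```python
import time
import zlib

def measure(data, compress, decompress):
    """Return (encode MB/s, decode MB/s, compression ratio) for one codec."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    assert unpacked == data  # sanity check: the round trip must be lossless
    mb = len(data) / 1e6  # uncompressed size in MB
    # Guard against a zero elapsed time on very small inputs.
    enc = mb / max(t1 - t0, 1e-9)
    dec = mb / max(t2 - t1, 1e-9)
    return enc, dec, len(data) / len(packed)

# Hypothetical payload; substitute the bytes of a real corpus (e.g. a tar file).
sample = b"some moderately repetitive test data " * 50000
enc, dec, ratio = measure(sample, zlib.compress, zlib.decompress)
print(f"zlib  {enc:.0f} MB/s encode, {dec:.0f} MB/s decode, {ratio:.2f} comp ratio")
```

Swapping in another codec only requires passing a different pair of compress/decompress callables; the absolute numbers will of course depend on the machine and the data.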

usefulcat · on Aug 31, 2016

"Note that LZFSE has a somewhat different goal, however: it's designed to be the most power-efficient compression algorithm out there, in other words on mobile devices LZFSE optimizes for bytes-per-watt rather than bytes-per-second."

Sure, but is that even relevant? I mean, is there any way that lzfse could possibly be more power-efficient per byte than zstd when zstd is 3-4 times faster for the same compression ratio? According to the docs zstd doesn't have any support for multiple threads right now, so it should be a fair comparison.

glhaynes · on Aug 31, 2016

I use "fastest to complete == least power usage" as a rule of thumb because of "race to sleep". I suppose that might be thrown off by power usage characteristics varying based on number of cores working? How does one even begin to write code that prioritizes power-efficiency over performance?

to3m · on Sept 1, 2016

For short runs, any fixed-cost warmup/cooldown periods might dominate - a latency/throughput tradeoff kind of affair. (I am just shooting my mouth off and have no idea whether that's the case here.)
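
To make the fixed-cost idea concrete, here is a toy model. Every number in it is invented for illustration and nothing is measured on real hardware; `total_energy_j` is a hypothetical helper. It assumes total energy is a fixed wake-up/wind-down cost plus active power times run time.

```python
def total_energy_j(work_mb, throughput_mb_s, active_w, fixed_j):
    """Energy in joules for one job: fixed wake/sleep cost plus active power * run time."""
    run_s = work_mb / throughput_mb_s
    return fixed_j + active_w * run_s

FIXED_J = 0.5  # assumed fixed cost per job, in joules (made-up value)

# Two hypothetical codecs: "fast" is 4x quicker but draws 2x the power.
for work_mb in (0.1, 1000):
    fast = total_energy_j(work_mb, throughput_mb_s=200, active_w=2.0, fixed_j=FIXED_J)
    slow = total_energy_j(work_mb, throughput_mb_s=50, active_w=1.0, fixed_j=FIXED_J)
    print(f"{work_mb} MB: fast={fast:.4f} J, slow={slow:.4f} J, "
          f"fixed share of fast={FIXED_J / fast:.1%}")
```

Under these assumptions the two codecs are nearly indistinguishable on the tiny job, because the fixed cost swamps everything else, while on the large job the faster codec's advantage shows up clearly.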

As for how to do it, I'm not sure... but if I were a valued customer of various CPU suppliers, and were famous for the depth of my pockets, I'm sure I'd be able to find somebody to explain it to me ;)

Someone · on Aug 31, 2016

Can you repeat those tests using the library shipping with Mac OS X?

I'm asking because Apple claims the github code is a reference implementation. Those typically aren't tuned for speed or efficiency.

pbarnes_1 · on Aug 31, 2016

Fast == power efficient, though.

Not surprising. Apple-originated tech is always kind of middle of the road.

Jerry2 · on Aug 31, 2016

Apple's goals were also to have a low-energy de/compressor suitable for mobile. I'd love to see some comparisons of the two of them running on ARM.

MichaelGG · on Aug 31, 2016

Is low energy significantly different from high performance code? I thought the general advice was make the code as fast as possible so the work can be finished and the CPU slow/power down again.

Are there cases where an algorithm takes 10x the wall clock time to execute, but actually uses less energy on the same chip?

(Memory use/access is the main thing I guess that could be different.)

bluedino · on Aug 31, 2016

>> Is low energy significantly different from high performance code?

It can be. Certain operations are more power efficient than others - subtraction and then a check for negative value instead of comparison, certain vector operations, loop unrolling, using less memory/bus traffic...

>> Are there cases where an algorithm takes 10x the wall clock time to execute, but actually uses less energy on the same chip?

Slower code can be more power efficient. You're just tuning for different results.

MichaelGG · on Sept 1, 2016

Interesting. Do you know of any open source examples offhand?

I'd imagine this also requires very detailed knowledge of the chip, microcode, etc. Probably hard for x86 but I guess this kinda stuff would be on ARM where the programmer has deep vertical access (like as mentioned, Apple).

mark-r · on Aug 31, 2016

I'd love to see that too. Remember that if the processor uses 2x the power but gets done 3x faster, it's still 1.5x more efficient overall.
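
Spelled out as arithmetic: energy is power times time, so the efficiency gain is the speedup divided by the power increase. A minimal sketch (the function name `relative_efficiency` is just an illustration):

```python
def relative_efficiency(power_ratio, speedup):
    """Work done per joule, relative to the baseline.

    Energy = power * time, and time scales as 1 / speedup, so the
    energy ratio is power_ratio / speedup and efficiency is its inverse.
    """
    return speedup / power_ratio

# 2x the power, done 3x faster:
print(relative_efficiency(power_ratio=2.0, speedup=3.0))  # 1.5
```

By the same arithmetic, a codec that is about 4x slower would have to draw less than roughly a quarter of the power to come out ahead per byte.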

tmd83 · on Aug 31, 2016

That's what I was thinking. I haven't found any validation or even the rationale behind lzfse's supposed lower power usage. I can think of two things.

1. Apple tried to write a fast version of LZ4 with reasonably good compression, thus improving power usage, and created LZFSE since nothing like it existed; it has since been beaten out handsomely by Zstandard.

2. Following a parent comment, Zstandard might do some things that depend on a highly OoO CPU with lots of cache and an extremely good branch predictor, and those could be significantly slower on ARM, even on Apple's chips, despite how good they are. Or LZFSE could still be slower, but on ARM the gap might not be as big and the decision not as cut and dried as it seems now.

Would love to know what the actual case is from someone involved in LZFSE.