
SHA-3 does seem to have relatively little to offer by way of incentives to switch. "Just as good" isn't motivation, and any notions of higher cryptographic strength haven't been extensively discussed. "Easier to implement in hardware" will be more compelling when such hardware exists.

I'm curious about the statement that SHA-3 is slow; it links to https://www.imperialviolet.org/2016/05/16/agility.html , which doesn't seem related, and matches the previous link. I wonder if that was supposed to link to somewhere else, like http://bench.cr.yp.to/results-sha3.html (as linked from https://www.imperialviolet.org/2012/10/21/nist.html )?

From that, SHA-3 certainly doesn't run significantly faster than alternatives (variants of BLAKE do indeed outperform it), but it seems roughly on par with SHA-256/SHA-512. But "on par" doesn't give any incentive to switch.

I wonder how much relative attention the SHA-3 winner (Keccak) gets compared to other alternatives, like BLAKE?



Gah, thank you. I did, indeed, mean to link to bench.cr.yp.to. Fixed.

(The BLAKE2 page (https://blake2.net/) has a graph too.)


> "Easier to implement in hardware" will be more compelling when such hardware exists.

Most things are software; from previous experience we know that it basically takes 10 to 15 years between the introduction of a primitive and widespread hardware support (AES and SHA-2 were both standardised around 2000; ISA extensions for mainstream CPUs arrived in 2010-2017). From the software PoV, SHA-3 would be a rather large regression in performance with a "maybe it's fast by 2030 if Intel is really nice" attached. BLAKE2, on the other hand, is an improvement in performance and a more modern function overall (e.g. it is not length-extensible, so keyed hashing has low overhead; it has flexible output sizes; and tree hashing is built in).
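The low-overhead keyed hashing is easy to see in practice. A minimal Python sketch using the standard hashlib (illustrative only, not tied to any benchmark in this thread):

```python
import hashlib
import hmac

key = b"secret key"
msg = b"message to authenticate"

# BLAKE2 accepts the key directly: no wrapper construction is needed,
# because BLAKE2 is not vulnerable to length extension.
tag_blake2 = hashlib.blake2b(msg, key=key, digest_size=32).hexdigest()

# SHA-256 is length-extensible, so a safe keyed hash needs the HMAC
# wrapper, which costs two extra compression-function passes.
tag_hmac = hmac.new(key, msg, hashlib.sha256).hexdigest()

print(tag_blake2)
print(tag_hmac)
```

Both produce a 256-bit tag; BLAKE2 just gets there with one pass over the message.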


Yes, perhaps before they choose the next "faster in hardware" crypto algorithm, they also get a commitment from the major chip makers (Intel, AMD, Qualcomm, MediaTek, etc) that they will implement them within let's say 2 years after the contest is over.

Otherwise it's just hope that the chip makers will implement it when the algorithms are chosen, and I'm not sure that's good enough after a carefully orchestrated 5-year process of choosing a next-generation algorithm.

If they can't get that commitment, then they should choose whatever is faster in software (granted all the other factors are more or less equal).


>I'm curious about the statement that SHA-3 is slow; [...] I wonder how much relative attention the SHA-3 winner (Keccak) gets compared to other alternatives, like BLAKE?

Coincidentally, I ran a bunch of hash performance benchmarks last week. These were my findings:

  test:  hash a 500MB block of memory.
  hardware:  Intel Core i7-5820K Haswell-E 6-Core 3.3GHz
  compiler:  MSVC2017 (19.10.25019), 32-bit exe:
    blake2sp - official reference code[1]     153MB/sec
    SHA3 - Keccak official reference code[2]   12MB/sec
    SHA3 - rhash sha3[3]                       45MB/sec
    SHA3 - Crypto++ library v5.6.5[4]          57MB/sec
    SHA256 - Crypto++                         181MB/sec
    SHA256 - MS Crypto API[5]                 113MB/sec
    SHA1 - MS Crypto API                      338MB/sec
    MD5 - Crypto++                            345MB/sec
    CRC32 - Crypto++                          323MB/sec
The conclusion is that the fastest SHA3 implementation (the Crypto++ lib with its assembly language optimizations) is roughly 3x slower than SHA256. I can't speak for SHA3 implemented in FPGA/ASIC, but as far as C++ compilation targeting x86 goes, it's slow. I've been meaning to try the Intel Compiler to see if it yields different results but haven't gotten around to it yet.

Blake2sp is fast. The official reference code is not quite as fast as the Crypto++ implementation of SHA256, but it's faster than Microsoft's Crypto API SHA256. (There are several variants of BLAKE, and I chose blake2sp because that's the algorithm WinRAR uses. I think the specific variant of BLAKE that directly competed with Keccak for NIST standardization is slower.)

[1] https://github.com/BLAKE2/BLAKE2

[2] http://keccak.noekeon.org/files.html

[3] https://github.com/rhash/RHash/blob/master/librhash/sha3.c

[4] https://www.cryptopp.com/

[5] https://msdn.microsoft.com/en-us/library/ms867086.aspx


I'll just say this - there are private implementations of Keccak that blow the pants off of those numbers.

Keccak is one of the primary functions in Ethash (Ethereum mining). It is heavily researched and completely destroys SHA2 performance-wise if you have the right implementation.

Also, Keccak doesn't require the construction of a key-schedule, and can be implemented much more elegantly in parallel (SIMD software) and in hardware than SHA2.


I guess 32-bit x86 performance is maybe not the best benchmark. I think people aren't optimizing for that ISA to the same extent as they are optimizing for x86-64, 64-bit ARMv8, or 32-bit ARM.

If you care about performance and you don't have dedicated SHA-256 instructions then on a 64-bit platform you should evaluate SHA-512 as it is much faster. If you only have 256 bits of storage available then truncate its output to 256 bits. IIRC, it's about 1GB/sec on my Haswell laptop.
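The truncation trick is a one-liner. A quick Python sketch (note that NIST also standardised SHA-512/256, which uses different initial values, so plain truncation is a distinct function):

```python
import hashlib

data = b"some long message" * 1000

# Hash with SHA-512 (fast on 64-bit CPUs), then keep only 256 bits.
digest256 = hashlib.sha512(data).digest()[:32]

# Caveat: this is not the same function as NIST's SHA-512/256, which
# uses different initial values for domain separation.
print(digest256.hex())
```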


>I guess 32-bit x86 performance is maybe not the best benchmark.

I compiled for 32-bit instead of 64-bit because I wanted the same executable to also run on a 32-bit MacBook. When Thomas Pornin ran benchmarks[1] in 2010 for both 32-bit and 64-bit, SHA256 hash performance didn't change as much as SHA512's did. I'll recompile for 64-bit and report back if there's a massive difference.

[1] https://stackoverflow.com/questions/2722943/is-calculating-a...


SHA-3 lanes are 64 bits wide. On a 64-bit architecture they fit in full registers, so I'd bet it would be considerably faster on 64-bit.


Mismatches can show an interesting property. It is likely that SHA-256 is slower than Keccak on 64-bit platforms, and that SHA-512 is slower than Keccak on 32-bit platforms.


What about cSHAKE, or better, KangarooTwelve? (K12 is to SHA-3 what BLAKE2 is to BLAKE.)


Hmm, something is wrong with your benchmark: the absolute values in MB/s are very low for i7, and relative too — blake2sp should be much faster than SHA256.


>blake2sp should be much faster than SHA256.

blake2sp is indeed faster than Microsoft's builtin Crypto API for SHA256. However, it is not as fast as Wei Dai's Crypto++ library implementation of SHA256 that has lots of hand tuned assembly language code.

The official C source code for blake2sp does not have assembly language primitives in it. It's very possible that if an assembly language expert wrote optimizations for blake2sp, it would beat Crypto++ SHA256 performance.

The code I used is really simple: the files "blake2sp-ref.c" and "blake2s-ref.c" from the BLAKE2 website. The hash code (no loops) is:

  blake2sp_state S[1];    // 1-element array holding the hash state
  blake2sp_init(S, BLAKE2S_OUTBYTES);
  blake2sp_update(S, BufInput, iBuffersize);
  blake2sp_final(S, hashval_blake2sp_bytes, BLAKE2S_OUTBYTES);
... where iBuffersize is 500MB, but I got the same results at larger buffer sizes (1GB+).

With that code, I'm guessing anyone can whip up a C++ project to benchmark blake2sp in 10 minutes. It would be interesting to see what MB/sec others achieve.
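For anyone who'd rather not set up a C++ project, a rough single-run sketch using Python's hashlib gives ballpark relative numbers (absolute speeds will of course differ from the C/asm figures above):

```python
import hashlib
import time

def bench(name, size=16 * 1024 * 1024):
    """Hash `size` bytes of zeros once and report a rough MB/s figure."""
    buf = b"\x00" * size
    h = hashlib.new(name)
    start = time.perf_counter()
    h.update(buf)
    h.digest()
    elapsed = time.perf_counter() - start
    print(f"{name:>8}: {size / elapsed / 1e6:7.1f} MB/s")

for algo in ("blake2s", "blake2b", "sha3_256", "sha256", "sha512", "md5"):
    bench(algo)
```

A single run like this is noisy; averaging several runs (and pinning the CPU frequency) gives more stable numbers.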


My pure JavaScript implementation of blake2s (which is approximately half the speed of parallelized sp variant) on 2.6 GHz Core i5 hashes at 170 MiB/s. JavaScript! Whatever you do with your benchmark, you're doing it wrong. Also, there is no reason to use such large buffer sizes, I suspect this only makes benchmarks more unreliable.

For real numbers on many platforms, see https://bench.cr.yp.to/results-hash.html (warning: very large page)

For long messages on 2015 Intel Core i5-6600; 4 x 3310MHz:

    blake2b: 3.33 cycles/byte
    blake2s (remember, half speed of sp): 4.87 cycles/byte
    sha512: 5.06 cycles/byte
    sha256: 7.63 cycles/byte
This means BLAKE2s is close to 700 MB/s, and BLAKE2sp will hash data at more than 1 GB/s.
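The cycles/byte figures convert to throughput by dividing the clock rate. A quick sanity check of the numbers above (3310 MHz taken from the benchmark machine listed):

```python
# Convert cycles/byte to MB/s at a fixed clock rate.
CLOCK_HZ = 3310e6  # 3310 MHz, the Core i5-6600 figure quoted above

def mb_per_s(cycles_per_byte):
    return CLOCK_HZ / cycles_per_byte / 1e6

for name, cpb in [("blake2b", 3.33), ("blake2s", 4.87),
                  ("sha512", 5.06), ("sha256", 7.63)]:
    print(f"{name}: {mb_per_s(cpb):.0f} MB/s")
```

BLAKE2s comes out around 680 MB/s, matching the "close to 700 MB/s" estimate.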

Here are the SHA-256 implementations measured: https://bench.cr.yp.to/impl-hash/sha256.html

As you can see, the winner varies by platform, but in most cases OpenSSL wins, and Wei Dai's implementation is close.


>My pure JavaScript implementation of blake2s (which is approximately half the speed of parallelized sp variant) on 2.6 GHz Core i5 hashes at 170 MiB/s.

If I recompile blake2sp with "/O2" optimization, it improves to 171MB/sec. I ran tests with "/Od" optimizations disabled because the default Crypto++ library project has optimizations disabled when it makes the lib file. Therefore, every hash had no optimizations to keep the comparisons apples to apples.

I see your Javascript code of BLAKE2 on github so I'll try it and compare. (https://github.com/dchest/blake2s-js)

>Also, there is no reason to use such large buffer sizes, I suspect this only makes benchmarks more unreliable.

I chose 500MB because I wanted the fastest hashes (CRC32, MD5) to take at least 1 second and many data sizes I want to hash will be 10GB+.


>If I recompile blake2sp with "/O2" optimization, it improves to 171MB/sec.

Too slow! You're doing something wrong or measuring some slow implementation :) It should be more than 500 MB/s.

>Therefore, every hash had no optimizations to keep the comparisons apples to apples.

That's not apples to apples at all. Most performant code is written specifically to be optimized by compiler. Use /O3 for benchmarking. Also, you just wrote that you were measuring a hand-optimized assembly version and then compared it to a C version compiled with optimization disabled? O_o

>I chose 500MB because I wanted the fastest hashes (CRC32, MD5) to take at least 1 second and many data sizes I want to hash will be 10GB+.

Then call the update function with a reasonable 8K buffer many times. Using such large buffers will generate a lot of noise in benchmarks.
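The chunked-update pattern looks like this; a small Python sketch (hashlib stands in here for whatever library is being benchmarked):

```python
import hashlib
import io

CHUNK = 8192  # a reasonable, cache-friendly buffer size

def hash_stream(f, name="blake2s"):
    """Feed a file-like object to the hash in 8K chunks, so the
    measurement covers hashing rather than bulk RAM traffic."""
    h = hashlib.new(name)
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()

# Chunked hashing yields the same digest as one-shot hashing.
data = b"a" * 100000
assert hash_stream(io.BytesIO(data)) == hashlib.blake2s(data).hexdigest()
```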

* * *

Speaking of JS, you can try my newer implementations/ports by cloning https://github.com/StableLib/stablelib: run ./scripts/build, cd into packages/blake2s (/sha3, /sha256, etc.), and run node lib/.bench.js. Note that SHA-3 (and SHA-512; BLAKE2b is not implemented yet) is slow in JS compared to SHA-256, BLAKE2s, etc., because it uses 64-bit words, which I have to emulate in JS with two 32-bit numbers for the low and high bits.

* *

Benchmarks on Intel Skylake:

https://blake2.net/skylake.png


>Too slow! You're doing something wrong

I do now notice that my ASUS motherboard monitor software is reporting that my CPU is at 1.2GHz instead of 3.3GHz. There's probably something wrong there. However, even if I get it up to 3.3GHz, the relative speeds between different benchmarks won't change. I got the same relative numbers on the Macbook.

>measuring a hand-optimized assembly version and then compared it to a C version compiled with optimization disabled? O_o

Because a lot of Wei Dai's code is C++ instead of asm. He delivered his MSVC project with optimizations disabled rather than "/O3", so it made the most sense to start with optimizations disabled everywhere as a preliminary benchmark. If I recompile Wei Dai's code with optimization, it will make Crypto++ perform faster and make blake2sp look slower. In the end, it's a moot point because blake2sp with "/O2" is still slower than Crypto++ SHA256.

>Using such large buffers will generate a lot of noise in benchmarks.

Why? If you study the blake2sp source code, you'll see a loop inside the hash update() function to handle arbitrary buffer sizes. Why does a loop outside that update() mean "less noise"? Why does adding more function calls (BUFSIZETOTAL divided by 8192 of them) equal less noise?


>If I recompile Wei Dai's code with optimization, it will make Crypto++ perform better and make blake2sp look slower.

Just do it. If it's slower, you're doing something wrong. Which implementation of blake2sp are you measuring? It should be at least 1.5x as fast.

>Why? If you study the source code, you'll see a loop inside the hash update() function. Why does a loop outside that update() mean "less noise"? Why does adding more function calls of BUFSIZETOTAL divided by BLOCK PROCESSING SIZE equal less noise?

Because you'll be measuring a lot of memory copying time apart from hashing time. Dealing with memory outside of CPU cache introduces a lot of variance.

I suggest that instead of our discussion you should try to reproduce the results of https://bench.cr.yp.to (which is a highly trusted source, e.g. it was used during SHA-3 competition by NIST and participants): if something doesn't approximately match it means you did something wrong. In the process, you'll learn how to properly benchmark hash functions and make some sciency science by reproducing results! :-)


I think I found the issue. In terms of absolute (not relative) MB/sec performance, the main culprit is the CPU running at 1.2GHz single-threaded instead of 3.3GHz. The other investigations into "/O2 /O3" optimizations and the 8k outer loop were red herrings.

However, for the relative MB/sec comparison to SHA256, it seems to point back to the blake2 official reference code (non-SSE) being very slow. Wei Dai's Crypto++ also happens to include the BLAKE2 algorithm, and when I ran that, it hit 525MB/sec, faster than SHA256 and also faster than SHA1. No outer 8k chunk loop was necessary for the Crypto++ benchmark.

>Because you'll be measuring a lot of memory copying time apart from hashing time.

Yes, I notice the numerous memcpy() functions in the blake2s?-ref.c. For additional tests, I rewrote the loop to call update() on chunks and tried various sizes (8k, 16k, 32k, ... 256k, 512k, 1MB). At 256k chunks and below, I got 235MB/sec which was an improvement but still slower than SHA256. As stated above, the real key was to use an optimized BLAKE2 algorithm instead of the official reference code.

>you should try to reproduce the results of https://bench.cr.yp.to

I can't tell if the blake2 entries in https://bench.cr.yp.to are using official reference or optimized code so trying to replicate those results with official reference files may be a wild goose chase.


If your machine is throttling the CPU during benchmarks, all bets are off. You need to disable all power management and energy savings or else figure out why your machine is causing a throttle to kick in (likely thermal management). You can't compare any results (even relative to one another) until that issue is resolved.


The implementations measured are listed here:

https://bench.cr.yp.to/impl-hash/blake2s.html

Code mirror: https://github.com/floodyberry/supercop

Yeah, of course the reference implementation is a lot slower — it's meant as an algorithm reference, so the code is readable rather than fast. It can be greatly improved just by unrolling loops and inlining the message indexing by the sigma constants, even without SSE (like my JavaScript implementation).

As for memory copying — I didn't mean memcpy(), what I meant is that your CPU would have to get chunks of your huge buffer from RAM, which is slower and has more unpredictable performance.


>Yeah, of course reference implementation is a lot slower — it's for algorithm reference, so has readable code, which is slow.

Exactly! That's why my original post had the footnote that I was using the slower blake2sp reference code. Same situation as the SHA-3 reference code being the slowest implementation.

>As for memory copying — I didn't mean memcpy(), what I meant is that your CPU would have to get chunks of your huge buffer from RAM, which is slower and has more unpredictable performance.

But this observation also applies to all the other hash performance tests. If the blake2sp hash is handicapped by computing large RAM buffers beyond the L1/L2/L3 caches, the MD5/SHA1/SHA256/etc are also handicapped the same way. Whatever "noise" exists in the tests can be evened out by multiple executions.

Restricting the tests to tiny memory sizes that fit in L1/L2/L3 is not realistic for my purposes.


There is AVX/AVX2 assembly for BLAKE2 at https://github.com/minio/blake2b-simd , benchmarked at a 1.8x-3.9x speedup.


It'd be interesting to see what SHA2-512 would do, since I think modern Intel CPUs prefer that size to 256. (I could be wrong, though, as I have no source to cite; it's just something I remember reading.)


You're not wrong. Type this into Terminal:

     openssl speed sha512 sha256


tl;dr sha512 outperforms sha256 as message size increases.

pastebin output for lack of decent formatting:

https://pastebin.com/zqgtTKhm

On a Linode system with 2 Xeon E5-2680v3 cores @ 2.5GHz and 4GB of RAM, with this version and config of OpenSSL on Ubuntu Server 17.04:

OpenSSL 1.0.2g 1 Mar 2016 built on: reproducible build, date unspecified options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) blowfish(idx) compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -g -O2 -fdebug-prefix-map=/build/openssl-p_sOry/openssl-1.0.2g=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM


You might also want to test BLAKE2bp, which is optimized for 64-bit platforms (BLAKE2sp targets 32-bit platforms), so the BLAKE2 numbers will be even higher.


I am quite surprised that CRC32 is slower than MD5.

> I think the specific variant of BLAKE that directly competed with Keccak is slower

BLAKE is slower than BLAKE2 indeed. BLAKE also has more rounds than BLAKE2.


CRC32 is slower because implementing it in software in a way that exploits a modern CPU's instruction-level parallelism is non-trivial; also, the obvious way to implement CRC32 on a 32-bit byte-oriented CPU isn't exactly cache-friendly.


Are you referring to Intel's Slicing-by-8 algorithm?[0]

[0] https://static.aminer.org/pdf/PDF/000/432/446/a_systematic_a...


Daemen talked about some figures for KangarooTwelve during RWC 2017:

* Intel Core i5-4570 (Haswell) 4.15 c/b for short input and 1.44 c/b for long input

* Intel Core i5-6500 (Skylake) 3.72 c/b for short input and 1.22 c/b for long input

* Intel Xeon Phi 7250 (Knights Landing) 4.56 c/b for short input and 0.74 c/b for long input

https://youtu.be/F_nAGS9L97c?t=2451


I recently came to believe (correct me if I'm wrong) that the recommended XOFs SHAKE128 and SHAKE256 are actually nice and fast, but often when people talk about SHA-3 they focus on the slower "drop-in replacements" like SHA3-256 which are quite conservative.


Yes, SHAKE is faster than SHA-3 because of "better" parameter choices.
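The XOF property is easy to see with Python's hashlib: SHAKE takes the output length as a parameter, and shorter outputs are prefixes of longer ones. (SHAKE128's capacity is 256 bits versus SHA3-256's 512, hence the higher rate and better speed.)

```python
import hashlib

x = hashlib.shake_128(b"some input")

# Choose the digest length at call time; a shorter output is simply
# a prefix of a longer one for the same input.
short = x.hexdigest(16)   # 16 bytes
long_ = x.hexdigest(32)   # 32 bytes

print(short)
print(long_)
assert long_.startswith(short)
```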


This page has an interesting speed graph: http://kangarootwelve.org/



