
I hate base64. If you have to embed binary data within human-readable text, then you’re either

A: doing something horribly wrong

B: or forced to use a monkey-patched legacy standard

Otherwise just use hexadecimal digits. At least it will be somewhat readable.

I know, I know... Base64 encodes 3 bytes on 4 characters, while hex encodes 2 bytes on 4. Big fucking deal. It will almost surely be gzipped down the line, and if you’re storing amounts of data where this would count, then see point “A” above.
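For concreteness, the ratio being argued about (a quick stdlib check; the 300-byte size is arbitrary):

```python
# base64: 4 chars per 3 bytes; hex: 2 chars per byte
import base64, binascii, os

data = os.urandom(300)
print(len(base64.b64encode(data)))   # 400
print(len(binascii.hexlify(data)))   # 600
```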

I hope one day we will laugh at the mere idea of base64 encoding like how we do with EBCDIC or Windows-12XX codepages.



Base64 has been around forever, and hex has significantly lower information density: base64 output is consistently more than 30% smaller.

As for doing something horribly wrong, I dunno, ensuring literally billions of moms and pops can exchange baby pictures by e-mail, for the first time in human history, and doing so since the time when we were using 33.6kbit modems, that seems like a resounding success.†

† Yes, if e-mail had not paid such close attention to compatibility, and just ripped up every old standard on every chance it got (i.e. pretty much like any technology invented since the commoditization of the web), interoperability would never have happened. Legacy crap is fundamentally a part of that success, and Base64 is fundamentally a part of making those hacks practical for the technology we had at the time††

†† To head off an obvious counterargument: yes, there are contemporary examples of why this strategy is still important (it's, so far, the only one that has ever worked in the decentralized environment of the Internet), and is very likely to come in useful in future, time and time again


Hex has lower information density, but it is more compressible, particularly if your data is byte-aligned. For example, the string "aaaaaa" is base64-encoded as "YWFhYWFh" but in hex is "616161616161". On a project I'm on, we switched from base64 to hex for some binary data embedded in JSON and saw a ~20% size reduction, since it's always compressed by the web server.


Today I learned!

    >>> import random,zlib,codecs
    >>> sample = random.getrandbits(1024).to_bytes(1024, 'big') * 1024
    >>> len(zlib.compress(codecs.encode(sample, 'base64')))
    22083
    >>> len(zlib.compress(codecs.encode(sample, 'hex')))
    4326
    >>>
edit: (25 minutes of digging around the Internet, and I find this):

    >>> sample = os.urandom(1024) * 1024

    >>> len(sample.encode('base64'))
    1416501
    >>> len(sample.encode('hex'))
    2097152
    >>> len(sample.encode('ascii85'))
    1310720

    >>> len(zlib.compress(sample.encode('base64')))
    82169
    >>> len(zlib.compress(sample.encode('hex')))
    12537
    >>> len(zlib.compress(sample.encode('ascii85')))
    8212
Had to use Python 2.x to make use of the 'hackercodecs' package that implements ascii85, but it looks like it's the best of both worlds, assuming its character set suits whatever medium you are transferring over, and decoding it on the other end doesn't require some slow code.

final edit: I'm guessing it was an accidental trick of the repetitive input data lining up well. On real data hex still wins out (which should have been obvious in hindsight):

    >>> sample = ''.join(sorted(open('/usr/share/dict/words').readlines()))

    >>> len(zlib.compress(sample.encode('ascii85')))
    1320809
    >>> len(zlib.compress(sample.encode('base64')))
    1116678
    >>> len(zlib.compress(sample.encode('hex')))
    880651


Both hex and ascii85 will align at 1024-byte boundaries; base64 will align at 1024*3 bytes, so you'll end up with a symbol stream something like this:

symbol_1, symbol_2, symbol_3, ptr_to_1, ptr_to_2,...

ascii85 performs better than hex simply because its dictionary is shorter.
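A minimal sketch of the alignment point (Python 3, synthetic repeating input):

```python
# The input repeats every 1024 bytes. Hex maps each byte independently, so
# the encoded text repeats at the same period; base64 works in 3-byte
# groups, so its text only repeats every lcm(1024, 3) = 3072 bytes.
import base64, binascii

block = bytes(range(256)) * 4     # one 1024-byte unit
sample = block * 12               # repeating input

hex_enc = binascii.hexlify(sample)
b64_enc = base64.b64encode(sample)

print(hex_enc[:2048] == hex_enc[2048:4096])   # True: repeats per block
print(b64_enc[:1365] == b64_enc[1365:2730])   # False: misaligned at one block
print(b64_enc[:4096] == b64_enc[4096:8192])   # True: repeats per 3 blocks
```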

If you compress something with a bit more structure, like "Hamlet", you'll get this:

    >>> import urllib.request,zlib,codecs,base64
    >>> hamlet = urllib.request.urlopen("http://www.gutenberg.org/cache/epub/2265/pg2265.txt").read()
    >>> len(codecs.encode(hamlet, 'base64'))
    248577
    >>> len(codecs.encode(hamlet, 'hex'))
    368018
    >>> len(base64.a85encode(hamlet))
    230012
    >>> len(zlib.compress(codecs.encode(hamlet, 'base64')))
    102788
    >>> len(zlib.compress(codecs.encode(hamlet, 'hex')))
    88827
    >>> len(zlib.compress(base64.a85encode(hamlet)))
    121364


FWIW, Python (≥ 3.4) has an ascii85 implementation in the standard library:

    >>> import base64
    >>> base64.a85encode(b'spam')
    b'F)YQ)'
https://docs.python.org/3/library/base64.html#base64.a85enco...


Huh, my results are drastically different:

    $ dd if=/dev/urandom bs=512 count=2048 | base64 | gzip | wc
    ...
    4088   31928 1059085

    $ dd if=/dev/urandom bs=512 count=2048 | xxd -p | gzip | wc
    ...
    5019   33798 1231268
1 megabyte of random data consistently results in ~1 megabyte of compressed base64 text, ~1.2 megabytes of compressed hex.


Repeating the exercise using a photograph (https://imgur.com/B4tqkrZ):

    $ pv lock-your-screen.png | base64 | gzip | wc
     2.1MiB 0:00:00 [  13MiB/s] [================================>] 100%
       8641   48904 2264151


    $ pv lock-your-screen.png | xxd -p | gzip | wc
     2.1MiB 0:00:00 [4.41MiB/s] [================================>] 100%
      10109   49956 2573293

See the same against the jpg version (I've no idea why I have kept both a jpg and png of the same image around, especially given jpg is the much better-suited format):

    $ pv lock-your-screen.jpg | base64 | gzip | wc
     377KiB 0:00:00 [19.2MiB/s] [================================>] 100%
       1420    8373  392796


    $ pv lock-your-screen.jpg | xxd -p | gzip | wc
     377KiB 0:00:00 [9.36MiB/s] [================================>] 100%
       1487    8935  441077
In both cases base64 is both faster and compresses smaller.


This test isn't very informative because both .png and .jpg are already compressed formats, with "better than gzip" strength, so gzip/deflate isn't going to be able to compress the underlying data.

You only see some compression because gzip is just backing out some of the redundancy added by the hex or base64 encoding, and the way the huffman coding works favors base64 slightly.

Try with uncompressed data and you'll get a different result.
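That effect is easy to sanity-check with random bytes standing in for already-compressed data:

```python
# zlib on base64 of incompressible data mostly just backs out the
# 8-bits-in-6 redundancy: the result lands between the raw size and the
# encoded size, never below the raw size.
import base64, os, zlib

raw = os.urandom(1 << 16)             # stand-in for pre-compressed data
b64 = base64.b64encode(raw)

recompressed = len(zlib.compress(b64))
print(len(raw), recompressed, len(b64))
assert len(raw) < recompressed < len(b64)
```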

Your speed comparison seems disingenuous: you are benchmarking "xxd", a generalized hex dump tool, against base64, a dedicated base-64 utility. I wouldn't expect their speeds to have any interesting relationship with the best possible speed of a tuned algorithm.

There is little doubt that base-16 encoding is going to be very fast, and trivially vectorizable (in a much simpler way than base-64).


> both .png and .jpg are already compressed formats, with "better than gzip" strength

FWIW, PNG and gzip both use the DEFLATE algorithm, so I wouldn't call PNG's compression "better than gzip".

Source: https://en.wikipedia.org/wiki/DEFLATE

> This has led to its widespread use, for example in gzip compressed files, PNG image files and the ZIP file format for which Katz originally designed it.


Like any domain-specific algorithms, PNG uses deflate as a final step after using image-specific filters which take advantage of typical 2D features. So in general png will do much better than gzip on image data, but it will generally always do at least as well (perhaps I should have said that originally). In particular, the worst-case .png compression (e.g., if you pass it random data, or text or something) is to use the "no op" filter followed by the usual deflate compression, which will end up right around plain gzip.

Now "at least as good" is enough for my point: by compressing a .png file with gzip you aren't going to see additional compression in general. When compressing a base-64 or hex encoded .png file, the additional compression you see is largely just a result of removing the redundancy of the encoding, not any compression of the underlying image.
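A rough illustration of why the filtering step matters (a Sub-style delta filter on synthetic smooth data; not actual PNG code):

```python
# Smooth data (a bounded random walk) barely compresses raw, but its
# byte-to-byte deltas come from a tiny alphabet and deflate very well.
import random, zlib

random.seed(0)
vals = [128]
for _ in range(65535):
    vals.append((vals[-1] + random.randint(-2, 2)) % 256)
image = bytes(vals)

# PNG's "Sub" filter: replace each byte with its delta from the previous one
filtered = bytes((image[i] - image[i - 1]) % 256 for i in range(1, len(image)))

print(len(zlib.compress(image)), len(zlib.compress(filtered)))
assert len(zlib.compress(filtered)) < len(zlib.compress(image))
```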


Ooops, that should read "Like many domain-specific algorithms" not "Like any ..."


My data wasn't quite random! It repeats every 1kb, much smaller than zlib's 32kb window size.


A good rule of thumb might be that if your results show you are able to consistently compress supposedly random data to less than the size required for just the random binary bits, you should either recheck your numbers, verify your random number generator, or quickly file for a patent!


You should add googling “Shannon” “entropy” and “information theory” to that list ;-)


Try with (much) more data, theoretically they should even out at some point.


If you’re compressing it anyway why turn it into base64 or ascii hex? The compressed data will be binary anyway, so just compress the input data directly.


One example is JSON/XML/HTML responses over HTTP: you need the 8-bit ("binary") data in an ASCII format to fit into JSON/XML/HTML, while HTTP provides gzip (or DEFLATE, Brotli, etc.) compression over the whole response if both the client (by including the "Accept-Encoding" header in the request) and the server implementations support it.
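A minimal sketch of that pattern (gzip here standing in for HTTP's Content-Encoding; the "points" field name is made up):

```python
# Binary data must be ASCII-encoded to sit inside JSON; the transport then
# compresses the whole response.
import base64, gzip, json, os

points = os.urandom(3 * 1024)  # stand-in for a packed typed array

body = json.dumps({"points": base64.b64encode(points).decode("ascii")})
wire = gzip.compress(body.encode("ascii"))   # what a gzip-enabled server sends

decoded = base64.b64decode(json.loads(gzip.decompress(wire))["points"])
assert decoded == points
```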


Correct, we're sending data points as typed arrays inside a much larger JSON payload.


Most of the modern Web is unusable without a highly optimized JIT-compiling JS interpreter. Are you sure about the importance of interoperability & compatibility?


I'm not sure why you believe that is a counterexample - it is anything but. That interpreter can still run code written against the original DOM APIs in the original syntax, as defined 22 years ago - around the time the <noscript> tag appeared in HTML!

Of course if mom and pop were using UUCP and an ADM-3A to fetch their mail, they couldn't view the images -- but all the infrastructure that allowed them to read mail at all could be repurposed without change, possibly including their modem (thanks, RS-232, introduced 1960), if they bought an 80386 desktop with a VGA card.

These are all interoperability success stories! Meanwhile, I need 19 different IM apps on my phone to talk to 19 different people. That's what happens when you ignore interoperability.


> Meanwhile, I need 19 different IM apps on my phone to talk to 19 different people. That's what happens when you ignore interoperability

Interop was not ignored, it was willfully cast aside. That's the world of walled gardens we live in today... federation makes it possible for users to switch outside your walled garden.


Yes, HTTP and all of the layers below are quite fundamental to the success of the WWW. Interoperability breeds competition and real innovation because nobody can lean on a network effect the way modern walled gardens do.


Not everything goes over the web and is gzipped.

I've been working on a project lately with 1200bps radio links that are making REST calls on the other end and 2 vs 3 bytes makes a massive difference.


And why don’t you just send it in binary? REST doesn’t mean you have to use JSON for the HTTP body.


Yeah, that would be ideal.

Some of the protocols we talk over (APRS) are ASCII-only (which is a bit of point B from above).

Transmissions are also signed at the radio so that they can be authenticated down the line which means that you can't easily change the payload envelope once it leaves the radio.

Our main goal is adoption, and talking base64 means that lots of APIs that would otherwise need an intermediate server to do the translation can be used directly. This lowers the barrier to use and means we have a larger surface area of use cases.


It sounds like her use case is badly suited for raw binary transmission due to the high possibility of data loss, much like the internet used to be when base64 was originally introduced.


Base64 on the web is such an unbelievably terrible solution to such a trivial problem that it beggars belief. The idea that parsing CSS somehow blocks on decompressing string delimited, massively inefficient, nonlinearly encoded blobs of potentially incompressible data is insane.

We've taken the most obvious latency path a normal user sees and somehow decided that mess was better than sending an ar file.

(Not that this wasn't inevitable as soon as someone decided CSS should be a text format... sigh)


Why should parsing CSS block on decompressing base64, unless the CSS itself is a base64-encoded data: URI?

If you're talking about data: background images in CSS, then all the CSS parser has to do is find the end of the url(). It doesn't have to do any base64 decoding or anything like that until the rule actually matches.


You have to gz-decompress it, which is a harder job, and the only reason you're doing that (at least for images) is to undo the inefficiency you just added.


Ah, gz-decompress the stylesheet itself, ok.

For what it's worth, in my profiles of pageloads in at least Firefox I haven't seen gz-decompression of stylesheets show up in any noticeable way, but I can believe that it could be a problem if you have a lot of data: images in the sheet...


I once came across a site that was base64-encoding GET params for an incredibly insecure HTTP route.

E.g., the request was something like http://example.com/bad/?v=YT0xMjMmYj00NTY= which they would then decode on the server to a=123&b=456
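(The encoding is trivially reversible, which is the problem:)

```python
# Decoding the commenter's example value shows the real parameters in full.
import base64

print(base64.b64decode("YT0xMjMmYj00NTY="))   # b'a=123&b=456'
```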



