
I hate base64. If you have to embed binary data within human-readable text, then you’re either

A: doing something horribly wrong

B: or forced to use a monkey-patched legacy standard

Otherwise just use hexadecimal digits. At least it will be somewhat readable.

I know, I know... Base64 encodes 3 bytes on 4 characters, while hex encodes 2 bytes on 4. Big fucking deal. It will almost surely be gzipped down the line, and if you’re storing amounts of data where this would count, then see point “A” above.
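For concreteness, the ratio being argued about (a quick stdlib check; the 300-byte size is arbitrary):

```python
# base64: 4 chars per 3 bytes; hex: 2 chars per byte
import base64, binascii, os

data = os.urandom(300)
print(len(base64.b64encode(data)))   # 400
print(len(binascii.hexlify(data)))   # 600
```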

I hope one day we will laugh at the mere idea of base64 encoding like how we do with EBCDIC or Windows-12XX codepages.



Base64 has been around forever, and hex has significantly lower information density: base64 output is consistently more than 30% smaller.

As for doing something horribly wrong, I dunno, ensuring literally billions of moms and pops can exchange baby pictures by e-mail, for the first time in human history, and doing so since the time when we were using 33.6kbit modems, that seems like a resounding success.†

† Yes, if e-mail had not paid such close attention to compatibility, and just ripped up every old standard on every chance it got (i.e. pretty much like any technology invented since the commoditization of the web), interoperability would never have happened. Legacy crap is fundamentally a part of that success, and Base64 is fundamentally a part of making those hacks practical for the technology we had at the time††

†† To head off an obvious counterargument: yes, there are contemporary examples of why this strategy is still important (it's, so far, the only one that has ever worked in the decentralized environment of the Internet), and is very likely to come in useful in future, time and time again


Hex has lower information density, but it is more compressible, particularly if your data is byte-aligned. For example, the string "aaaaaa" is base64-encoded as "YWFhYWFh" but in hex is "616161616161". On a project I'm on, we switched from base64 to hex for some binary data embedded in JSON and saw a ~20% size reduction, since it's always compressed by the web server.


Today I learned!

    >>> import random,zlib,codecs
    >>> sample = random.getrandbits(1024).to_bytes(1024, 'big') * 1024
    >>> len(zlib.compress(codecs.encode(sample, 'base64')))
    22083
    >>> len(zlib.compress(codecs.encode(sample, 'hex')))
    4326
    >>>
edit: (25 minutes of digging around the Internet, and I find this):

    >>> sample = os.urandom(1024) * 1024

    >>> len(sample.encode('base64'))
    1416501
    >>> len(sample.encode('hex'))
    2097152
    >>> len(sample.encode('ascii85'))
    1310720

    >>> len(zlib.compress(sample.encode('base64')))
    82169
    >>> len(zlib.compress(sample.encode('hex')))
    12537
    >>> len(zlib.compress(sample.encode('ascii85')))
    8212
Had to use Python 2.x to make use of the 'hackercodecs' package that implements ascii85, but it looks like it's the best of both worlds, assuming its character set suits whatever medium you are transferring over, and decoding it on the other end doesn't require some slow code.

final edit: I'm guessing it was an accidental trick of the repetitive input data lining up well. On real data hex still wins out (which should have been obvious in hindsight):

    >>> sample = ''.join(sorted(open('/usr/share/dict/words').readlines()))

    >>> len(zlib.compress(sample.encode('ascii85')))
    1320809
    >>> len(zlib.compress(sample.encode('base64')))
    1116678
    >>> len(zlib.compress(sample.encode('hex')))
    880651


Both hex and ascii85 will align at 1024-byte boundaries; base64 will align at 1024*3 bytes, so you'll end up with a symbol stream something like this:

symbol_1, symbol_2, symbol_3, ptr_to_1, ptr_to_2,...

ascii85 performs better than hex simply because its dictionary is shorter.
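A minimal sketch of the alignment point (Python 3, synthetic repeating input):

```python
# The input repeats every 1024 bytes. Hex maps each byte independently, so
# the encoded text repeats at the same period; base64 works in 3-byte
# groups, so its text only repeats every lcm(1024, 3) = 3072 bytes.
import base64, binascii

block = bytes(range(256)) * 4     # one 1024-byte unit
sample = block * 12               # repeating input

hex_enc = binascii.hexlify(sample)
b64_enc = base64.b64encode(sample)

print(hex_enc[:2048] == hex_enc[2048:4096])   # True: repeats per block
print(b64_enc[:1365] == b64_enc[1365:2730])   # False: misaligned at one block
print(b64_enc[:4096] == b64_enc[4096:8192])   # True: repeats per 3 blocks
```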

If you compress something with a bit more structure, like "Hamlet", you'll get this:

    >>> import urllib.request,zlib,codecs,base64
    >>> hamlet = urllib.request.urlopen("http://www.gutenberg.org/cache/epub/2265/pg2265.txt").read()
    >>> len(codecs.encode(hamlet, 'base64'))
    248577
    >>> len(codecs.encode(hamlet, 'hex'))
    368018
    >>> len(base64.a85encode(hamlet))
    230012
    >>> len(zlib.compress(codecs.encode(hamlet, 'base64')))
    102788
    >>> len(zlib.compress(codecs.encode(hamlet, 'hex')))
    88827
    >>> len(zlib.compress(base64.a85encode(hamlet)))
    121364


FWIW, Python (≥ 3.4) has an ascii85 implementation in the standard library:

    >>> import base64
    >>> base64.a85encode(b'spam')
    b'F)YQ)'
https://docs.python.org/3/library/base64.html#base64.a85enco...


Huh, my results are drastically different:

    $ dd if=/dev/urandom bs=512 count=2048 | base64 | gzip | wc
    ...
    4088   31928 1059085

    $ dd if=/dev/urandom bs=512 count=2048 | xxd -p | gzip | wc
    ...
    5019   33798 1231268
1 megabyte of random data consistently results in ~1 megabyte of compressed base64 text, ~1.2 megabytes of compressed hex.


Repeating the exercise using a photograph (https://imgur.com/B4tqkrZ):

    $ pv lock-your-screen.png | base64 | gzip | wc
     2.1MiB 0:00:00 [  13MiB/s] [================================>] 100%
       8641   48904 2264151


    $ pv lock-your-screen.png | xxd -p | gzip | wc
     2.1MiB 0:00:00 [4.41MiB/s] [================================>] 100%
      10109   49956 2573293

See the same against the jpg version (I've no idea why I have kept both a jpg and png of the same image around, especially given jpg is the much better-suited format):

    $ pv lock-your-screen.jpg | base64 | gzip | wc
     377KiB 0:00:00 [19.2MiB/s] [================================>] 100%
       1420    8373  392796


    $ pv lock-your-screen.jpg | xxd -p | gzip | wc
     377KiB 0:00:00 [9.36MiB/s] [================================>] 100%
       1487    8935  441077
In both cases base64 is both faster and compresses smaller.


This test isn't very informative because both .png and .jpg are already compressed formats, with "better than gzip" strength, so gzip/deflate isn't going to be able to compress the underlying data.

You only see some compression because gzip is just backing out some of the redundancy added by the hex or base64 encoding, and the way the huffman coding works favors base64 slightly.

Try with uncompressed data and you'll get a different result.
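That effect is easy to sanity-check with random bytes standing in for already-compressed data:

```python
# zlib on base64 of incompressible data mostly just backs out the
# 8-bits-in-6 redundancy: the result lands between the raw size and the
# encoded size, never below the raw size.
import base64, os, zlib

raw = os.urandom(1 << 16)             # stand-in for pre-compressed data
b64 = base64.b64encode(raw)

recompressed = len(zlib.compress(b64))
print(len(raw), recompressed, len(b64))
assert len(raw) < recompressed < len(b64)
```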

Your speed comparison seems disingenuous: you are benchmarking "xxd", a generalized hex dump tool, against base64, a dedicated base-64 utility. I wouldn't expect their speeds to have any interesting relationship with the best possible speed of a tuned algorithm.

There is little doubt that base-16 encoding is going to be very fast, and trivially vectorizable (in a much simpler way than base-64).


> both .png and .jpg are already compressed formats, with "better than gzip" strength

FWIW, PNG and gzip both use the DEFLATE algorithm, so I wouldn't call PNG's compression "better than gzip".

Source: https://en.wikipedia.org/wiki/DEFLATE

> This has led to its widespread use, for example in gzip compressed files, PNG image files and the ZIP file format for which Katz originally designed it.


Like any domain-specific algorithms, PNG uses deflate as a final step after using image-specific filters which take advantage of typical 2D features. So in general png will do much better than gzip on image data, but it will generally always do at least as well (perhaps I should have said that originally). In particular, the worst-case .png compression (e.g., if you pass it random data, or text or something) is to use the "no op" filter followed by the usual deflate compression, which will end up right around plain gzip.

Now "at least as good" is enough for my point: by compressing a .png file with gzip you aren't going to see additional compression in general. When compressing a base-64 or hex encoded .png file, the additional compression you see is largely just a result of removing the redundancy of the encoding, not any compression of the underlying image.
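A rough illustration of why the filtering step matters (a Sub-style delta filter on synthetic smooth data; not actual PNG code):

```python
# Smooth data (a bounded random walk) barely compresses raw, but its
# byte-to-byte deltas come from a tiny alphabet and deflate very well.
import random, zlib

random.seed(0)
vals = [128]
for _ in range(65535):
    vals.append((vals[-1] + random.randint(-2, 2)) % 256)
image = bytes(vals)

# PNG's "Sub" filter: replace each byte with its delta from the previous one
filtered = bytes((image[i] - image[i - 1]) % 256 for i in range(1, len(image)))

print(len(zlib.compress(image)), len(zlib.compress(filtered)))
assert len(zlib.compress(filtered)) < len(zlib.compress(image))
```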


Ooops, that should read "Like many domain-specific algorithms" not "Like any ..."


My data wasn't quite random! It repeats every 1kb, much smaller than zlib's 32kb window size.


A good rule of thumb might be that if your results show you are able to consistently compress supposedly random data to less than the size required for just the random binary bits, you should either recheck your numbers, verify your random number generator, or quickly file for a patent!


You should add googling “Shannon” “entropy” and “information theory” to that list ;-)


Try with (much) more data, theoretically they should even out at some point.


If you’re compressing it anyway why turn it into base64 or ascii hex? The compressed data will be binary anyway, so just compress the input data directly.


One example is JSON/XML/HTML responses over HTTP: you need the 8-bit ("binary") data in an ASCII format to fit into JSON/XML/HTML, while HTTP provides gzip (or DEFLATE, Brotli, etc.) compression over the whole response if both the client (by including the "Accept-Encoding" header in the request) and the server implementations support it.
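A minimal sketch of that pattern (gzip here standing in for HTTP's Content-Encoding; the "points" field name is made up):

```python
# Binary data must be ASCII-encoded to sit inside JSON; the transport then
# compresses the whole response.
import base64, gzip, json, os

points = os.urandom(3 * 1024)  # stand-in for a packed typed array

body = json.dumps({"points": base64.b64encode(points).decode("ascii")})
wire = gzip.compress(body.encode("ascii"))   # what a gzip-enabled server sends

decoded = base64.b64decode(json.loads(gzip.decompress(wire))["points"])
assert decoded == points
```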


Correct, we're sending data points as typed arrays inside a much larger JSON payload.


Most of the modern Web is unusable without a highly optimized JIT-compiling JS interpreter. Are you sure about the importance of interoperability & compatibility?


I'm not sure why you believe that is a counterexample - it is anything but. That interpreter can still run code written against the original DOM APIs in the original syntax, as defined 22 years ago - around the time the <noscript> tag appeared in HTML!

Of course if mom and pop were using UUCP and an ADM-3A to fetch their mail, they couldn't view the images -- but all the infrastructure that allowed them to read mail at all could be repurposed without change, possibly including their modem (thanks, RS-232, introduced 1960), if they bought an 80386 desktop with a VGA card.

These are all interoperability success stories! Meanwhile, I need 19 different IM apps on my phone to talk to 19 different people. That's what happens when you ignore interoperability.


> Meanwhile, I need 19 different IM apps on my phone to talk to 19 different people. That's what happens when you ignore interoperability

Interop was not ignored, it was willfully cast aside. That's the world of walled gardens we live in today... federation makes it possible for users to switch outside your walled garden.


Yes, HTTP and all of the layers below are quite fundamental to the success of the WWW. Interoperability breeds competition and real innovation because nobody can lean on a network effect the way modern walled gardens do.


Not everything goes over the web and is gzipped.

I've been working on a project lately with 1200bps radio links that are making REST calls on the other end and 2 vs 3 bytes makes a massive difference.


And why don’t you just send it in binary? REST doesn’t mean you have to use JSON for the HTTP body.


Yeah, that would be ideal.

Some of the protocols we talk over (APRS) are ASCII-only (which is a bit of point B from above).

Transmissions are also signed at the radio so that they can be authenticated down the line which means that you can't easily change the payload envelope once it leaves the radio.

Our main goal is adoption, and talking base64 means that lots of APIs that would otherwise need an intermediate server to do the translation can be used directly. This lowers the barrier to use and means we have a larger surface area of use cases.


It sounds like her use case is badly suited for raw binary transmission due to the high possibility of data loss, much like the internet used to be when base64 was originally introduced.


Base64 on the web is such an unbelievably terrible solution to such a trivial problem that it beggars belief. The idea that parsing CSS somehow blocks on decompressing string delimited, massively inefficient, nonlinearly encoded blobs of potentially incompressible data is insane.

We've taken the most obvious latency path a normal user sees and somehow decided that mess was better than sending an ar file.

(Not that this wasn't inevitable as soon as someone decided CSS should be a text format... sigh)


Why should parsing CSS block on decompressing base64, unless the CSS itself is a base64-encoded data: URI?

If you're talking about data: background images in CSS, then all the CSS parser has to do is find the end of the url(). It doesn't have to do any base64 decoding or anything like that until the rule actually matches.


You have to gz-decompress it, which is a harder job, and the only reason you're doing that (at least for images) is to undo the inefficiency you just added.


Ah, gz-decompress the stylesheet itself, ok.

For what it's worth, in my profiles of pageloads in at least Firefox I haven't seen gz-decompression of stylesheets show up in any noticeable way, but I can believe that it could be a problem if you have a lot of data: images in the sheet...


I once came across a site that was base64-encoding GET params for an incredibly insecure HTTP route.

E.g., the request was something like http://example.com/bad/?v=YT0xMjMmYj00NTY= which they would then decode on the server to a=123&b=456
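(The encoding is trivially reversible, which is the problem:)

```python
# Decoding the commenter's example value shows the real parameters in full.
import base64

print(base64.b64decode("YT0xMjMmYj00NTY="))   # b'a=123&b=456'
```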



