I hate base64. If you have to embed binary data within human-readable text, then you’re either
A: doing something horribly wrong
B: or forced to use a monkey-patched legacy standard
Otherwise just use hexadecimal digits. At least it will be somewhat readable.
I know, I know... Base64 encodes 3 bytes in 4 characters, while hex encodes 2 bytes in 4. Big fucking deal. It will almost surely be gzipped down the line, and if you’re storing amounts of data where the difference would matter, then see point “A” above.
I hope one day we will laugh at the mere idea of base64 encoding like how we do with EBCDIC or Windows-12XX codepages.
Base64 has been around forever, and hex has significantly lower information density: for the same bytes, base64 output is consistently more than 30% smaller than hex.
As for doing something horribly wrong, I dunno, ensuring literally billions of moms and pops can exchange baby pictures by e-mail, for the first time in human history, and doing so since the time when we were using 33.6kbit modems, that seems like a resounding success.†
† Yes, if e-mail had not paid such close attention to compatibility, and just ripped up every old standard on every chance it got (i.e. pretty much like any technology invented since the commoditization of the web), interoperability would never have happened. Legacy crap is fundamentally a part of that success, and Base64 is fundamentally a part of making those hacks practical for the technology we had at the time††
†† To head off an obvious counterargument: yes, there are contemporary examples of why this strategy is still important (it's, so far, the only one that has ever worked in the decentralized environment of the Internet), and is very likely to come in useful in future, time and time again
Hex has lower information density, but it is often more compressible, particularly if your data is byte-aligned. For example, the string "aaaaaa" is base64-encoded as "YWFhYWFh" but in hex is "616161616161". On a project I'm on, we switched from base64 to hex for some binary data embedded in JSON and saw a ~20% size reduction, since it's always compressed by the web server.
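A quick sketch of the alignment point (sizes will vary with the data and zlib version; the ~20% figure is from their project, not reproduced here). The payload below is made up for illustration:

```python
import base64
import gzip

# The encodings from the comment above:
data = b"aaaaaa"
assert base64.b64encode(data) == b"YWFhYWFh"
assert data.hex() == "616161616161"

# For byte-aligned, repetitive binary data, hex preserves byte
# boundaries (exactly 2 chars per byte), while base64's 3-byte
# grouping shifts the phase of any pattern whose period isn't
# a multiple of 3.
payload = bytes(range(16)) * 1024  # repetitive, byte-aligned blob
gz_hex = len(gzip.compress(payload.hex().encode()))
gz_b64 = len(gzip.compress(base64.b64encode(payload)))
print(f"gzipped hex: {gz_hex} bytes, gzipped base64: {gz_b64} bytes")
```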
I had to use Python 2.x to make use of the 'hackercodecs' package that implements ascii85, but it looks like the best of both worlds, assuming its character set suits whatever medium you're transferring over and decoding it on the other end doesn't require slow code.
final edit: I'm guessing it was an accidental trick of the repetitive input data lining up well. On real data hex still wins out (which should have been obvious in hindsight):
See the same against the jpg version (I've no idea why I've kept both a jpg and a png of the same image around, especially given jpg is a much better-suited format):
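For what it's worth, Python 3.4+ ships Ascii85 in the standard library (`base64.a85encode`/`a85decode`), so the Python 2 hackercodecs dependency is no longer needed:

```python
import base64

data = b"hello world!"  # 12 bytes = three full 4-byte groups

# Ascii85 maps 4 bytes to 5 characters: ~25% overhead,
# vs ~33% for base64 and 100% for hex.
encoded = base64.a85encode(data)
assert len(encoded) == 15
assert base64.a85decode(encoded) == data

# b85encode uses a different 85-character alphabet that avoids
# some characters awkward in quoted contexts.
print(encoded, base64.b85encode(data))
```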
This test isn't very informative, because both .png and .jpg are already compressed formats with "better than gzip" strength, so gzip/deflate isn't going to be able to compress the underlying data.
You only see some compression because gzip is just backing out some of the redundancy added by the hex or base64 encoding, and the way the Huffman coding works favors base64 slightly.
Try with uncompressed data and you'll get a different result.
Your speed comparison seems disingenuous: you are benchmarking "xxd", a generalized hex-dump tool, against "base64", a dedicated base-64 tool. I wouldn't expect their speeds to have any interesting relationship with the best possible speed of a tuned algorithm.
There is little doubt that base-16 encoding is going to be very fast, and trivially vectorizable (in a much simpler way than base-64).
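A sketch of why base-16 is so simple: each output character depends on exactly one nibble, with no state carried between input bytes (CPython's built-in `bytes.hex()` does the equivalent in C):

```python
HEX_DIGITS = "0123456789abcdef"

def hex_encode(data: bytes) -> str:
    # Each byte splits into two independent nibble lookups --
    # exactly the structure a SIMD shuffle can process 16 or 32
    # bytes at a time, with no cross-byte bit borrowing as in base64.
    return "".join(HEX_DIGITS[b >> 4] + HEX_DIGITS[b & 0x0F] for b in data)

assert hex_encode(b"\xde\xad\xbe\xef") == "deadbeef"
assert hex_encode(b"aaa") == b"aaa".hex() == "616161"
```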
> This has led to its widespread use, for example in gzip compressed files, PNG image files and the ZIP file format for which Katz originally designed it.
Like many domain-specific formats, PNG uses deflate as a final step, after applying image-specific filters which take advantage of typical 2D features. So in general png will do much better than gzip on image data, but it will essentially always do at least as well (perhaps I should have said that originally). In particular, the worst-case .png compression (e.g., if you pass it random data, or text, or something) is to use the "no-op" filter followed by the usual deflate compression, which will end up right around plain gzip.
Now "at least as good" is enough for my point: by compressing a .png file with gzip you aren't going to see additional compression in general. When compressing a base-64- or hex-encoded .png file, the additional compression you see is largely just the removal of the redundancy added by the encoding, not any compression of the underlying image.
A good rule of thumb: if your results show that you are able to consistently compress supposedly random data to less than the size of the underlying random bits, you should either recheck your numbers, verify your random number generator, or quickly file for a patent!
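A sanity check along those lines (deterministic seed; exact sizes vary with zlib version, so only loose bounds are asserted):

```python
import gzip
import random

rng = random.Random(0)
raw = bytes(rng.randrange(256) for _ in range(10_000))  # "random" payload

gz_raw = len(gzip.compress(raw))
gz_hex = len(gzip.compress(raw.hex().encode()))

# gzip can't shrink incompressible data; it only adds framing overhead.
assert gz_raw > len(raw)

# gzipping the hex encoding mostly just backs out the 2x expansion:
# 20,000 four-bit symbols Huffman-code back down to roughly 10,000 bytes,
# i.e. about the size of the raw binary.
assert 0.9 * len(raw) < gz_hex < 1.2 * len(raw)
print(f"raw={len(raw)} gz_raw={gz_raw} gz_hex={gz_hex}")
```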
If you’re compressing it anyway why turn it into base64 or ascii hex? The compressed data will be binary anyway, so just compress the input data directly.
One example are JSON/XML/HTML responses over HTTP: you need the 8-bit ("binary") data in an ASCII format to fit into JSON/XML/HTML, while HTTP provides gzip (or DEFLATE, Brotli, etc.) compression over the whole response if both the client (by including the "Accept-Encoding" header in the request) and the server implementations support it.
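A minimal sketch of that pattern (the field names are made up for illustration): embed the bytes as hex text in the JSON body, and let the transport's content encoding handle compression, as HTTP does when client and server agree on Accept-Encoding/Content-Encoding:

```python
import gzip
import json

thumbnail = bytes(range(64))  # stand-in for a binary blob

# JSON can't carry raw bytes, so encode them as hex text...
body = json.dumps({"id": 42, "thumb": thumbnail.hex()})

# ...and compress the whole response, as the HTTP layer would.
wire = gzip.compress(body.encode())

# Receiving side: decompress, parse, decode.
parsed = json.loads(gzip.decompress(wire))
recovered = bytes.fromhex(parsed["thumb"])
assert recovered == thumbnail
```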
Most of the modern Web is unusable without a highly optimized JIT-compiling JS interpreter. Are you sure about the importance of interoperability & compatibility?
I'm not sure why you believe that is a counterexample - it is anything but. That interpreter can still run code written against the original DOM APIs in the original syntax, as defined 22 years ago - around the time the <noscript> tag appeared in HTML!
Of course, if mom and pop were using UUCP and an ADM-3A to fetch their mail, they couldn't view the images -- but all the infrastructure that allowed them to read mail at all could be repurposed without change, possibly including their modem (thanks, RS-232, introduced in 1960), if they bought an 80386 desktop with a VGA card.
These are all interoperability success stories! Meanwhile, I need 19 different IM apps on my phone to talk to 19 different people. That's what happens when you ignore interoperability
> Meanwhile, I need 19 different IM apps on my phone to talk to 19 different people. That's what happens when you ignore interoperability
Interoperability wasn't ignored, it was willfully cast aside. That's the world of walled gardens we live in today... federation makes it possible for users to leave your walled garden.
Yes, HTTP and all of the layers below are quite fundamental to the success of the WWW. Interoperability breeds competition and real innovation because nobody can lean on a network effect the way modern walled gardens do.
I've been working on a project lately with 1200bps radio links that make REST calls on the other end, and 2 vs 3 bytes makes a massive difference.
Some of the protocols we talk over (APRS) are ASCII-only (which is a bit of point B from above).
Transmissions are also signed at the radio so that they can be authenticated down the line which means that you can't easily change the payload envelope once it leaves the radio.
Our main goal is adoption, and speaking base64 means that lots of APIs that would otherwise need an intermediate server to do the translation can be used directly. This lowers the barrier to use and means we cover a larger surface area of use cases.
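Back-of-the-envelope numbers for the 2-vs-3-bytes point (assuming 8 bits per character on the air and ignoring framing/FEC overhead, which APRS adds on top):

```python
import math

def airtime_seconds(payload_bytes: int, encoding: str, baud: int = 1200) -> float:
    """Rough transmit time for an ASCII-encoded payload at `baud` bps."""
    if encoding == "hex":
        chars = 2 * payload_bytes                  # 2 chars per byte
    elif encoding == "base64":
        chars = 4 * math.ceil(payload_bytes / 3)   # 4 chars per 3 bytes
    else:
        raise ValueError(encoding)
    return chars * 8 / baud

# 300-byte payload: 600 chars as hex vs 400 as base64 --
# more than a second of airtime saved on a shared 1200 bps channel.
assert airtime_seconds(300, "hex") == 4.0
assert airtime_seconds(300, "base64") == 400 * 8 / 1200
```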
It sounds like her use case is badly suited for raw binary transmission due to the high possibility of data loss, much like the internet used to be when base64 was originally introduced.
Base64 on the web is such an unbelievably terrible solution to such a trivial problem that it beggars belief. The idea that parsing CSS somehow blocks on decompressing string-delimited, massively inefficient, nonlinearly encoded blobs of potentially incompressible data is insane.
We've taken the most obvious latency path a normal user sees and somehow decided that mess was better than sending an ar file.
(Not that this wasn't inevitable as soon as someone decided CSS should be a text format... sigh)
Why should parsing CSS block on decompressing base64, unless the CSS itself is a base64-encoded data: URI?
If you're talking about data: background images in CSS, then all the CSS parser has to do is find the end of the url(). It doesn't have to do any base64 decoding or anything like that until the rule actually matches.
You have to gz-decompress it, which is a harder job, and the only reason you're doing that (at least for images) is to undo the inefficiency you just added.
For what it's worth, in my profiles of pageloads in at least Firefox I haven't seen gz-decompression of stylesheets show up in any noticeable way, but I can believe that it could be a problem if you have a lot of data: images in the sheet...