A new compression algorithm to make Google Chrome updates small (chromium.org)
105 points by l0stman on July 15, 2009 | 25 comments


The code has been available in their public SVN repository for quite some time:

http://src.chromium.org/viewvc/chrome/trunk/src/courgette/

BTW, I've read about 40% of the Chrome code and I have to say it is the single best open-source project I've come to study so far. Incredibly clean code; it really shows how much properly done code reviews can benefit a project in the long run.


Well, the "cleanliness" is probably because it hasn't been open source for very long - most of the original work was done by a dedicated team of software engineers, most likely operating under fairly strict code guidelines.


As someone who works on a project with a dedicated team of engineers under fairly strict code guidelines, including code reviews, I can tell you from experience that none of these factors guarantees "code cleanliness" (ours is not clean, at least in my opinion).


bsdiff was written by a HN regular, and I can't think of a better person to provide insight on this. (Hint.)


Did someone summon me?

I don't know all the details behind Courgette, but it sounds very similar to Exediff. In my thesis I show that bsdiff (well, a slightly improved version of bsdiff which I never got around to releasing publicly) performed on par with or slightly better than Exediff; so I'm surprised that Courgette is getting this far ahead of bsdiff.

Based on the numbers they've published (a 10 line source code patch to a 10MB executable resulting in a 700kB bsdiff patch), it sounds like there's something weird going on which is breaking bsdiff -- the FreeBSD kernel is roughly the same size as Chrome, but normally for a small (e.g., 10 line) source code patch I would see a bsdiff patch of between 50kB and 100kB.

Beyond that, I really can't offer any more insight -- unless someone wants to hire me to spend a week looking at the old and new binaries and figuring out where bsdiff is going wrong. :-)
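For anyone wondering why executable diffs explode like this in the first place, here's a toy illustration (my own, nothing to do with bsdiff's or Chrome's actual code) of how a small layout shift perturbs absolute addresses scattered all through a binary:

```python
# Toy illustration (not bsdiff or Chrome code): a "binary" consisting of
# absolute 4-byte little-endian pointers. Shift the layout by 16 bytes and
# every pointer's value changes, so a naive byte-for-byte comparison finds
# differences scattered across the whole file.
import struct

def image(base):
    # 1000 fake absolute pointers into the binary, 16 bytes apart
    return b"".join(struct.pack("<I", base + 16 * i) for i in range(1000))

old = image(0x400000)
new = image(0x400010)  # the identical "program", shifted by 16 bytes

diff_bytes = sum(a != b for a, b in zip(old, new))
print(diff_bytes, "of", len(old), "bytes differ")
```

bsdiff copes with this by encoding byte-wise subtraction deltas, which end up small and highly compressible; from the announcement, Courgette goes further and factors the pointers out entirely before diffing.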


Does this generalize to a moving "fixed offset" in the compressor? i.e. can't it be done without the disassembler? (Assuming they're parsing x86?)

x86 is variable length, so this isn't trivial, but surely it is possible. I guess I don't like the "assembler" step because, hmm, SSE7 might not look like x86.

Also, if you were willing to have the loader do the work, it sounds like you could do a -fPIC sort of thing and load most code as a DLL.

Very cool that they got it working, though.
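As I understand the announcement, the disassembler's main job is just to find internal target addresses and rename them symbolically, so the diff runs over layout-independent data. A toy sketch of that label-substitution idea (my own simplification, not Google's code):

```python
# Toy sketch of the Courgette label idea (my simplification, not Google's
# code): map each distinct internal target address to a small label index.
# Two builds with the same control flow but shifted layouts then produce
# identical label streams, so the diff between them collapses.

def to_labels(pointers):
    """Replace each distinct address with a small label index."""
    table = {}
    stream = []
    for p in pointers:
        if p not in table:
            table[p] = len(table)
        stream.append(table[p])
    return stream  # layout-independent

old = [0x400010, 0x400020, 0x400010, 0x400030]
new = [0x400020, 0x400030, 0x400020, 0x400040]  # same structure, shifted

print(to_labels(old) == to_labels(new))  # True: the label streams match
```

The real tool of course also has to ship the label-to-address table for the new build so the client can reassemble real pointers, but that table is small and diffs well on its own.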


Without engaging conspiracy theories, I think the most interesting part is here:

The small size in combination with Google Chrome's silent update means we can update as often as necessary to keep users safe.

[An extra tidbit: a courgette is a 'summer squash'. Silly Google...]


I personally found an order-of-magnitude increase in the compression ratio for their executables to be more interesting than that.


They published a paper on how they intend to increase browser security. Silent updates was a big factor because, basically, most internet users don't have a clue what they're doing.

"Why Silent Updates Boost Security" http://www.techzoom.net/publications/silent-updates/


Technically, it's not a new compression algorithm. It's just a more efficient update scheme which uses a better suited binary diffing algorithm.


What you call "binary diffing" is called "delta compression" by people in the field -- and delta compression is considered to be a type of compression (just like image compression, audio compression, video compression, et cetera).


Perhaps what he meant is that it's not a new entropy coding algorithm; it's a new prediction algorithm that feeds data to an old entropy coding algorithm. Overall, it's still a new compression scheme though.


Right, from the title I was anticipating a new compression algorithm in the traditional sense like DEFLATE or LZW. This is a new differential compression algorithm; indeed still considered a compression algorithm.
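To make the "prediction plus old entropy coder" framing concrete, here is a minimal sketch (illustrative only, and far cruder than bsdiff): use the old file as the prediction, subtract it from the new one, and hand the residual to a stock compressor.

```python
# Crude delta-compression sketch (illustrative only, much simpler than
# bsdiff): the "prediction" is the old file itself; only the residual gets
# entropy-coded, here by the stock bz2 compressor.
import bz2
import random

random.seed(0)
old = bytes(random.randrange(256) for _ in range(100_000))  # stand-in old build
new = bytearray(old)
new[5000:5010] = b"ten bytes!"                              # a small change

delta = bytes((n - o) & 0xFF for n, o in zip(new, old))     # mostly zeros
patch = bz2.compress(delta)
full  = bz2.compress(bytes(new))                            # shipping it whole

print(len(patch), "byte patch vs", len(full), "byte full download")
```

The receiver rebuilds the new file byte-by-byte as (old + delta) mod 256. The residual is almost all zeros, so the patch is orders of magnitude smaller than shipping the (incompressible, in this toy) file whole.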


It's a very impressive achievement, but I'm not sure I buy their motivation for doing the work. 70k vs 700k? It's a blink of an eye on today's Internet. It sounds more like a very bright engineer wanted to work on a cool project and then reverse-reasoned his way back to this "smaller is faster" idea.


"If the update is a tenth of the size, we can push ten times as many per unit of bandwidth. We have enough users that this means more users will be protected earlier."


Yes I read the article as well. The issue is that Google probably has as close as you can get to an infinite amount of bandwidth, which makes this sort of a low priority item. I don't want to trivialize the technology; it's obviously very cool stuff, but it's the solution to a problem I doubt they ever had.

"Hi everyone, welcome to the Monday morning Chrome staff meeting. Um... so we've gotten some push back from management. They -- guys you're not going to believe this -- they think we might be using too much bandwidth. I tried to make a snarky joke about YouTube, but they weren't having any of it. So, um... we're gonna go ahead and pull two or three guys from the [INSERT IMPORTANT BROWSER COMPONENT THAT WILL LEAD TO A BETTER BROWSER HERE] team and have them work on making the updates smaller."


I have a data cap on my Internet connection. Also, a broadband connection in India is usually 256kbps (kilobits per second, not kilobytes). I myself have a 512kbps connection.

Since broadband penetration here is not very good, people who travel a lot and need to be online most of the time use mobile Internet, which is much slower and costs much much more.

So, what do you mean by "today's Internet"?


Even on your example of a typical Indian 256kbps circuit, a 700k update is only about 20s.

   http://www25.wolframalpha.com/input/?i=700+kilobytes/256+kbps
And it's going to happen in the background anyway. jQuery (as per jQuery.com) is nearly 20k minified and gzipped. So they have gone from 35 jQuery downloads to 3.5 jQuery downloads. The 10x factor is impressive, but in absolute terms I don't think it's really that noticeable.

"Today's Internet" is the one with jQuery, YouTube, Netflix streaming, last.fm, Flickr, and GMail. It's the one where most of us can get multiple Mbps, or at least several hundred kbps, to our homes. It's the one where we don't notice the difference between 70k and 700k. It's actually quite nice. :-)
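For the record, the back-of-the-envelope math behind that ~20 second figure (using the numbers above):

```python
# The figures from above: a 700 kB update over a 256 kbps line.
size_bytes = 700 * 1000
link_bps   = 256 * 1000                       # 256 kilobits per second
seconds    = size_bytes * 8 / link_bps
print(round(seconds, 1), "s for 700 kB;", round(seconds / 10, 1), "s for 70 kB")
```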


It's a huge difference if you were planning to constantly push live updates, maybe multiple times a day, to millions of users. We've seen how fast attacks can spread these days -- especially on social networks like Twitter, where users easily fall into a false sense of security because they trust their friends. If Chrome were constantly streaming in either full-blown fixes or just some basic attack-prevention rules (say, flag URL=X on DOMAIN=Y), it could really cut down on the spread of these attacks. Not a perfect system, but much better than ignoring the problem entirely.


They might make it out as an advantage for the customer, but what they really mean is "fewer, smaller CDN servers." A YouTube video, say, might be requested quite a few times in a day/hour/minute, but when a web browser notices it has an update, they all try to update, all at once. A 10x reduction in size is going to help a bit with handling that peak load.


Say there are 3 million Chrome users. 700 KB * 3 million = 2.1 TB. That is quite a lot of bandwidth. They are not trying to reduce costs; they are trying to minimize the time it takes to push an update to all users.

If they set the time interval of update check to 1 hour, then they have 1 hour to push out this data. If they can reduce the data by 10 times, they can schedule the update check more aggressively.
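Spelling out the arithmetic above (the 3 million user count is the comment's assumption, not a real figure):

```python
# Back-of-the-envelope numbers from the comment above (user count assumed).
users        = 3_000_000
bsdiff_patch = 700_000            # bytes: the published 700 kB bsdiff figure
courgette    = bsdiff_patch // 10 # roughly the claimed 10x reduction

per_push_bsdiff = users * bsdiff_patch   # bytes per full update push
per_push_courg  = users * courgette
print(per_push_bsdiff / 1e12, "TB vs", per_push_courg / 1e12, "TB")
```

At a fixed pipe size, a 10x smaller push either drains 10x faster or lets you schedule update checks 10x more aggressively, which is the point the parent is making.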


A lot of people are on surprisingly crappy connections.


This isn't a clever way to update lynx, or to send firmware updates to a Mars rover (that would be a really cool use case). It's for Google Chrome, the browser Google built to handle the next generation (read: bandwidth-hungry) of web applications. Besides that, the browser update download typically happens in the background, so even if it is slow, I'd bet the difference between 70k and 700k is still negligible.


A majority of the machines in Africa are so virus-ridden and broadcasting so much worm junk that there is barely any bandwidth available for things like updates.

Every bit helps.

Anyway, doing it as well as can be done is a good challenge.


There are a lot of people who use EDGE/3G without an unlimited bandwidth plan. Also, in a lot of countries people are on pay-per-traffic plans. Since Chrome auto-updates itself and there's no way to disable this, saving kilobytes has real value.



