Reddit explains their architecture, scaling and recent downtime (reddit.com)
65 points by mqt on March 1, 2010 | hide | past | favorite | 16 comments


"It turns out that this one little decision makes it so that we can't horizontally scale that layer of our architecture without losing all the data already there (because all of the keys would point to the wrong server if we added a new one)."

I'm surprised they're looking to switch stacks entirely as opposed to a consistent hashing / distributed hash table (a la chord, dynamo, etc.)

http://en.wikipedia.org/wiki/Consistent_hashing

http://en.wikipedia.org/wiki/Chord_(peer-to-peer)
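To make the suggestion concrete, here's a toy consistent-hash ring in Python, in the spirit of the Chord/Dynamo papers linked above. The class name, server names, and replica count are invented for illustration, not anything reddit uses:

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring: each server owns many points on a circle,
    and a key maps to the first server point clockwise from its hash."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Virtual nodes smooth out the key distribution across servers.
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    def get(self, key):
        # Walk clockwise to the first virtual node at or after the key's point.
        points = [p for p, _ in self.ring]
        idx = bisect(points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache1", "cache2", "cache3"])
print(ring.get("t3_abc123"))  # maps the key to one of the three servers
```

The payoff is in `add`: bringing up a fourth server only steals the arc of keys immediately before its points, so most keys keep mapping to the server they were already on.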


We can't just switch hashing at this point, because none of the data would be in the right place.

Since we have to move all the data anyway, we figured now would be a good time to switch stacks. Memcachedb isn't really a very solid product, so even if we scale it, it isn't a good long term solution.


I love how transparent reddit is about these sorts of things. In their video from pycon, they even discussed their server hosting bills!

Crazy how such a minor design decision (not using consistent hashing) could have such huge consequences down the line...


"[memcached] can no longer return data fast enough for our needs, due to the way it interacts with BDb (its underlying data store)."

Does anyone know what the specific issue being alluded to here is?


MemcacheDB periodically flushes its in-memory DB to disk, using BDb. That causes a global read/write lock for a looooong time. (This pauses BCC for 5 seconds at a time when it happens, and I only have 20 MB stored in memcachedb. To rectify this I'm moving to Redis as soon as I get a day free. Suffice it to say that if Bingo Card Creator taxes the architecture of your key/value store, you may not quite be ready for prime time.)


I may be a bit biased, but if what they want is a fast persistent cache, Redis should really be a good fit.


Does Redis block when writing to disk?


No, writes are asynchronous.


Yeah, that was a typo. Memcachedb just isn't fast enough. It uses blocking I/O when reading from disk, so if it's waiting on the disk for some data, none of the other requests can go through.


They're talking about memcachedb here rather than memcached.

It's memcached, but with BerkeleyDB underneath for persistence.

http://memcachedb.org/


"It is a highly customized user experience, on par with something like Facebook (just not as many users)"

Really, Reddit?

While I don't doubt that they have complexities to deal with, this sounds like a skewed perspective: either overestimating their own complexity, or underestimating Facebook's.


We have had multiple discussions with the Facebook folks, and are well aware of the complexities of both sites. I believe you are actually underestimating reddit's complexity.

Here are some examples:

When you load a comments page with 500 comments, there are 1000s of data points that have to be loaded. For every comment, we have to check how you voted on it to draw the arrows, we have to check if you are the author, we have to check if you are allowed to remove that comment as a moderator, we have to check if you can edit it, if the author is your friend, and so on. And we have to do that 500 times.
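A rough sketch of what those per-comment checks look like, assuming the viewer's state (votes, friends, moderated reddits) has already been fetched in batches. All names and data shapes here are invented for illustration, not reddit's actual code:

```python
def annotate_comments(comments, viewer):
    """For each comment, compute the viewer-specific flags the page needs.
    viewer's votes/friends/moderated are assumed pre-fetched in bulk."""
    votes = viewer["votes"]          # comment_id -> +1/-1
    friends = viewer["friends"]      # set of user ids
    moderated = viewer["moderated"]  # set of reddit (subreddit) ids
    out = []
    for c in comments:  # this loop runs 500 times on a big comments page
        out.append({
            "id": c["id"],
            "vote": votes.get(c["id"], 0),             # which arrow to highlight
            "is_author": c["author"] == viewer["id"],  # edit rights
            "can_remove": c["sr"] in moderated,        # moderator removal
            "author_is_friend": c["author"] in friends,
        })
    return out
```

Even with the lookups batched, every comment still yields several viewer-specific data points, which is the "1000s of data points" arithmetic above.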

When you load a listing, we have to pull your subscriptions, and then merge the results from all the reddits you subscribe to. Then we have to check most of the same things as we do for comments.

After all that checking, we have to render a page that is customized just for you. Some of that will come from the render cache and some from the data cache, but it is still highly customized.

The big difference is that Facebook doesn't have any logged out users, whereas we do.

Akamai takes care of our logged out users, but rendering the page for a logged in user is extremely complicated.


Thanks for the reply --

And I'm not trying to be difficult, but... I'm not convinced.

In your scenario, the 500 comments are already loaded for any user. The comments you have authored are a small dataset, not a check for each and every comment. (At least, I HOPE not.) Similarly for your friends' comments... and whether or not you are a mod is a single check.

For the average thread and user, this really should be a handful of checks. Surely, there are some very active users with many friends for whom the checks could hit numbers in the 1000s. But that should be the edge case, shouldn't it?

Same thing for loading a listing. Pulling subscriptions and merging sounds like just DB query design. All the "same things as we do for comments" should be minimized as described above.

Rendering a page is just that -- rendering... highly customized, yes. But customized on the previously discussed criteria, not a whole new set of customizations.

And like you said, if you aren't logged in, none of this applies.

Which all raises two questions: 1) What percentage of users are logged in? 2) What percentage of page renders really fit the scenario you described above, with enough comments and related data points that the checks hit 4 digits?

If you come back and say 90% of your page renders need 1000 data points, I'd certainly concede the point. But I'd be very surprised. (And highly suggest that new algorithms are needed.)

But I'm sincerely curious what those percentages are.


Two things surprise me about this article - probably because I've misunderstood it and don't see the big picture.

One is that there are master and slave databases and searches are done off the master - I've always seen them done off the slaves in other systems. The other is that they state that using MD5 doesn't allow for horizontal scaling. One of the qualities of MD5 is that all bits have an equal probability of being 0/1. Surely the last 1 or 2 bits can be used to indicate which server is holding the data?


Searches are likely done off slaves - I suspect that is not presented properly because of the oversimplification of the diagram.

You can just use a few bits from an MD5 hash to decide server as long as you know how many servers you're going to have up front. The problem is that if you later wanted to add or remove a server, you would need to come up with a new scheme and move every piece of data around so it's on the right server (which would take days/weeks).

The more scalable/flexible solution is to use a consistent hashing algorithm (check out some of the papers on Chord) so that adding or removing a server doesn't require you to move as much data around.
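A quick toy simulation of the resharding problem described above: with `hash(key) % N`, going from 3 to 4 servers remaps almost every key (with consistent hashing, only about 1/(N+1) of keys would move). The key names and server counts are made up; the ~75% figure follows from how often `h % 3 == h % 4`:

```python
import hashlib

def h(key):
    """Uniform integer hash of a key (MD5, as in reddit's scheme)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"thing:{i}" for i in range(10_000)]

# Naive scheme: bucket by hash mod number-of-servers.
before = {k: h(k) % 3 for k in keys}  # 3 cache servers
after = {k: h(k) % 4 for k in keys}   # add a 4th server
moved = sum(1 for k in keys if before[k] != after[k])
print(f"mod-N: {moved / len(keys):.0%} of keys moved")  # roughly 75%
```

A key stays put only when its hash gives the same residue mod 3 and mod 4, which happens for just 3 of every 12 hash values, hence the days/weeks of data shuffling if each key's value has to physically move.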


The search machine is its own database. It feeds its data from the masters for consistency, but the searches themselves run against the search database.

I think the MD5 thing was covered well below.



