Hmm. The author's recommendation to use Redis' 'KEYS' command to pattern-match keys for invalidation is a dangerous one. 'KEYS' runs in O(n) time over the entire keyspace. If you're using Redis in a serious production environment, you do not want to be running an O(n) command every time you need to invalidate a group of keys. It's better to group related items in a Redis hash, so that they are all stored under one common key.
Honestly, I think this article is a more impoverished version of antirez's post on the same topic [1]. antirez, being one of the principal authors of Redis, is a much more authoritative source, and he actually describes all the patterns that this author described in greater detail.
Thanks for linking the antirez article; I actually hadn't seen it before I wrote mine. I've added a link to it in the article.
As far as the keys thing goes, it's obviously been a major point of contention with my article. I know about the warning; I just don't think it's actually a problem for cache invalidation.
See below on another comment of someone having similar concerns for my explanation why.
Sorry, I didn't mean to come across as a dick. I'm glad that you've taken the feedback of your readers into consideration.
I'm a big fan of Redis, and it's a key component of our stack. Sorted sets are useful for a lot more than just leader boards, though that is a good use case for them. It's a bit late here, so I'm not feeling up to writing a big post, but I'm considering writing my own blog post on my experiences with Redis.
Now, this is where Redis comes in. You can match keys against wildcards! So you can just query it like so:
keys post/83/*
No no no. This is slow and there is a large warning section in the notes (http://redis.io/commands/keys) about using this in production environments.
[Edit: Why is this bad? Think about if you have millions of keys in your environment. KEYS will need to iterate over a million keys to find the ones that match your pattern.]
As an alternative to using KEYS, Redis provides the hash object. Store everything about an object in a hash (e.g.: "posts:83") and then just delete the hash key. Everything under it will be removed as well. If you need to know what's in the hash before it gets deleted, use the HKEYS command (which carries no such performance warning).
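As a sketch of that pattern (the key and field names here are invented for illustration, and a tiny in-memory stand-in mimics the three Redis commands involved so the snippet runs without a live server; with a client library like redis-py, the equivalent calls would be `r.hset`, `r.hkeys`, and `r.delete`):

```python
# Group an object's cached fields in one Redis hash so invalidation
# is a single DEL -- no KEYS scan. FakeRedis is a minimal stand-in
# for HSET / HKEYS / DEL only.

class FakeRedis:
    def __init__(self):
        self.store = {}

    def hset(self, key, field, value):   # HSET key field value
        self.store.setdefault(key, {})[field] = value

    def hkeys(self, key):                # HKEYS key
        return list(self.store.get(key, {}))

    def delete(self, key):               # DEL key
        self.store.pop(key, None)

r = FakeRedis()

# Cache several pieces of post 83 under a single hash key...
r.hset("posts:83", "title", "Redis patterns")
r.hset("posts:83", "rendered_body", "<p>...</p>")
r.hset("posts:83", "comment_count", "12")

print(sorted(r.hkeys("posts:83")))
# ['comment_count', 'rendered_body', 'title']

# ...and invalidate all of it with one DEL.
r.delete("posts:83")
print(r.hkeys("posts:83"))  # []
```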
Right, I know what the docs say, however, it's fine to use this for cache invalidation (but maybe not other use-cases) in production.
The docs also say:
"While the time complexity for this operation is O(N), the constant times are fairly low. For example, Redis running on an entry level laptop can scan a 1 million key database in 40 milliseconds."
If you have so many keys that this is an issue for cache invalidation, you should be using memcached anyways. (Since it can be distributed, where in Redis distribution is left up to you to figure out)
I wasn't able to dig it up, but I know I read an article about some consulting group making a site for a major shoe company where they did exactly this.
Your hash method isn't the best way either though since it's more efficient to store everything in individual key-value pairs, and hashes cannot be nested.
Really the BEST way to do this in Redis is to use a set containing all the keys related to an object, then clear each of them out when destroying the item.
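For concreteness, that "index set" pattern looks something like this (key names are made up for the example, and a small in-memory stand-in covers the SET/SADD/SMEMBERS/DEL commands so the sketch runs standalone; with redis-py the calls would be `r.set`, `r.sadd`, `r.smembers`, `r.delete`):

```python
# Keep a Redis set of every cache key related to an object, then
# delete each member (plus the index set itself) on invalidation.

class FakeRedis:
    def __init__(self):
        self.kv = {}
        self.sets = {}

    def set(self, key, value):        # SET key value
        self.kv[key] = value

    def sadd(self, key, member):      # SADD key member
        self.sets.setdefault(key, set()).add(member)

    def smembers(self, key):          # SMEMBERS key
        return set(self.sets.get(key, set()))

    def delete(self, key):            # DEL key
        self.kv.pop(key, None)
        self.sets.pop(key, None)

r = FakeRedis()

def cache(r, post_id, key, value):
    """Cache a value and record its key in the post's index set."""
    r.set(key, value)
    r.sadd("post:%d:keys" % post_id, key)

def invalidate(r, post_id):
    """Delete every cache key recorded for the post, then the index."""
    for key in r.smembers("post:%d:keys" % post_id):
        r.delete(key)
    r.delete("post:%d:keys" % post_id)

cache(r, 83, "post:83:html", "<article>...</article>")
cache(r, 83, "post:83:comments", "[...]")
invalidate(r, 83)
```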
Why is it more efficient to store everything in individual k-v pairs? Hash tables are certainly more memory efficient, and the difference in CPU efficiency is so slight as to be inconsequential.
I'm curious why you would use Redis instead of just using in-memory datastructures in your app servers? It's trivially easy to implement a leaderboard as a priority queue, for example. And it eliminates the need to run yet another server and deal with the associated RPC & command parsing overhead.
That's the approach Hacker News and Viaweb took, along with Mailinator and probably several other startups.
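For what it's worth, an in-process leaderboard of the sort described above really is only a few lines with stdlib structures (this is just a sketch of the general idea, not the actual HN or Viaweb code):

```python
# In-process leaderboard: a plain dict of scores plus heapq for
# top-N queries, no external server involved.
import heapq

scores = {}

def add_score(user, points):
    scores[user] = scores.get(user, 0) + points

def top(n):
    # nlargest is O(len(scores) * log n) -- fine for modest boards.
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

add_score("alice", 50)
add_score("bob", 30)
add_score("alice", 10)
print(top(2))  # [('alice', 60), ('bob', 30)]
```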
> I'm curious why you would use Redis instead of just using in-memory datastructures in your app servers?
I make fairly pedestrian use of Redis, generally as either a persistent cache, shared memory, or schemaless DB shared by multiple Rails processes. In-memory structures have a lot to anti-recommend them in the Rails world: at any given time I have 4 server processes and 2 worker processes running, and each of them would need a separate copy of everything. There would be consistency problems. Those processes have a lifetime measured in days in the best of cases to minutes in the worst of cases: following a restart, any in-memory structures have to be rebuilt from the underlying data source. Hypothetically assuming demand for my products explodes and I can no longer deal with only a single physical server, Redis plays very well with being accessed from multiple servers, whereas I'd have to write some sort of REST API to reimplement Redis (poorly) on top of my actual people-pay-money-for-this application code to share that state among multiple physical servers, if I were to go down that route.
Redis has been an absolute dream to administrate: the total overhead for me was "apt-get install redis-server", adding three lines of configuration to Rails and tweaking two in Redis, and doing one SCP command when I migrated servers. The RPC/command parsing overhead is, empirically, negligible in my use cases. Don't take this advice if you're Google (I know you're Google, but for the general "you" here), but many people are not Google.
Ah, I was kinda assuming that there's already a separate appserver tier distinct from the webservers. If you don't have that, I can see how something like Redis might be a useful intermediate step so you don't have to go build one until there's actually a substantive need for it.
It's kinda like a complement to memcached then, right? Memcached gives you an off-the-shelf distributed hashtable that you can stick things in. Redis gives you an off-the-shelf list or heap server that you can stick things in. You might eventually want more control of the algorithms that you can run on these, but if it's not yet worth setting up a separate server, you can glue these components together and get a decent approximation.
> Ah, I was kinda assuming that there's already a separate appserver tier distinct from the webservers.
That's kinda an enterprise-y architecture choice in my experience. There are excellent reasons for it (much like Service Oriented Architecture), but I generally see folks evolve into it over time rather than starting from it, unless they come from an enterprise-y background where it's assumed from the beginning. In particular, Rails and some other opinionated frameworks start from the assumption that 99%+ of the business logic is going to get executed in the web tier, and while I'll bet you that some of the more famous Rails deployments eventually move away from that, Rails would fight you every step of the way if you were trying to do it in greenfield development.
Redis makes a great complement (or drop-in replacement depending on use case) for memcached. Relatedly, I love how these (and other OSS tools) let little guys play with big boy solutions without having to have big boy budgets or organizational resources to make use of them. I think Facebook probably has about 10 terabytes more memcached than I do, but it turns out that memcached is really freaking useful way down the scaling/complexity curve, too.
Redis and memcached aren't really that similar. Memcached is a key-value cache; it will evict items that haven't been used recently, and it has no persistence. Redis actually makes a serious effort to not silently lose data. (Unless you tell it to.)
> why you would use Redis instead of just using in-memory datastructures in your app servers?
If you mean actually storing the data (scores on a leaderboard) on the app servers, the problem is it can only scale so far. If data/state is stored on app servers you can't load balance across multiple servers. HN runs on one server, and it has been hitting scalability issues lately. It's also harder to do high-availability, if everything runs on one server there's nothing to fail-over to.
Wouldn't this apply to a Redis deployment as well? At the point where the app server would fall over, the Redis instance is probably getting just about saturated as well. (Well, perhaps a bit later because Redis is written in optimized C instead of Python/Java/Scala, but still no more than a constant factor away.) So you'd need multiple Redis instances. How is coherency between them handled? Does the client library automatically take care of synchronizing writes to multiple instances and failing over reads, or do you have to do all that yourself?
The difference is that relational data is usually disk-based, I/O bound, and requires persistence but not necessarily fast access. There're a bunch of algorithms that are specialized for the access patterns of disks (who wants to implement their own B-trees and transaction logs, other than Google?), so it makes sense to use an off-the-shelf solution for them.
Redis's main selling point is being an in memory datastore, which is great. But virtually every programming language has a rich selection of in-memory data structures in its standard library, along with the ability to write code and implement some more. What is it that Redis gives you over using these? Programmers are generally quite familiar with efficient algorithms for accessing memory - it tends to be taught in intro CS.
Do you know how Hacker News persists changes to its in-memory data structures? Does it snapshot every few seconds or journal every change? Does it keep comments in memory or swap them as needed?
The idea is that Redis coordinates concurrent reads and writes to the same structures for you, giving every process a shared view of the data without you having to implement that coordination yourself. Also, the structures may be large enough that you don't want a full copy in memory for each client.
I was kinda assuming that the system already has a separate appserver tier from the webserver tier (HN doesn't, but many other real-world apps would). Separating application logic from HTML formatting is usually a good idea, if only because HTML templating tends to be CPU-intensive but memory-light while application logic is often CPU-light but memory-intensive. That's an orthogonal issue, though - you can run as many appserver instances as are necessary for your dataset.
I guess I was wondering why, in your app server, you don't just add a big in-process heap and use the normal language mechanisms to access it wherever you'd return your leaderboard info?
The concurrency issue is interesting - how does Redis handle it? Does it have some sort of STM, or is it all because everything executes in a single thread in Redis? If it's the latter, you'd get that for free in a single-threaded appserver (although you probably don't want a single-threaded appserver).
I use redis a lot to store non critical data. For instance I store signup confirmation tokens in redis. The web app sends a message to RabbitMQ when a user signs up, then a background worker catches that message, creates an activation token, stores it in redis, and sends an email to the user. You can set an expiration time on keys too. It's convenient shared memory.
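The expiring-token flow described above could be sketched like so (the key naming is invented for the example, and a tiny stand-in tracks expiry with timestamps so the snippet runs without a server; with redis-py the real call would be `r.setex(key, ttl, value)`):

```python
# Store a signup confirmation token with a TTL; a cache miss later
# means the link expired (or never existed) -- no cleanup job needed.
import time
import uuid

class FakeRedis:
    def __init__(self):
        self.store = {}   # key -> (value, expires_at)

    def setex(self, key, ttl, value):     # SETEX key ttl value
        self.store[key] = (value, time.time() + ttl)

    def get(self, key):                   # GET key (with lazy expiry)
        item = self.store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.time() >= expires_at:
            del self.store[key]
            return None
        return value

r = FakeRedis()
token = uuid.uuid4().hex
r.setex("signup:confirm:%s" % token, 86400, "user:42")  # 24h TTL

# The confirmation endpoint just looks the token up.
print(r.get("signup:confirm:%s" % token))  # "user:42" while valid
```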
"It has some persistence support, but does not appear to be super durable. If you're thinking of using like that though, you're misunderstanding the tool."
This warrants a much more detailed explanation. The author should have spoken about possible durability options, like the append-only file, which, given the right configuration, makes Redis "fully-durable" at the inevitable sacrifice of some speed.
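For reference, the append-only file is controlled by a pair of redis.conf directives; `appendfsync always` fsyncs on every write at the cost of throughput, while `everysec` is the usual compromise:

```
appendonly yes        # enable the AOF
appendfsync everysec  # fsync once per second (alternatives: always / no)
```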
We also want to work both on communication (most users don't understand that Redis with both AOF and RDB enabled is already very durable, and this is the setup we suggest) and on the implementation, to make sure that Redis AOF can be a very durable solution, as durable as the best SQL databases out there.
If I was greenfielding a new project tomorrow, I'd use a Redis/MySQL combo. MySQL for all the data in perfect first-normal form, and then Redis for storing difficult joins, caches, queues, etc. It would be a perfect marriage.
For some reason, people seem to think it's a key-value store, or some persistent database, but that's totally not it at all.
From what I understand, it is actually a key-value store and is basically a superset of memcached. Therefore, if my understanding is true, you could use it merely as a key-value store as well and use its other features (native sets, lists, etc., pub-sub, and persistence) as needed.
I use Redis as a distributed locking mechanism too, especially with its setex feature which can help reduce deadlocks. We have a UUID string as the value for the key, and the only way to release the lock is that the UUID must be passed and matched against the value in Redis.
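A sketch of that UUID-fenced lock release (the lock name and helpers here are made up for illustration; in Redis the acquisition would be `SET key uuid NX EX ttl`, and the compare-and-delete on release should really be one atomic step, e.g. a Lua EVAL script, rather than the separate GET/DEL shown in this stand-in):

```python
# Only the holder's UUID token can release the lock, so a stale
# client whose lock already expired and was re-acquired by someone
# else cannot delete it out from under them.
import uuid

class FakeRedis:
    def __init__(self):
        self.store = {}

    def set_nx(self, key, value):     # SET key value NX
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        self.store.pop(key, None)

r = FakeRedis()

def acquire(r, name):
    token = uuid.uuid4().hex
    return token if r.set_nx("lock:%s" % name, token) else None

def release(r, name, token):
    # Compare-and-delete: note this should be atomic in real Redis.
    if r.get("lock:%s" % name) == token:
        r.delete("lock:%s" % name)
        return True
    return False

t = acquire(r, "reindex")
print(release(r, "reindex", "wrong-token"))  # False
print(release(r, "reindex", t))              # True
```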
A valid article, mostly. And the point still stands, and I feel that most people using Redis don't get it: it is a data structure server. It's not that hard to understand...
In order to argue that redis isn't a database you have to first define what a database is and explain why redis doesn't fit that definition. All the author did was assert that redis isn't a database and give 3 use cases.
This is just bad writing, to be blunt.
And for what it's worth antirez wrote an HN clone using redis as the database.
[1] http://antirez.com/post/take-advantage-of-redis-adding-it-to...