
The advice I've heard from serious people using Elasticsearch for serious things indicates that you should definitely not use Elasticsearch as a primary data store (i.e. it should be treated as a cache).


This is true. On the other hand, even a secondary data store that's considered "lossy" poses a challenge — how do you know if its integrity has been compromised?

In other words, if you're firehosing your primary data store into ElasticSearch, you'll want to know whether it's got all the data you pushed to it at any given time.

I suppose you could use some kind of heuristic to detect this, like posting a "checksum" document occasionally that contains the indexing state and thus acts as a canary that lets you detect loss. On the other hand, this document would be sharded, so you'd want one such document per shard. Is this a solved problem?
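A toy sketch of that canary idea, simulated with in-memory data rather than a real cluster (the names `canary_doc` and `shard_intact` are invented for illustration, not an existing API):

```python
import hashlib

def canary_doc(shard_id, doc_ids):
    """Build a hypothetical per-shard 'canary' document: a checksum over
    the IDs the indexer believes it has pushed to that shard."""
    digest = hashlib.sha256("\n".join(sorted(doc_ids)).encode()).hexdigest()
    return {"shard": shard_id, "count": len(doc_ids), "checksum": digest}

def shard_intact(canary, actual_doc_ids):
    """Recompute the checksum over what the shard actually holds and
    compare it against the canary."""
    actual = canary_doc(canary["shard"], actual_doc_ids)
    return actual["checksum"] == canary["checksum"]

# Simulated: shard 0 was sent three docs but silently lost one.
canary = canary_doc(0, ["a", "b", "c"])
print(shard_intact(canary, ["a", "b", "c"]))  # True
print(shard_intact(canary, ["a", "c"]))       # False
```

This only tells you *that* something was lost, not *what* — which is the thread's recurring problem.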


A logical "SELECT COUNT(*) WHERE updated_at < now()" is probably reasonably fast on your primary store and ElasticSearch.
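The same check, simulated in Python with in-memory `(id, updated_at)` rows standing in for the two stores:

```python
def count_before(rows, cutoff):
    """Count rows updated strictly before the cutoff -- the in-memory
    analogue of SELECT COUNT(*) WHERE updated_at < cutoff."""
    return sum(1 for _id, updated_at in rows if updated_at < cutoff)

primary = [("a", 1), ("b", 2), ("c", 3)]
replica = [("a", 1), ("c", 3)]  # "b" went missing in transit

cutoff = 3  # i.e. now() minus a safety margin for in-flight writes
print(count_before(primary, cutoff))  # 2
print(count_before(replica, cutoff))  # 1 -- mismatch, something was lost
```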


Given that ElasticSearch is "eventually consistent", how do you know when it has caught up? How do you know what data is missing once the count is wrong?

It's solvable, of course, but it's a problem that pops up with any synchronization system, and I'm surprised nobody (apparently) has written one, because it requires a fairly good state machine that can also compute diffs. Once a store grows to a certain state, you do not want to trigger full syncs, ever.

The best, most trivial (in terms of complexity and resilience) solution I have found is to sync data in batches, give each batch an ID, and record each batch both in the target (e.g., ElasticSearch) and in a log that belongs to the synchronization process. The heuristic is then to compute the difference between the two logs to see how far you need to catch up.

This will only work in a sequential fashion; if ElasticSearch loses random documents, that loss won't be picked up by such a system. You could fix this by letting each log entry store the list of logical updates, checksummed, and then doing regular (e.g., nightly) "inventory checks".
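A minimal sketch of the batch-log heuristic, with the two logs simulated as plain lists of batch IDs (`catchup_point` is a made-up name, not part of any real sync tool):

```python
def catchup_point(source_log, target_log):
    """Return the batch IDs recorded by the sync process (source_log) that
    never made it into the target's own log -- i.e. how far to catch up.
    Assumes batches are appended sequentially, as described above."""
    applied = set(target_log)
    return [batch_id for batch_id in source_log if batch_id not in applied]

source_log = [1, 2, 3, 4, 5]
target_log = [1, 2, 3]        # the target lost the tail
print(catchup_point(source_log, target_log))  # [4, 5]
```

As the comment notes, this catches a lost *tail* but not random per-document loss inside an already-acknowledged batch; that requires the checksummed inventory checks.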


> Given that ElasticSearch is "eventually consistent", how do you know when it has caught up?

That's nice in theory; in practice, though, ElasticSearch doesn't behave like an eventually consistent system--it behaves like a flawed fully consistent system. It doesn't self-repair enough to be eventually consistent. If you get out of sync by more than a few seconds, you're going to have to repair the system manually in some fashion. It never "catches up."

Additionally, in practice, most of the data loss (and real eventual-consistency behavior) you'll see in an ElasticSearch+primary-data-store system isn't coming from within ES--it's coming from the queues people typically use in the sync process. So to some degree you're going to need to handle this on an application-specific basis.

> How do you know what data is missing once the count is wrong?

In practice, people just ignore it or do a full re-index. Theoretically, you should be building merkle trees.

> Once a store grows to a certain state, you do not want to trigger full syncs, ever.

This is not really true. You need to maintain enough capacity for full syncs, because someone will need to do schema changes and/or change linguistic features in the search index.


Of course you need to be able to do full syncs, and the sync is not a problem. But one needs to solve the two challenges I have described:

1. Determine how to do an incremental update, given that only the tail of the stream of updated documents is missing. Not as simple as just counting.

2. Determine when you must give up and fall back to a full sync; this is when not just the tail is missing, and finding the difference is computationally non-trivial. You'll only want to do this once you're sure that you need to.
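The two decisions above can be sketched together: check whether the missing documents form a contiguous tail of the update stream, and fall back to a full sync only when they don't. This is a toy simulation over in-memory ID lists (`sync_plan` is an invented name, and real stores would never materialize full ID sets like this):

```python
def sync_plan(expected_ids, present_ids):
    """Decide between an incremental tail catch-up and a full re-sync.
    expected_ids: IDs the source says should exist, in update order.
    present_ids: IDs actually found in the target."""
    present = set(present_ids)
    missing = [i for i in expected_ids if i not in present]
    if not missing:
        return ("in-sync", [])
    # The missing set is a pure tail iff it equals the last len(missing) IDs.
    if missing == expected_ids[-len(missing):]:
        return ("incremental", missing)   # case 1: replay the tail
    return ("full-sync", missing)         # case 2: holes in the middle

print(sync_plan([1, 2, 3, 4], [1, 2]))     # ('incremental', [3, 4])
print(sync_plan([1, 2, 3, 4], [1, 3, 4]))  # ('full-sync', [2])
```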

My point remains that ElasticSearch's consistency model means it's hard to even do #1, which is the day-to-day streaming updates.

My second point was that this — streaming a "non-lossy" database as a change log into one or more "lossy" ones — is such a common operation that it should be a solved problem. It certainly requires something more than a queue.

(In my experience, queues are terrible at this. One problem is that it's hard to express different priorities this way. If you have a batch job that touches 1 million database rows, you don't want these to fill your "real-time" queue with pending indexing operations. Using multiple queues leaves you open to odd inconsistencies when updates are applied out of order. And so on. Polling triggered by notifications tends to be better.)
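One way to read "polling triggered by notifications": notifications carry no payload, only "something changed"; the consumer then reads everything past its cursor in one batch, so a million-row batch job coalesces instead of flooding a queue, and a monotonic sequence number keeps updates ordered. A toy sketch under those assumptions (`CursorPoller` and `fetch_since` are invented names):

```python
class CursorPoller:
    """Notification-triggered polling: keep a cursor into a change table
    keyed by a monotonically increasing sequence number, and read
    everything past it whenever a (payload-free) notification arrives."""
    def __init__(self, fetch_since):
        self.fetch_since = fetch_since  # callable: cursor -> [(seq, doc), ...]
        self.cursor = 0

    def on_notify(self):
        rows = self.fetch_since(self.cursor)
        if rows:
            self.cursor = max(seq for seq, _ in rows)
        return [doc for _, doc in rows]

# Simulated change table; a real one would live in the primary store.
changes = [(1, "a"), (2, "b"), (3, "c")]
poller = CursorPoller(lambda cur: [r for r in changes if r[0] > cur])
print(poller.on_notify())  # ['a', 'b', 'c'] -- coalesced into one batch
changes.append((4, "d"))
print(poller.on_notify())  # ['d']
```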


> "My second point was that this — streaming a "non-lossy" database as a change log into one or more "lossy" ones — is such a common operation that it should be a solved problem. It certainly requires something more than a queue."

This is almost exactly the cross-DC replication problem, which is a subject of active research.

A changelog on the source side is only sort of helpful. It's useful as a hint about which rows may have changed, but given that you don't trust the target database, you also need to do repair.

Correct repair is impossible without full syncs (or at least a partial sync since a known-perfectly-synced snapshot), unless your data model is idempotent and commutative. On-the-fly repair requires out-of-order re-application of previous operations.

The easiest way to reason about commutativity is to just make everything in your database immutable. So this is a solved problem, but it requires compromises in the data model that people are mostly unwilling to live with.

You can do pretty well if your target database supports idempotent operations.

If you're trying to do pretty well, then you can do a Merkle tree comparison of both sides at some time (T0 = now() - epsilon) to efficiently narrow down which records have been lost or misplaced. Then you re-sync them. Here, for efficiency, your Merkle tree implementation will ideally span field(s) that are related to the updated_at time, so that only a small subset of the tree is changing all the time. This is a tricky thing to tune.
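A minimal sketch of that comparison, with both sides' trees built over the same row buckets in memory (real anti-entropy implementations exchange tree levels over the network; `build_tree` and `diff_leaves` are invented names):

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def build_tree(leaves):
    """Build a Merkle tree bottom-up as a list of levels; 'leaves' is a
    list of hex digests, one per row bucket."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + (prev[i + 1] if i + 1 < len(prev) else ""))
                       for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(tree_a, tree_b):
    """Walk both trees top-down, descending only into subtrees whose
    hashes disagree; return the leaf buckets needing re-sync."""
    assert len(tree_a) == len(tree_b)
    suspects = {0}  # start at the single root node
    for level in range(len(tree_a) - 1, 0, -1):
        next_suspects = set()
        for i in suspects:
            if tree_a[level][i] != tree_b[level][i]:
                for child in (2 * i, 2 * i + 1):
                    if child < len(tree_a[level - 1]):
                        next_suspects.add(child)
        suspects = next_suspects
    return sorted(i for i in suspects if tree_a[0][i] != tree_b[0][i])

rows_a = ["r0", "r1", "r2", "r3"]
rows_b = ["r0", "r1", "rX", "r3"]   # bucket 2 diverged
tree_a = build_tree([h(r) for r in rows_a])
tree_b = build_tree([h(r) for r in rows_b])
print(diff_leaves(tree_a, tree_b))  # [2] -- only that bucket is re-synced
```

The tuning point from the comment shows up in how rows map to buckets: if buckets track updated_at, recent churn stays confined to a few subtrees.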

You'll still be "open to odd inconsistencies when updates are applied out of order" if you haven't made your data model immutable, but I think this is mostly in line with your hopes.


Merkle trees (as used for anti-entropy in Cassandra, Riak etc.) are only practical when both sides of the replication can speak them, though.

I wonder, are Merkle trees viable for continuous streaming replication, not just repair?


You'd implement the Merkle trees yourself in the application layer. Alternately, you could use hash lists. It'd be somewhat similar to how you'd implement geohashing. Let's say you take a SHA-X of each row, stored as a hex varchar, then do something like "SELECT SUBSTR(`sha_column`, 1, n), COUNT(*) FROM t GROUP BY 1". If there's a count mismatch, drill down into it by lengthening the prefix: the first two chars, then the first three, etc. Materialize some of these views as needed. It's ugly and tricky to tune.
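The prefix drill-down can be simulated in a few lines, with plain lists of hex digests standing in for the two tables and a `Counter` standing in for the GROUP BY (`bucket_counts` and `drill_down` are invented names):

```python
import hashlib
from collections import Counter

def bucket_counts(sha_values, n):
    """Count rows per n-character hash prefix -- the in-memory analogue
    of the SUBSTR ... GROUP BY query above."""
    return Counter(s[:n] for s in sha_values)

def drill_down(side_a, side_b, max_depth=4):
    """Lengthen the prefix only where counts disagree; return the
    mismatched prefixes at max_depth, which bound the rows to re-sync."""
    suspects = [""]
    for n in range(1, max_depth + 1):
        a, b = bucket_counts(side_a, n), bucket_counts(side_b, n)
        suspects = [p for p in set(a) | set(b)
                    if any(p.startswith(s) for s in suspects) and a[p] != b[p]]
        if not suspects:
            return []
    return sorted(suspects)

def sha(row):
    return hashlib.sha256(row.encode()).hexdigest()

a = [sha(f"row{i}") for i in range(100)]
b = [x for x in a if x != sha("row42")]   # one row lost on the B side
# drill_down narrows 100 rows to the single 2-char prefix covering the loss:
print(drill_down(a, b, max_depth=2))
```

Each round only inspects buckets that already mismatched, which is the "drill down" part; the ugliness the comment mentions comes from keeping those per-prefix counts fresh in a live database.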

Merkle trees aren't interesting in the no-repair case, as the changelog is more direct and has no downside.


It is often advocated as a datastore for logging data... which means (in that case) it's usually the primary datastore but perhaps not mission-critical.


It's a great index for log data.

Spew your log data into a standard syslog server, while also pumping it into Logstash.

Using Elasticsearch as your canonical log storage would be ridiculous.


Once you start relying on it to understand the state of whatever it's logging, it's mission-critical.


It would probably be good enough as a store for A/B testing information - losing data here isn't critical but writing speed is.



