The SQLite Query Planner (sqlite.org)
117 points by otoolep on Aug 21, 2015 | hide | past | favorite | 22 comments


The brilliance of the "SQLite Query Planner Stability Guarantee", or more succinctly put, its deterministic query planning, cannot be overstated.

With a "real" database, if your data has an unfortunate pattern that confuses ANALYZE statistics, you can suddenly get thousandfold slowdowns, because the query planners assume that their statistical information is perfect and do not apply any kind of conservativeness in their planning. So you need a babysitter for them: a DBA.

I've missed the predictability of SQLite when moving stuff to PostgreSQL. If the query plan in SQLite looks good, it'll stay good. In a real database, you're sitting on a timebomb waiting to go off when autoanalyze ends up running at a bad time.


> If the query plan in SQLite looks good, it'll stay good. In a real database, you're sitting on a timebomb waiting to go off when autoanalyze ends up running at a bad time.

That's unfortunately naive. For example, for a certain type of query there are two possible plans. Which one is better depends on the query parameters and the distribution of the data, and getting the plan wrong can be the difference between the query finishing and the query not finishing. With a stability guarantee you can't even try to make an intelligent choice based on the parameters.


> getting the plan wrong can be the difference between the query finishing and the query not finishing

But that's exactly the point. If SQLite got the plan right at first, you know it will always "get it right" in the future, i.e. the query will finish. Maybe it'll go from being optimal to somewhat less optimal, but it'll still be fine -- or at worst you'll see a very, very gradual slowdown (as your tables grow) that you'll have time to investigate later.

With MySQL or Postgres, that can suddenly change and go from, in your example, the query finishing, to the query not finishing, because the database suddenly chooses to use a different plan. That can be a catastrophic production failure, when a query that used to run in 10ms now takes 10s.
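To make the "verify once, trust thereafter" workflow concrete: SQLite lets you inspect the chosen plan up front with EXPLAIN QUERY PLAN. A minimal sketch, assuming an invented `orders` schema and index name:

```python
import sqlite3

# Hypothetical schema, purely for illustration.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

# EXPLAIN QUERY PLAN reports the chosen plan without executing the query.
rows = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
).fetchall()
for row in rows:
    print(row)  # the detail column should mention idx_orders_customer
```

Under the stability guarantee, the plan you verify this way stays the plan you ship, as long as the schema, the SQLite version, and the ANALYZE data don't change.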


> But that's exactly the point. If SQLite got the plan right at first, you know it will always "get it right" in the future, i.e. the query will finish. Maybe it'll go from being optimal to somewhat less optimal, but it'll still be fine -- or at worst you'll see a very, very gradual slowdown (as your tables grow) that you'll have time to investigate later.

Reality is much more complex. Unless your data set size and distribution continuously stay the same, the algorithm that made complete sense initially (say an index nested loop) will often not make any sense anymore after some growth. Imagine a nested loop over a couple hundred thousand rows where the inner side's index lookup always has to hit storage. That just won't perform and will, if run concurrently, slow the entire system to a standstill because it consumes all the IOPS. In many cases that slowdown won't be gradual at all -- it'll be fast as long as the working set fits entirely into the available memory, and will degrade sharply soon after it no longer does (depending on the value distribution).


I've seen situations where, depending on how MySQL is feeling, it will sometimes use an index for an ORDER BY, and sometimes not (falling back to a filesort instead).

For some data sets, it's reasonably quick (100ms), for others it's quite slow (1600ms).

Whereas if I add 'force index for order by' to the query, I get a predictable 150ms across the board.
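SQLite offers a comparable pin via its INDEXED BY clause, which goes a step further than a hint: the statement fails outright if the named index can't be used, rather than silently degrading. A small sketch with an invented table and index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
con.execute("CREATE INDEX idx_t_b ON t(b)")

# INDEXED BY raises an error if idx_t_b can't satisfy the query,
# so a later schema change can't silently swap in a different plan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t INDEXED BY idx_t_b ORDER BY b"
).fetchall()
print(plan)  # the plan detail should name idx_t_b
```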

Predictability is often more important than optimality. Variance in service response time is hard to work around, especially when there's a UI involved. Algorithmically, that often means always using the logarithmic data structure even when the constant overhead is higher for smaller data.
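The same trade-off shows up outside databases. A sketch in Python, choosing O(log n) binary search over a linear scan so lookup latency stays flat as the data grows (names here are invented for illustration):

```python
import bisect

# Keep the data sorted so lookups are O(log n) via binary search,
# even though a linear scan would win on tiny inputs.
data = sorted([5, 3, 8, 1, 9])

def contains(xs, x):
    """Binary-search a sorted list for x."""
    i = bisect.bisect_left(xs, x)
    return i < len(xs) and xs[i] == x

print(contains(data, 8))  # True
print(contains(data, 4))  # False
```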


I have known so many engineers that rely on wishful thinking, it gets pretty frustrating after a while. If they do it with bugs they'll do it with resource planning.

Another place where they do this is oversubscribing a server: putting four services that each need 20-30% of the system's memory or network on one box, then not understanding why throughput drops by 90% when the system becomes saturated and starts swapping or dropping packets.


I always wondered why databases don't allow the direct specification of the query plan, instead of forcing SQL developers and DBAs to guess the right words that will, usually, produce the right query plan. It seems like it would be a lot easier to express the query plan in, say, a protobuf, rather than parsing a DSL on every query.


I know this is a simplistic and ideological notion, but anybody who can manually plan a query better than the query planner should be using their time to write better query planners, optimizers, and statistics collectors as opposed to manually writing query plans.


Most databases already have ways for you to specify the exact plan you want used. In SQL Server, for example, you can force join order, force index usage, force the type of join used (loop, merge, hash), etc.

There are terrible drawbacks to using any of those techniques (data changes, indexes being modified / dropped, etc), which is why, imo, they should very rarely be used. However, 'allowing direct specification of the query plan' is already implemented in that respect.

The better way to address the problem of different parameters causing performance issues is forcing recompiles, which solves the problem 95% of the time.


This is what collecting and managing statistics[0][1] is about.

I know Oracle and Postgres implement this -- don't know about SQL Server, MySQL, etc...

[0] http://docs.oracle.com/cd/B19306_01/server.102/b14211/stats....

[1] http://www.postgresql.org/docs/current/static/runtime-config...

edit: found/added info for postgres


Sure but I don't want to influence the query plan, I want to dictate it.


I know at least Microsoft's SQL Server allows you to simulate this by fixing a query plan: https://technet.microsoft.com/en-us/library/Cc917694.aspx

Having worked with SQL Server for years as a DBA, I've never had to use it once, but I was never working with databases over, say, 1 TB. So while we did encounter slow query problems, they could usually be resolved by introducing reliable indexes, and the performance cost of this was low.


The "Next Generation Query Planner" (NGQP) document https://www.sqlite.org/queryplanner-ng.html linked to from this document makes for very interesting reading.

In some cases, upgrading from the pre-3.8.0 legacy query planner to the NGQP in 3.8.0 may cause a serious performance regression. Running ANALYZE on your DB should fix that.
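For reference, ANALYZE writes its statistics into the sqlite_stat1 table, which the NGQP consults when costing candidate plans. A quick sketch (table and index names invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("CREATE INDEX idx_t_x ON t(x)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

# ANALYZE populates sqlite_stat1 with per-index row counts that the
# query planner uses to estimate the cost of alternative plans.
con.execute("ANALYZE")
stats = con.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)  # e.g. [('t', 'idx_t_x', '1000 1')]
```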

Apart from that, the Stability Guarantee sounds, I don't know, obvious? Repeatability and predictability under more or less the same conditions should be a given, I would have thought. Apparently in the DB world that isn't necessarily the case?


Does anyone know how I can contribute to this documentation site? This entire site is so much more readable if you just put a max-width of about 800px on the container div.


Report issues on the sqlite-users mailing list. The SQLite team is very careful about accepting contributions and requires signed contribution papers. They generally don't accept outside code: they do very thorough testing, so completed code needs accompanying complete test code too. They are also fanatical about backwards compatibility.

https://www.sqlite.org/testing.html

https://www.sqlite.org/copyright.html



I run my browser at half my laptop's screen width. It's ironic that I have to turn off styles (Firefox: View/Page Style/No Style) so that I can read the page without horizontal scrolling back and forth.

But more fundamentally, as a free service, I wonder how they can afford the server and storage resources needed for this (it's not in their FAQ).

I'm not a member, so I don't know what their logged in pages look like, nor their readability-ized pages. But I doubt if advertising is paying the bills.

The next obvious thing is some kind of analytics. "People who read this also read that." Sold to whoever can use such information.

Or maybe it's a charitable effort. Or maybe they expect to be sold, I don't know.

I'm not bashing them at all, but I am becoming much more critical, privacy-wise, of everything I touch on the web. I'm always asking, subconsciously at least, "what's in it for them?"

Similarly, I'm looking up at my toolbar, and wondering what's in it for the people who made my addons for Ghostery, Disconnect and UBo. They would seem to be sitting on a goldmine of analytics.

Dunno. But my web experience feels much more adversarial today than it did in the early nineties.


If you prefer a narrower view, why not just resize your browser window to be more narrow? Some of us find full-width text more readable, and artificially constrained viewports are not something we can easily resize.


1 through 5 all essentially amount to this: the WHERE constraint indicates a contiguous subsequence of the index, or a small list of such subsequences (and, by the way, SQLite is mildly clever about reparsing your constraint to try to turn it into a small list of contiguous subsequences). It helps to imagine the rows (or rather the index entries) sitting in order in a file, and to consider where the wanted rows would sit in that file.
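A sketch of that intuition, using an invented single-column index, showing which WHERE shapes map to contiguous runs of index entries:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k INTEGER, v TEXT)")
con.execute("CREATE INDEX idx_t_k ON t(k)")

def plan(sql):
    """Return the query plan rows for a statement."""
    return con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()

# A range constraint selects one contiguous run of index entries: a SEARCH.
print(plan("SELECT v FROM t WHERE k BETWEEN 10 AND 20"))

# An OR of two ranges is a small list of contiguous runs; SQLite's OR
# optimization can often still use the index for each branch.
print(plan("SELECT v FROM t WHERE k < 5 OR k > 100"))

# A constraint on an expression doesn't map to contiguous index entries,
# so SQLite falls back to a full table SCAN.
print(plan("SELECT v FROM t WHERE k % 2 = 0"))
```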


I like SQLite a lot, but for me it really seems to break down when the db grows to about 10 GB in size (64-bit Ubuntu, ext4 filesystem).

Does this mesh with others' experiences, or can SQLite databases be performant above 10 GB?


When learning sqlite3, at first I thought it was a toy database, but then after some exploring I found that it's a real big thing! :)


Anybody know how the query planner may change in 4.0?



