I thought this was a great article, and trigram indexes are definitely an amazing tool to have in the arsenal.
I think PostgreSQL fulltext search (with tsvector, tsquery etc.) is given an unfairly bad rap though. I'm not sure that "full text search works best when the text vectors are stored in physical columns with an index" is true - in my experience there's no performance penalty to just indexing the tsvector expression - no need to worry about additional columns or triggers.
I also think the assertion that the key problem is that "full text search is that words are broken up according to the rules defined by the language of the text" is very context-dependent. In many situations that's the most awesome feature of fulltext search. Usually when I search for "cat" I'm not interested in results for catacombs or categories, but when I search for restaurants, results matching restaurant (singular) are relevant too.
I definitely see that for a use-case like GitLab's, where the data includes code, full text search's stemming would be a hindrance rather than a help.
One of the perks of using columns is that you can set up multiple columns with different dictionaries and specify which one you're searching against.
I'm not sure how easy it would be to ensure the index gets picked up when searching on a weighted field with multiple columns specified, or whether the program would need to know in advance which columns those were. Would the column order matter for the index to be used?
Additionally, a column can also easily include relevant search data from related records in other tables, or even other data sources if desired.
There's nothing stopping you indexing multiple columns with different dictionaries either :). And indexing works beautifully with multiple weighted columns.
The related records point is true though - indexes don't help if you're trying to implement multi-table search. Although then I'd argue a materialized view (refreshed with triggers) might be better than putting more columns into one of the data tables.
> I'm not sure that "full text search works best when the text vectors are stored in physical columns with an index" is true
It's mostly based on my past experiences. At my previous gig we indexed quite a bit of data using PostgreSQL's full text search system and we noticed significantly improved performance when using physical columns containing text vectors over just using GIN indexes. It's been a while so the details are a bit fuzzy.
Ah ok, interesting :). We were weighing up exactly this kind of tradeoff a few weeks ago, and the benchmarks we did put the two approaches more or less neck and neck, so maybe PG has managed to optimize things a bit? Or, since our search is over hundreds of thousands of rows, maybe things only slow down in the millions and above...
If I remember correctly the tables we ran full text search on didn't have millions of rows, and we used PostgreSQL's simple/basic language (whichever one of the two it was). We did also run a similarity filter (using the pg_trgm similarity() function), maybe we also had trigram indexes on top. I can't really remember.
Trigrams are a very straightforward solution to this but they have one major limitation: They break with misspellings / alternative spellings. For example the words "vodka" and "votka" share no trigrams yet are obviously very similar.
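To see the gap concretely, here's a minimal Python sketch of interior trigram extraction (note this is an illustration, not pg_trgm itself — pg_trgm additionally pads words with spaces, so in Postgres the two words do share a few boundary trigrams):

```python
def trigrams(word):
    # Interior trigrams only: every 3-character window of the word.
    return {word[i:i + 3] for i in range(len(word) - 2)}

print(sorted(trigrams("vodka")))                      # ['dka', 'odk', 'vod']
print(sorted(trigrams("vodka") & trigrams("votka")))  # [] -- no overlap
```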
I recently worked on this problem for a project and came up with a more robust solution using pre-computed Levenshtein indexes. You've probably heard of Levenshtein distance as a measure of word similarity. It counts the number of single-character edits that are required to transform one string into another. For example "vodka" and "votka" have a Levenshtein distance of 1.
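For reference, the classic dynamic-programming formulation of Levenshtein distance is short enough to sketch here (a generic textbook implementation, not the one from the linked project):

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    # from the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("vodka", "votka"))  # 1
```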
Using a related approach (single character edits) it's possible to build a so-called Levenshtein index.
The major issue with a vanilla Levenshtein index is that it's impractically large. There is a way to make a space / time trade-off however that makes it much smaller. The basic idea is that instead of generating all Levenshtein variations for a word you only use the set of variations created by deleting single characters for a word.
Since that doesn't match all possible variations we have to compensate somehow. For example with a full index "votka" would be in the set of variations for "vodka" which means if we had misspelled vodka when searching for it we'd still get a match. With our deletions-only index, however, we'd get "voka" as our closest match. To work around this, we generate the deletion-only variations of our search term as well. In this case our search term "vodka" would become "odka", "vdka", "voka", etc. and each of those terms would be matched against the index. That means our query is slower overall as it contains a number of sub-queries but in practice it's more than fast enough.
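A minimal Python sketch of the deletions-only scheme described above (this is essentially the SymSpell idea; the names and structure here are my own illustration, not the author's actual implementation):

```python
def deletions(word):
    # All variants produced by deleting exactly one character.
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(words):
    # Map each word, plus each of its single-deletion variants,
    # back to the original word(s) that produced it.
    index = {}
    for word in words:
        for key in {word} | deletions(word):
            index.setdefault(key, set()).add(word)
    return index

def lookup(index, term):
    # Compensate for the deletions-only index by also generating
    # the deletion variants of the *search term* and probing each.
    matches = set()
    for key in {term} | deletions(term):
        matches |= index.get(key, set())
    return matches

index = build_index(["vodka", "whisky"])
print(sorted(lookup(index, "votka")))  # ['vodka']
```

Here "votka" matches because deleting the 't' gives "voka", which is also what deleting the 'd' from "vodka" produced at index time.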
There are some additional ideas in there, such as how to make a Levenshtein index handle prefix matches better (e.g. for incremental search), that may be helpful if you need to implement something like this.
If you do read that I'd like to point out that I no longer recommend LevelDB. That doesn't change anything of importance in the post though.
Super interesting approach, thanks for contributing it.
From what I see in your article, you used this for searching through mail, so I guess it works on a pretty decent volume of data, but just to confirm: do you think it's viable on something like GitLab's multi-resource full text search? For reference, what is the size of this index for your mailbox? (Assuming it's a typical contractor dev mailbox, with mainly lots of notification mails and discussions with daily customers.)
PS: don't worry about the other reactions. Users have invaded our dev world nowadays; that's the sad side of having attractive salaries.
A project I worked on used trigram indexes a few years ago to solve the autocomplete problem. They are AMAZING - they basically allow you to run LIKE queries with wildcards anywhere in the string (as opposed to suffix-only-wildcards) against an index.
An autocomplete search for e.g. "rub rai" becomes the following SQL query:
select * from topics where name ilike '%rub%rai%';
Which, thanks to the magic of trigram indexes, returns in just a few ms, even against hundreds of thousands of rows. Without trigram indexes, the same query would be a full table scan and would be too slow to justify hooking up to a search-as-you-type UI.
create extension pg_trgm;
create index topic_name_gin on topics using gin (name gin_trgm_ops);
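For intuition, here's a rough Python sketch of how such a query can be answered: the literal pieces of the pattern yield trigrams that every matching row must contain, so most rows are eliminated before the pattern itself is ever checked. (This illustrates the idea; it's not how Postgres implements it internally.)

```python
import re

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def like_candidates(rows, pattern):
    # Split the ILIKE pattern on '%' wildcards; every literal piece of
    # length >= 3 contributes trigrams a matching row *must* contain.
    # Rows missing any required trigram are pruned without being
    # scanned -- roughly what the GIN trigram index buys you.
    pieces = [p for p in pattern.split('%') if len(p) >= 3]
    required = set().union(*(trigrams(p) for p in pieces)) if pieces else set()
    regex = re.compile('.*'.join(re.escape(p) for p in pattern.split('%')), re.I)
    for row in rows:
        if required <= trigrams(row.lower()) and regex.search(row):
            yield row

topics = ["Ruby on Rails", "Rust", "Railway modelling"]
print(list(like_candidates(topics, "rub%rai")))  # ['Ruby on Rails']
```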
Hi! Author in question here, if anybody would like to know more about this particular feature I'll be happy to answer any questions. The changes discussed in this article are currently available on GitLab.com, 8.6 will be released on the 22nd as usual.
Why bother supporting both MySQL and Postgres for the backend? Wouldn't picking one (~cough~ Postgres) simplify things?
From my experience, attempting to support multiple backend stores leads to either crippled functionality, i.e. lowest common denominator CRUD usage, or lots of messy if/then logic to use DB specific features.
Personally I'd love to drop support for MySQL but we have a few too many organizations that only run MySQL _and_ want to use GitLab. It would be a waste if said organizations wouldn't be able to use GitLab.
The approach I tend to go with is to make queries perform as well as possible on PostgreSQL without using too many PostgreSQL-specific bits (which would make supporting both DBs more difficult), while making sure they still perform well enough on MySQL.
IIRC Google also used trigrams for their Code search engine. This allowed them to make a first filter before running a regex search on the resulting documents.
ah yes, the old 're-implement the search engine inside the database' project, undoubtedly put up on the board because someone is tired of their get-the-data-from-the-database-to-the-search-engine process breaking constantly.
next on the map: discovering how shitty dictionary management is, the joys of NLP, and abandoning the project entirely because you realize all this stuff has already been solved in 3 or 4 different ways and the getting-the-data-from-the-database-to-the-search-engine process isn't really that bad, and pulling your hair out from customers asking insane questions because they don't understand how search engines actually work and why can't this be like google? can't you just do it how google does it, even though you don't have $100B and 50,000 employees?
i realize this is a product feature but i have ptsd on this topic so i had to vent.
> the joys of NLP

This is a search index for GitLab, which is used primarily to store code, not natural language.
Looking at its competitor: GitHub's search engine clearly has a basis in NLP, and ditches many punctuation characters (https://help.github.com/articles/searching-code/), which are far more important in code than they are in English. For example, I can't search for code containing "$/" where it should have "\Z/" to match the end of a string. So right now I have ~250 repos checked out from our GitHub Enterprise so I can search them offline.
The queries I actually want to run are, pretty much, regexes. Those can be accelerated by trigram indexes like the one GitLab are using (see e.g. https://swtch.com/~rsc/regexp/regexp4.html), but apparently this isn't implemented by, say, Elasticsearch ("Whereas prefix matching can be made more efficient by preparing your data at index time, wildcard and regular expression matching can be done only at query time." https://www.elastic.co/guide/en/elasticsearch/guide/current/...)
I'd agree with you if this wasn't code. I've seen terrible text search in the DB, and I've built document management systems where we integrated real search engines. But those horses aren't for this course.
sure, primarily code. except when code contains language, like in comments, and other related documentation that's stored in-repo. and of course, people are going to completely mis-use the repo to store ancillary human-readable binary documents, which will need to be converted, indexed, searched, updated, purged, etc, etc...
since developers tag, reference, and document their code with comments, they're going to expect the full suite of NLP treatments with all indexed content, including code.
which basically means: you'll have to tokenize out the special characters for NLP, but retain them for literal code searches, thereby increasing the size of the inverted index for every permutation of the desired search criteria.
maybe you can detect and filter out all the comments, and index them separately, or maybe have some kind of dual system where the NL indexing system exists separately from the code indexing system ... you see where this is going?
> they're going to expect the full suite of NLP treatments with all indexed content, including code.

Nope. In 30 years of searching through code, I've never once felt the need for that. On the other hand, almost every time I use GitHub search, I find the limitations of a word-oriented index get in the way (looking for partial keywords to find code relevant to an API, and so on; it's not just special characters).