To keep your users engaged, search results need to show up instantly and be relevant to them, even when they make typos.
To try this out, I went to the demo page for searching TV episodes (http://www.algolia.com/demo) and searched for "The Wire Season 2". Here are the four results given, with the highlighted portions bracketed:
[The Wire] Gag Reel [Season 2]
[The] Simple Life [Season 2] - Special - The Stuff We [Were]n't Allowed To Show You
[The] Farmer Wants a [Wife] (Australia) [Season] 6, Episode [2]
[The] Cosby Show [Season 2] DVD Extra: New Interview with [Dire]ctor Jay Sandrich
Rather than seeking "engagement", I'd put more emphasis on having high quality search results. Having 3 of the 4 results ignore the properly typed title of the show is a terrible interface. Correcting "Wire" to match "Director" is absurd.
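One guess at how "Wire" ends up matching "Director": if the engine treats each query word as a prefix and also tolerates one typo, then "Wire" is a single substitution away from "Dire". The sketch below only illustrates that idea. The function name and the combination of prefix matching with a one-typo tolerance are assumptions, and nothing in this thread confirms the actual matching rules.

```python
def prefix_edit_distance(query: str, word: str) -> int:
    """Smallest Levenshtein distance between `query` and any prefix of `word`.

    Hypothetical helper: it models "prefix match with typo tolerance",
    not the real engine's matching rules.
    """
    query, word = query.lower(), word.lower()
    # prev[j] = distance between the part of the query read so far and word[:j]
    prev = list(range(len(word) + 1))
    for i, qc in enumerate(query, start=1):
        cur = [i]
        for j, wc in enumerate(word, start=1):
            cur.append(min(
                prev[j] + 1,               # drop a query character
                cur[j - 1] + 1,            # insert a word character
                prev[j - 1] + (qc != wc),  # match or substitute
            ))
        prev = cur
    # Any prefix of the word is allowed to be the match target.
    return min(prev)


for candidate in ["Wire", "Wife", "Weren't", "Director"]:
    print(candidate, prefix_edit_distance("Wire", candidate))
```

Under this model every highlighted fragment in the results above ("Wife", "Were", "Dire") sits within one typo of "Wire", which would explain why they are treated as matches at all, though not why they are ranked so close to the exact match.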
The sad part is that these results might make you think that the episodes for season 2 of "The Wire" aren't in the database. But they are, just not indexed in a way that lets them be found using the exact phrase "Season 2".
Trying to be more constructive, there is a typo in the first sentence of your Intro, where the name of your company is spelled wrong. Also, "Real-Time Search" usually means search against a database that is being constantly updated. Anyway, I need to get back to screaming at the kids on my lawn.
Thanks for your comment, you are right that the data are not perfectly indexed for this query. We have taken data from one of our customers, whose use case is to search only for TV show names.
If you're wondering if Algolia is right for you, just ask them. Within 5 minutes of initiating the chat window I had the CEO, Julien, helping guide me through the process of getting my XML into JSON to see if it was right.
Then he asked me more about my use case and actually steered me towards an Elasticsearch solution since it sounded like a better fit.
All in all we went back and forth communicating for 3-4 days, only for him to lose me as a customer by necessity, and I already feel like a satisfied customer.
I don't understand what makes this particular service tout 'realtime' as its primary selling point.
Don't all search engines (and other hosted search services) aim for fast (100s of milliseconds) retrieval, show-as-you-type and realtime indexing?
Don't get me wrong. Getting all this right is very hard, and kudos for the great performance numbers (vs Elasticsearch), but 'realtime search' smacks of marketing copy.
You can try to search-as-you-type on our hacker news search to see the difference with other search engines: http://hn.algolia.com/
You have relevant results after each keystroke, even with typos. Classical engines use approximation to perform instant search, like the suggest module of Elasticsearch.
It seems like for the HN search, your ranking function is the number of votes (or very highly correlated with it). If this is true, it's not solving a problem as hard as 'classical' engines, which compute a lot more. It would be great to demonstrate this sort of performance on comparable rank functions. I don't know anything about Elasticsearch ranking though, maybe they have a very simple rank function too.
It is more than just a sort on number of points :)
Our value is to be able to mix textual relevance with business data (in that case the number of points, but it can be the number of page views, number of followers, ...).
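To make "mixing textual relevance with business data" concrete, here is a minimal sketch of one way it could work: sort by a crude text score first, and use the points only to break ties. The `Story` type, the scoring rule, and the sample data are all hypothetical, and this is not a description of the real ranking formula.

```python
from dataclasses import dataclass


@dataclass
class Story:
    title: str
    points: int  # the "business data" in the HN case


def text_score(query: str, title: str) -> int:
    """Number of query words that prefix-match some word of the title."""
    words = title.lower().split()
    return sum(any(w.startswith(q) for w in words) for q in query.lower().split())


def rank(query: str, stories: list[Story]) -> list[Story]:
    """Best textual match first; points decide between equal text scores."""
    hits = [s for s in stories if text_score(query, s.title) > 0]
    return sorted(hits, key=lambda s: (-text_score(query, s.title), -s.points))


stories = [
    Story("Show HN: fast search", 50),
    Story("Search engines compared", 300),
    Story("Fast search as a service", 10),
    Story("Unrelated post", 999),
]
for s in rank("fast search", stories):
    print(s.points, s.title)
```

With this ordering the 300-point story still ranks below both full matches, so the result is not a plain sort on points, and the 999-point story never appears because it does not match the text at all.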
No offense, but I hate your business model. Convincing devs to put their search db in the hands of a small hosted startup is a recipe for disaster (see indextank).
There must be a better way. ElasticSearch and MongoDB use open source business models that I think tend to work much better for smart devs picking technologies (irrespective of their actual products).
Hmm, from my own experience: yes, there are alternative open search engines, quite a lot actually.
But did you ever try one on your own? Most of them are a nightmare to set up and are atrociously slow as soon as you get a few thousand entries...
Sometimes, it's definitely worth it to externalize some expertise. Search is definitely not easy to master.
(And I was a user of indextank when they shut down)
Agreed, I think there's a good middle ground. Lots of hosted APIs use open technologies, and I'm paying for the convenience of someone handling everything for me, without being locked into a single provider who will likely get acquihired and shut down at some point.
Unfortunately it does not seem to have the accuracy, or breadth, of the old hnsearch.com. Hopefully this will be fixed in time, but I have found it lacking relevant results and myself switching back to hnsearch on most occasions.
I also wonder about all the other small applications in the "HN ecosystem", like karma tracker, that rely on the hnsearch API. I see that algolia has an API, but will those other projects just die too?
That explains why I noticed a couple of things off on my aggregator. I was using hnsearch.com/rss, which recently seems to have been altered and is now missing most of the data I was actually using.
All great and good to be very fast, but at what price? From their page it costs $450 for 5mil records. In the search world, this is nothing. So I guess it's going to come down to whether your company is at the point where they need to shave off 1-200ms for hundreds of dollars a month.
Second, I would wait and see how their reliability hashes out before I rely on them for any production services.
The search world is very big :) 5 mil records is nothing if you index logs (which is not a typical Algolia use case) but, for example, this is big from an e-commerce perspective.
In the e-commerce world, the difference between 2ms and 200ms isn't that big of a deal. Search relevance, however, might be something that is important. It looks like that is something they are focusing on heavily.
Your algorithm seems OK, but what was the "traditional approach" that you compared it to, and how did you compare them? It seems like you actually gain a lot from full document search (e.g. products with multi-paragraph descriptions). Otherwise, you might as well just do a SQL query to get your results.
Ahaha, speed is a huge deal for e-commerce. Tests by big merchants have already proven this well enough.
Search relevance is too! The key is to have them both at the same time :)
I think improving on relevance ranking configuration would be a big boost to this product as well as offering some ability to cross-search multiple indexes. Both are quite difficult problems to solve well in search, but if a simple API service was available that might be attractive for larger commercial customers.
The icing on the cake would be to have some support for relational (at least partially relational) data and multimedia / files. Good luck!
First of all, great job guys. The library support is fantastic (node.js, python, ruby, php, even a shell client). We are currently pushing our nginx logs to ElasticSearch, and were going to use ES for some new features on https://commando.io, but instead we will use algolia.
We usually recommend performing one query (one API call) per keystroke, starting from the first one. The actual number of calls depends a lot on the use case. Our ranking takes into account both relevance and popularity to suggest the best result first, which greatly reduces the number of letters you need to type.
In use-cases where there is a very strong popularity indicator, like the number of followers for TV shows, we usually get the correct result at the first keystroke (b -> breaking bad, d -> dexter). At the other extreme, you may need to type several words.
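The effect of popularity on the number of keystrokes can be simulated with a toy model: issue one lookup per typed character and stop when the wanted show comes out on top. The follower counts and function names below are made up for illustration, and the matching is a plain title-prefix check rather than the real engine's logic.

```python
from typing import Optional

# Hypothetical follower counts, used only as a popularity signal.
SHOWS = {
    "breaking bad": 900,
    "dexter": 800,
    "bones": 300,
    "dallas": 100,
}


def top_hit(prefix: str, shows: dict[str, int]) -> Optional[str]:
    """Most-followed show whose title starts with the typed prefix."""
    matches = [name for name in shows if name.startswith(prefix.lower())]
    return max(matches, key=shows.get, default=None)


def keystrokes_needed(target: str, shows: dict[str, int]) -> Optional[int]:
    """Simulate one query per keystroke until `target` is the first result."""
    for typed in range(1, len(target) + 1):
        if top_hit(target[:typed], shows) == target:
            return typed
    return None


for name in SHOWS:
    print(name, keystrokes_needed(name, SHOWS))
```

In this toy data set the two most-followed shows surface on the first keystroke, while the less popular ones sharing the same first letter need a second character to overtake them.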