
Postgres full text search is good enough

416 points by chdir over 10 years ago

25 comments

birken over 10 years ago

Well, the article is right in one sense: Postgres full text search is probably better than you think it would be, and is very usable at non-massive scale.

However, I still wouldn't recommend using it. The queries and indexes are all unintuitive, complex, not very readable, and require learning a bunch of new syntax anyway (as you can see in the examples in the blog post). So take the effort you would spend doing that and instead learn something with more features that will keep running no matter how large your scale.

While the blog mentions elasticsearch/solr (which are awesome and powerful), it doesn't mention sphinx (http://sphinxsearch.com/). If I were trying to make a simple search engine for my postgres/mysql-powered site, I'd use sphinx. The indexer can be fed directly with a postgres/mysql query, and it includes simple APIs for all languages to interface with the searcher daemon. And of course it is super fast, has way more features, and can scale larger than you'd ever need (sphinx powers craigslist's search).
rachbelaid over 10 years ago

As the author, it's maybe worthwhile that I explain the goals of this post.

The first goal is to use your current architecture to solve small/medium search needs, helping your project introduce search without adding components to your architecture. It's not a long-run solution if your application revolves around search. I'm hoping that with the introduction of JSONB in 9.4, the idea of using Postgres as a document store brings more attention to the search features, which can still be improved. But for an out-of-the-box feature, I think the Postgres community did an amazing job. You can get something working with Postgres in a matter of hours if you are already using it (and are comfortable with PG), but if you have never used SOLR/ElasticSearch it will take you much longer to introduce it into your project and get started (ops, document sync, getting familiar with the query language, ...).

The second goal is to introduce full-text search concepts. The post tries to guide the reader from nothing to a quite decent full text search in English/French (I cannot give feedback on other languages).

The third, which is probably the least clear, is that people sometimes still use MySQL search, which is IMHO a horrible search solution. I think this happens because some web frameworks like Django provide easy access to match in the ORM. In this context, the post also aims to give MySQL users some insight into what can be done with Postgres and make them more aware of its features.

If you are interested in the topic then I suggest you have a look at the amazing posts from Tim van der Linden, who did an amazing job of going into more detail on the subject.

http://shisaa.jp/postset/postgresql-full-text-search-part-1.html
http://shisaa.jp/postset/postgresql-full-text-search-part-2.html
http://shisaa.jp/postset/postgresql-full-text-search-part-3.html

Postgres full-text search is not the silver bullet of search, but depending on your needs, it's maybe good enough ;)
pilif over 10 years ago

While it's certainly good enough for English bodies of text, its half-assed support for compounds is a problem for languages that use compounds (German, for example).

There is sorta-support for compounds, but only really for ispell dictionaries, and the ispell format isn't very good at dealing with compounds (you have to declare all permutations and manually flag them as compoundable). Plus, the world overall has moved to hunspell, so even just getting an ispell dictionary is tricky.

As a reminder: this is about finding the "wurst" in "weisswürste", for example.

Furthermore, another problem is that the decision whether the output of a dictionary module should be chained into the next module specified is up to the C code of the module itself, not part of the FTS config.

This bit me, for example, when I wanted to have a thesaurus match for common misspellings and colloquial terms which I then wanted to feed further into the dictionary, again for compound matching.

Unfortunately, the thesaurus module is configured as a non-chainable one, so once there's a thesaurus match, that's what ends up in the index. No chance of ever also looking it up in the dictionary.

Changing this requires changing the C code and subsequently deploying your custom module, all of which is certainly doable but also additional work you might not want to have to do.

And finally: if you use a dictionary, keep in mind that it's by default not shared between connections. That means you pay the price of loading that dictionary whenever you use the text search engine for the first time over a connection.

For smaller dictionaries, this isn't significant, but due to the compound handling in ispell, you'll have huge-ass(tm) dictionaries. Case in point is ours, which is about 25 megs in size and costs 0.5 seconds of load time on an 8-drive 15K RAID10 array.

In practice (i.e. whenever you want to respond to a search in less than 0.5 secs, which is, I would argue, always), this forces you to either use some kind of connection pooling (which can have other side effects for your application), or use the shared_ispell extension (http://pgxn.org/dist/shared_ispell), though I've seen crashes in production when using that, which led me back to pgbouncer.

Aside from these limitations (neither of which will apply to you if you have an English body of text, because searching these works without a dictionary to begin with), yes, it works great.

(edited: added a note about chaining dictionary modules)
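The "wurst in weisswürste" problem can be sketched in a few lines: fold case and diacritics, then look for the part inside the compound. This is a hypothetical toy for illustration, not how Postgres dictionaries work; real compound splitting needs a dictionary of valid word parts, which is exactly what the ispell/hunspell machinery discussed above provides.

```python
import unicodedata

def fold(text):
    # Lowercase and strip diacritics: "weisswürste" -> "weisswurste".
    nfd = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in nfd if not unicodedata.combining(c))

def contains_compound_part(word, part):
    # Naive compound check by folded substring match. A real splitter
    # must verify that the remaining pieces are valid words too.
    return fold(part) in fold(word)

contains_compound_part("weisswürste", "wurst")  # True
```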
brianhempel over 10 years ago

For full text search, the only global frequency information Postgres uses is the stop-word list. It does not do TF-IDF ranking. [1][2]

For example, if you search for "Bob Peterson", Postgres will rank these two documents the same:

"I saw Bob."

"I saw Peterson."

In contrast, an IDF-aware search would notice that "Peterson" occurs in fewer documents than "Bob" and score "I saw Peterson" higher for that reason.

[1] http://en.wikipedia.org/wiki/Tf%E2%80%93idf
[2] http://stackoverflow.com/questions/18296444/does-postgresql-use-tf-idf
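The difference can be sketched in a few lines of Python. This is a toy illustration of IDF weighting, not Postgres's actual ranking code:

```python
import math

def tokenize(text):
    return [w.strip(".,!?").lower() for w in text.split()]

def idf_scores(query, docs):
    # Score each doc by summing the IDF of the query terms it contains:
    # rarer terms contribute more, which plain ts_rank ignores.
    n = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]

    def df(term):  # number of documents containing the term
        return sum(1 for ts in token_sets if term in ts)

    return [
        sum(math.log(n / df(t)) for t in tokenize(query) if t in ts)
        for ts in token_sets
    ]

docs = ["I saw Bob.", "I saw Peterson.", "Bob called.", "Bob waved again."]
scores = idf_scores("Bob Peterson", docs)
# "peterson" appears in 1 of 4 docs, "bob" in 3 of 4,
# so "I saw Peterson." outranks "I saw Bob."
```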
petergeoghegan over 10 years ago

I'm surprised that there is no mention of the recent big improvements in the GIN infrastructure that Postgres full text search uses. The 9.4 release notes list:

* Reduce GIN index size (Alexander Korotkov, Heikki Linnakangas)

* Improve speed of multi-key GIN lookups (Alexander Korotkov, Heikki Linnakangas)
Argorak over 10 years ago

The most important sentence: "Probably not if your core business needs revolve around search."

Postgres full text search is very good, but once you get into the realms where Elasticsearch and SOLR really shine (complex scoring based on combinations of fields, temporal conditions or multiple passes, all that with additional faceting, etc.), trying to rebuild all that on top of Postgres will be a pain.

While that doesn't break the article, it runs into a nasty problem: `unaccent` doesn't handle denormalized accents.

    # SELECT unaccent(U&'\0065\0301');
     unaccent
    ----------
     é
    (1 row)

(That problem is also present in Elasticsearch if you forget to configure the analyzer to normalize properly before unaccenting.)
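In Unicode terms, the composed and decomposed forms of "é" are different strings, and normalizing before accent-folding fixes it. A short sketch of the idea (not the unaccent extension itself):

```python
import unicodedata

composed = "\u00e9"          # é as one precomposed code point
decomposed = "\u0065\u0301"  # 'e' followed by a combining acute accent

# The two render identically but are different strings, which is why
# a naive accent-folder can miss the decomposed form.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

def strip_accents(text):
    # Normalize to NFD first, then drop combining marks, so both
    # composed and decomposed input fold to a plain "e".
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if not unicodedata.combining(c))
```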
rboling91 over 10 years ago

Shouldn't this query:

    SELECT to_tsvector(post.title) ||
           to_tsvector(post.content) ||
           to_tsvector(author.name) ||
           to_tsvector(coalesce(string_agg(tag.name, ' '), '')) as document
    FROM post
    JOIN author ON author.id = post.author_id
    JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id
    JOIN tag ON tag.id = posts_tags.tag_id
    GROUP BY post.id, author.id;

be rewritten as:

    SELECT to_tsvector(post.title) ||
           to_tsvector(post.content) ||
           to_tsvector(author.name) ||
           to_tsvector(coalesce(string_agg(tag.name, ' '), '')) as document
    FROM post
    JOIN author ON author.id = post.author_id
    JOIN posts_tags ON posts_tags.post_id = post.id
    JOIN tag ON tag.id = posts_tags.tag_id
    GROUP BY post.id, author.id;
yror10 over 10 years ago

If you're working with a Ruby/Rails app, check out the pg_search gem: https://github.com/Casecommons/pg_search

It allows you to do the things in this article with ActiveRecord models.
mihai_ionic over 10 years ago

One use case where Postgres FTS fails spectacularly is when your search has to return results ordered by some field (e.g., timestamp). This is critical for applications that have to return close-to-realtime results due to the queried content being time-sensitive.

Neither GIN nor GiST indices support such document ordering. For a top-N query of the most recent results, the entire result set has to be processed. With common domain-specific words, this can be more than 50% of the entire document set. As you can imagine, this is insanely expensive. When you are in this situation, it helps to set gin_fuzzy_search_limit to a conservative number (20000 works for me, less if you expect heavy traffic) in postgresql.conf, so that pathological queries eventually finish without processing every result. Result quality will take a hit, because many documents are skipped over.

If you need any type of ordered search on more than a hundred thousand documents, do yourself a favor and use something other than Postgres.

I'm not sure what search back-end Wikipedia is using, but it seems like they are not quite immune to this problem either: https://en.wikipedia.org/w/index.php?search=a+the
rhema over 10 years ago

I really wish I had some benchmarks here. Postgres can be crazy fast, but can it compete with Elasticsearch? How long would a typical query take across 1,000,000 documents?
hendry over 10 years ago

Postgres, elasticsearch and SOLR all seem pretty heavyweight to me.

What's the lightweight approach to create an inverted index nowadays, with some basic stemming?
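For genuinely small corpora, a usable inverted index fits in a page of code. A toy sketch of the idea, where the crude suffix-stripping stands in for a real stemmer such as Snowball:

```python
from collections import defaultdict

# Crude suffix stripping; a stand-in for a proper stemming algorithm.
SUFFIXES = ("ing", "ers", "er", "es", "s")

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

class TinyIndex:
    """Minimal in-memory inverted index: stemmed term -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for w in text.lower().split():
            self.postings[stem(w.strip(".,!?"))].add(doc_id)

    def search(self, query):
        # AND semantics: intersect the posting sets of all query terms.
        terms = [stem(w.lower()) for w in query.split()]
        sets = [self.postings.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

idx = TinyIndex()
idx.add(1, "Searching with stemmed tokens")
idx.add(2, "Search results ranked later")
idx.add(3, "Unrelated document")
idx.search("searches")  # both "searching" and "search" stem to "search"
```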
elchief over 10 years ago

Text search is OK in Postgres. Better than MySQL. Not as good as Solr, but you don't have to sync with Solr if you use it.

My main complaint is that 'united states' @@ 'united states of america' is false, i.e. no token overlap similarity.

You can overcome this with the smlar extension. Have requested smlar integration in 9.5.
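The token-overlap similarity being asked for here is essentially a Jaccard score over word sets, a short sketch of the idea (the smlar extension implements this kind of similarity properly inside Postgres):

```python
def jaccard(a, b):
    # Token-overlap similarity over word sets; 1.0 means identical sets.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

jaccard("united states", "united states of america")
# 0.5: two shared tokens out of four distinct tokens overall
```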
ipmb over 10 years ago

We're using it on our IRC logging service, https://botbot.me/, and agree with the author's sentiments. It's not perfect, but it keeps our infrastructure simple and is very fast over 20M+ rows.
resca79 over 10 years ago

I love this article, especially when I'm thinking about premature optimization. Having a good full-text search while adding zero components to the platform is great. When Postgres search is no longer enough, I'll be happy to add Elasticsearch, Solr, and so on.
bshimmin over 10 years ago

It's awesome what you can do with Postgres, undeniably, but it seems to me that if you're building a Rails app, for instance, it's an absolute no-brainer to use the brilliant Sunspot gem with Solr. Yes, you have the moderate unpleasantness of setting up Solr and fiddling about with Java and XML files, but it's all pretty painless after that.

I don't know if similarly nice things exist for other web frameworks.

As others have said, what's described in the OP may well be "good enough" for many purposes, but it seems like quite a lot of effort to get there.
hiharryhere over 10 years ago

Full text search is good when you have big passages of text to search, but it isn't good for typos on short fields. The tolerances just aren't configurable enough.

I was using it to search across user profiles, and while the fuzzy search was great for a slab of text in, say, an "about me" section, it wasn't good at all for short bits of text like names. It's just too fuzzy for short blobs.
rom16384 over 10 years ago

At least for our use case, Postgres full text search wasn't good enough. Indexing and searching 1M records was troublesome, and its advanced queries weren't as flexible as we'd like, so after a while we started using Solr. While it was better, it still had its share of problems, so we are currently using ElasticSearch.
rorydh over 10 years ago

Here's a nice PDF on some of the great performance you can get out of it: https://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf
chdir over 10 years ago

If you've used both FTS in Postgres and either ElasticSearch or Solr recently: objectively, what are some data points when you need to switch over? E.g. more than 'x' values to index, index update time, or perhaps features like prefix/substring/phrase search, lack of facets, etc.
tim333 over 10 years ago
For knocking up a site quickly with a search button I wonder how this compares with using Google custom search? I guess with Postgres you can cover stuff that Google has not indexed?
jcroll over 10 years ago

For the record: MySQL has full-text search already for MyISAM tables and *will* have full-text for InnoDB in 5.6.
chatman over 10 years ago
Faceting, spell correction, sloppy search, etc. are tricky to implement in Postgres.
fiatjaf over 10 years ago
Swish-e is good enough.
seivan over 10 years ago

I wrote a half-assed full text search in Redis once. Want to go back and work on it some more, just for kicks.
iamleppert over 10 years ago

If you're going to do something, do it right, or leave search out of your product if you aren't willing to invest the effort and energy required to implement a proper solution.

Those SQL queries are hideous. Because of many developers like this, most people have learned to avoid search features, because they're next to useless. Please do us a favor and leave it out of your product if you aren't willing to invest in a proper solution, not one that is "good enough".