I'm one of the pg_search maintainers. Hello! A few thoughts.

First, both strategies (the one outlined by the Neon/ParadeDB article and the one used here) are presented as viable alternatives by the Postgres docs: https://www.postgresql.org/docs/current/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX

Second, as the article correctly demonstrates, the problem with Postgres FTS isn't "how can I pick and optimize a single pre-defined query?" It's "how do I bring Postgres to Elastic-level performance across a wide range of real-world boolean, fuzzy, faceted, relevance-ranked, etc. queries?"

`pg_search` is designed to solve the latter problem, and the benchmarks were made to reflect that. You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:

- Created composite b-tree indexes for each of the queries with boolean predicates

- Extracted all the text fields from the JSONB columns, then stored and indexed them as separate columns for the queries against JSONB

But that's not realistic for many real-world use cases. `pg_search` doesn't require that: it's a single index definition that works for a variety of "Elastic-style" queries and Postgres types and doesn't ask the user to duplicate every text column.
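To make that concrete, the per-query tuning would look something like the sketch below (table and column names are hypothetical, not the benchmark's schema): one composite b-tree per boolean-predicate query, plus text fields pulled out of JSONB into their own indexed columns.

    -- Hypothetical schema: a composite b-tree tuned for a single boolean-predicate query...
    CREATE INDEX logs_country_severity_idx
        ON logs (country, severity, created_at DESC);

    -- ...and one text field extracted from JSONB so it can get its own tsvector/GIN index.
    ALTER TABLE logs
        ADD COLUMN message_text text
        GENERATED ALWAYS AS (metadata ->> 'message') STORED;

    CREATE INDEX logs_message_fts_idx
        ON logs USING gin (to_tsvector('english', message_text));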
> Mistake #1: Calculating tsvector On-the-Fly (Major issue)

I'm shocked that the original post being referred to made this mistake. I recently implemented Postgres FTS in a personal project, and did so by just reading the Postgres FTS documentation and following its instructions. The docs lead you through the process of creating the base *unoptimized* case and then optimising it, explaining the purpose of each step and why it's faster. It's really clear that this is what they're doing, and I can only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.
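For anyone who hasn't read that page: the progression the docs walk through looks roughly like this (using their pgweb example table), going from computing the tsvector per row at query time to a stored generated column with a GIN index on it.

    -- The unoptimized starting point: tsvector computed on the fly for every row checked.
    SELECT title
    FROM pgweb
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

    -- The optimized version the docs lead you to: a stored, generated tsvector column + GIN index.
    ALTER TABLE pgweb
        ADD COLUMN textsearchable_index_col tsvector
        GENERATED ALWAYS AS (
            to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
        ) STORED;

    CREATE INDEX textsearch_idx ON pgweb USING gin (textsearchable_index_col);

    SELECT title
    FROM pgweb
    WHERE textsearchable_index_col @@ to_tsquery('english', 'friend');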
I first used pg full text in around 2008. I've also used Solr and Elasticsearch to power search and recommendation in substantial products.

The issue I have had with postgres full-text search isn't that it's too slow, it's that it's too inflexible. It's a nice way to add simple search to fields, but poor if you want to tune the search at all. Even allowing for general substrings is too much to ask; even allowing for custom tokenization is too much to ask. There's no tokenization pipeline to speak of unless you want to write C extensions (which of course you can't do for hosted databases anyway). Solr and Elasticsearch let you set up very complex indexes and search processing via configuration. There's absolutely nothing that would prevent postgres from adopting a lot of this capability, but instead postgres offers literally NOTHING.

I get the impression that most of the developers for postgres full text haven't spent much time with other solutions; from previous discussions, they don't really understand what I mean when I talk about tokenization and filter setup, and they don't really understand why this is a deal-breaker even for very simple applications. Postgres just splits on whitespace (and lets you basically manually use stopwords and stemming, which is crap). There is really no way to concatenate fields in a clean way into a single index, which again makes it extremely annoying to work with. There's no way to score searches based on field weighting or really any other kind of weighting beyond BM. Compared to the alternatives it's a toy system.
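For context, a sketch of roughly the extent of the built-in customization: you can clone a configuration and remap token types to different dictionaries (synonyms, stopwords, stemmers), but the parser/tokenizer itself is fixed unless you write a C extension. Names below are hypothetical.

    -- Clone the built-in English config and route some token types through a synonym
    -- dictionary before stemming. Dictionary-level tweaks only; no analyzer pipeline.
    CREATE TEXT SEARCH CONFIGURATION my_english (COPY = pg_catalog.english);

    CREATE TEXT SEARCH DICTIONARY my_synonyms (
        TEMPLATE = synonym,
        SYNONYMS = my_synonyms   -- expects my_synonyms.syn under $SHAREDIR/tsearch_data/
    );

    ALTER TEXT SEARCH CONFIGURATION my_english
        ALTER MAPPING FOR asciiword, word
        WITH my_synonyms, english_stem;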
I wish there were some explain plans in either post, since I don't get what's going on.

If the query uses the index, then the on-the-fly tsvector rechecks happen only on the matches, and the benchmark queries have LIMIT 10, so there should be few rechecks, right?

Edit: yes, but the query predicates have conditions on two GIN indexes, so I guess the planner chooses to recheck all the matches for one index first, even though it could avoid a lot of work by rechecking row-wise.
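Something like the hypothetical query below is what I'd want to see EXPLAIN (ANALYZE, BUFFERS) for; the interesting parts would be whether the planner does a BitmapAnd over both GIN indexes and how large "Rows Removed by Index Recheck" gets on the Bitmap Heap Scan.

    -- Hypothetical table/columns, not the benchmark's actual schema.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, title
    FROM documents
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'postgres search')
      AND tags @> ARRAY['news']        -- assumes a second GIN index on the tags array
    ORDER BY created_at DESC
    LIMIT 10;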
I'm legitimately curious -- why do people want to put EVERYTHING into postgres? I don't understand this trend (vector search, full text search, workload orchestration, queues, etc.)
BTW, for anyone who's interested, I've packaged the pg_search and vchord_bm25 extensions as RPMs/DEBs:
<a href="https://pigsty.io/ext/fts/vchord_bm25" rel="nofollow">https://pigsty.io/ext/fts/vchord_bm25</a>
<a href="https://pigsty.io/ext/fts/pg_search" rel="nofollow">https://pigsty.io/ext/fts/pg_search</a>
In case anyone wants to benchmark them themselves. ;)
Glad to see more 'Postgres-native' full-text search implementations.

Alternative solutions (Lucene/Tantivy) are both designed around 'immutable segments' (indexing immutable files), so marrying them with the Postgres heap table would result in a worse solution.
10 million records is a toy dataset. Usually, you can fit it in memory on a laptop.

There are open large(-ish) text datasets, like full Wikipedia or pre-2022 Reddit comments, that would work much better for benchmarking.
Years ago I wanted to use native FTS (because of all the things mentioned; having to sync to an external system simply adds complexity) and it failed at another point.

Not completely surprising, but on a table with _potentially_ a couple of thousand inserts/second, it slowed down the overall updates to the point that transactions timed out.

We had already added an index for one of the columns we wanted to index and were running the statement for the second one. The moment the second index finished, we started to see timeouts from our system when writing to that table, transactions failing, etc.

We had to drop the indices again. So, sadly, we never got to the point of testing the actual FTS performance :/ I would have liked to test this, because we didn't necessarily have to search hundreds of millions of documents; due to customer tenants, searches would always be constrained to a few million _at most_.

ps: I already wrote about this -> https://news.ycombinator.com/item?id=27977526 . I never got a chance to try it again nowadays (newer versions of everything, newer hardware, etc.)
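For what it's worth, GIN write amplification can be traded for deferred cleanup with fastupdate and a larger pending list; a sketch with hypothetical names below, though whether that would have been enough at a couple of thousand inserts/second is another question.

    -- fastupdate batches new GIN entries into a pending list that is merged later
    -- (by autovacuum or when the list exceeds the limit), instead of on every insert.
    CREATE INDEX documents_body_fts_idx
        ON documents
        USING gin (to_tsvector('english', body))
        WITH (fastupdate = on, gin_pending_list_limit = 4096);  -- in kB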
I've seen a lot of teams jump straight to Elasticsearch or Meilisearch without realizing how much performance you can get out of native PG FTS when used properly.

Could we get similar performance in the browser using something like SQLite + FTS5 + Wasm? Seems like an interesting direction for offline-first apps...
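FTS5 compiled to Wasm should work for this; a minimal sketch (hypothetical schema), with the obvious caveat that the whole corpus has to live in the browser's storage.

    -- SQLite FTS5: a virtual table plus a BM25-ranked query.
    CREATE VIRTUAL TABLE docs_fts USING fts5(title, body);

    INSERT INTO docs_fts (title, body)
    VALUES ('Postgres FTS', 'Full-text search without an external engine');

    SELECT title, bm25(docs_fts) AS score
    FROM docs_fts
    WHERE docs_fts MATCH 'search'
    ORDER BY rank
    LIMIT 10;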
For text search we use the "rum" extension from postgrespro. Searching terabytes of PDFs takes < 1 sec. A talk here:

    https://github.com/jmscott/talk/blob/master/pgday-austin-20161112.pdf
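For anyone curious what that looks like, a rough sketch (assuming the extension is installed and a table with a stored tsvector column; schema is hypothetical): rum's distance operator lets the index hand back results already ordered by rank.

    CREATE EXTENSION IF NOT EXISTS rum;

    -- Assumes docs.tsv is a stored tsvector column.
    CREATE INDEX docs_tsv_rum_idx ON docs USING rum (tsv rum_tsvector_ops);

    -- The <=> operator orders by rank "distance" directly from the index.
    SELECT id, title
    FROM docs
    WHERE tsv @@ to_tsquery('english', 'invoice & overdue')
    ORDER BY tsv <=> to_tsquery('english', 'invoice & overdue')
    LIMIT 10;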
Off topic, but this kind of content marketing is excellent for any startup trying to get its name out by comparison with a well-known competitor, as is the case here with ParadeDB.