I'm one of the pg_search maintainers. Hello! A few thoughts.

First, both strategies (the one outlined by the Neon/ParadeDB article and the one used here) are presented as viable alternatives by the Postgres docs: https://www.postgresql.org/docs/current/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX

Second, as the article correctly demonstrates, the problem with Postgres FTS isn't "how can I pick and optimize a single pre-defined query?" It's "how do I bring Postgres to Elastic-level performance across a wide range of real-world boolean, fuzzy, faceted, relevance-ranked, etc. queries?"

`pg_search` is designed to solve the latter problem, and the benchmarks were made to reflect that. You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:

- Created composite b-tree indexes for each of the queries with boolean predicates

- Extracted all the text fields from the JSONB columns, then stored and indexed them as separate columns for the queries against JSONB

But that's not realistic for many real-world use cases. `pg_search` doesn't require that: it's a single index definition that works for a variety of "Elastic-style" queries and Postgres types and doesn't ask the user to duplicate every text column.
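To make that concrete, the per-query tuning would look something like the sketch below (table and column names are hypothetical, not the benchmark's schema): one composite b-tree per boolean-predicate query, plus text fields pulled out of JSONB into their own indexed columns.

    -- Hypothetical schema: a composite b-tree tuned for a single boolean-predicate query...
    CREATE INDEX logs_country_severity_idx
        ON logs (country, severity, created_at DESC);

    -- ...and one text field extracted from JSONB so it can get its own tsvector/GIN index.
    ALTER TABLE logs
        ADD COLUMN message_text text
        GENERATED ALWAYS AS (metadata ->> 'message') STORED;

    CREATE INDEX logs_message_fts_idx
        ON logs USING gin (to_tsvector('english', message_text));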
> Mistake #1: Calculating tsvector On-the-Fly (Major issue)

I'm shocked that the original post being referred to made this mistake. I recently implemented Postgres FTS in a personal project, and did so by just reading the Postgres FTS documentation and following its instructions. The docs lead you through the process of creating the base *unoptimized* case and then optimising it, explaining the purpose of each step and why it's faster. It's really clear that this is what they're doing, and I can only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.
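For anyone who hasn't read that page: the progression the docs walk through looks roughly like this (using their pgweb example table), going from computing the tsvector per row at query time to a stored generated column with a GIN index on it.

    -- The unoptimized starting point: tsvector computed on the fly for every row checked.
    SELECT title
    FROM pgweb
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

    -- The optimized version the docs lead you to: a stored, generated tsvector column + GIN index.
    ALTER TABLE pgweb
        ADD COLUMN textsearchable_index_col tsvector
        GENERATED ALWAYS AS (
            to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
        ) STORED;

    CREATE INDEX textsearch_idx ON pgweb USING gin (textsearchable_index_col);

    SELECT title
    FROM pgweb
    WHERE textsearchable_index_col @@ to_tsquery('english', 'friend');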
I first used pg full text in around 2008. I've also used Solr and Elasticsearch to power search and recommendation in substantial products.

The issue I have had with postgres full-text search isn't that it's too slow, it's that it's too inflexible. It's a nice way to add simple search to fields, but poor if you want to tune the search at all. Even allowing for general substrings is too much to ask; even allowing for custom tokenization is too much to ask. There's no tokenization pipeline to speak of unless you want to write C extensions (which of course you can't do for hosted databases anyway). Solr and Elasticsearch let you set up very complex indexes and search processing via configuration. There's absolutely nothing that would prevent postgres from adopting a lot of this capability, but instead postgres offers literally NOTHING.

I get the impression that most of the developers for postgres full text haven't spent much time with other solutions; from previous discussions, they don't really understand what I mean when I talk about tokenization and filter setup, and they don't really understand why this is a deal-breaker even for very simple applications. Postgres just splits on whitespace (and lets you basically manually use stopwords and stemming, which is crap). There is really no way to concatenate fields in a clean way into a single index, which again makes it extremely annoying to work with. There's no way to score searches based on field weighting or really any other kind of weighting beyond BM. Compared to the alternatives it's a toy system.
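For context, a sketch of roughly the extent of the built-in customization: you can clone a configuration and remap token types to different dictionaries (synonyms, stopwords, stemmers), but the parser/tokenizer itself is fixed unless you write a C extension. Names below are hypothetical.

    -- Clone the built-in English config and route some token types through a synonym
    -- dictionary before stemming. Dictionary-level tweaks only; no analyzer pipeline.
    CREATE TEXT SEARCH CONFIGURATION my_english (COPY = pg_catalog.english);

    CREATE TEXT SEARCH DICTIONARY my_synonyms (
        TEMPLATE = synonym,
        SYNONYMS = my_synonyms   -- expects my_synonyms.syn under $SHAREDIR/tsearch_data/
    );

    ALTER TEXT SEARCH CONFIGURATION my_english
        ALTER MAPPING FOR asciiword, word
        WITH my_synonyms, english_stem;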
I wish there were some explain plans in either post, since I don't get what's going on.

If the query uses the index, then the on-the-fly tsvector rechecks happen only on the matches, and the benchmark queries have LIMIT 10, so there should be few rechecks, right?

Edit: yes, but the query predicates have conditions on two GIN indexes, so I guess the planner chooses to recheck all the matches for one index first, even though it could avoid a lot of work by rechecking row-wise.
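Something like the hypothetical query below is what I'd want to see EXPLAIN (ANALYZE, BUFFERS) for; the interesting parts would be whether the planner does a BitmapAnd over both GIN indexes and how large "Rows Removed by Index Recheck" gets on the Bitmap Heap Scan.

    -- Hypothetical table/columns, not the benchmark's actual schema.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, title
    FROM documents
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'postgres search')
      AND tags @> ARRAY['news']        -- assumes a second GIN index on the tags array
    ORDER BY created_at DESC
    LIMIT 10;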
I'm legitimately curious -- why do people want to put EVERYTHING into postgres? I don't understand this trend (vector search, full text search, workload orchestration, queues, etc.)
BTW, for anyone who's interested, I've packaged the pg_search and vchord_bm25 extensions as RPMs/DEBs:
<a href="https://pigsty.io/ext/fts/vchord_bm25" rel="nofollow">https://pigsty.io/ext/fts/vchord_bm25</a>
<a href="https://pigsty.io/ext/fts/pg_search" rel="nofollow">https://pigsty.io/ext/fts/pg_search</a>
In case anyone wants to benchmark them themselves. ;)
Glad to see more 'Postgres-native' full-text search implementations.

Alternative solutions (Lucene/Tantivy) are both designed around 'immutable segments' (indexing immutable files), so marrying them with the Postgres heap table would result in a worse solution.
10 million records is a toy dataset. Usually, you can fit it in memory on a laptop.

There are open large(-ish) text datasets, like full Wikipedia or pre-2022 Reddit comments, that would work much better for benchmarking.
Years ago I wanted to use native FTS (because of all the things mentioned; having to sync to an external system simply adds complexity) and it failed at another point.

Not completely surprising, but on a table with _potentially_ a couple of thousand inserts/second, it slowed down the overall updates to the point that transactions timed out.

We had already added an index for one of the columns we wanted to index and were running the statement for the second one. The moment the second index finished, we started to see timeouts from our system when writing to that table, transactions failing, etc.

We had to drop the indices again. So, sadly, we never got to the point of testing the actual FTS performance :/ I would have liked to test this, because we didn't necessarily have to search hundreds of millions of documents; due to customer tenants, searches would always be constrained to a few million _at most_.

ps: I already wrote about this -> https://news.ycombinator.com/item?id=27977526 . I never got a chance to try it again nowadays (newer versions of everything, newer hardware, etc.)
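For what it's worth, GIN write amplification can be traded for deferred cleanup with fastupdate and a larger pending list; a sketch with hypothetical names below, though whether that would have been enough at a couple of thousand inserts/second is another question.

    -- fastupdate batches new GIN entries into a pending list that is merged later
    -- (by autovacuum or when the list exceeds the limit), instead of on every insert.
    CREATE INDEX documents_body_fts_idx
        ON documents
        USING gin (to_tsvector('english', body))
        WITH (fastupdate = on, gin_pending_list_limit = 4096);  -- in kB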
I've seen a lot of teams jump straight to Elasticsearch or Meilisearch without realizing how much performance you can get out of native PG FTS when used properly.

Could we get similar performance in the browser using something like SQLite + FTS5 + Wasm? Seems like an interesting direction for offline-first apps...
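FTS5 compiled to Wasm should work for this; a minimal sketch (hypothetical schema), with the obvious caveat that the whole corpus has to live in the browser's storage.

    -- SQLite FTS5: a virtual table plus a BM25-ranked query.
    CREATE VIRTUAL TABLE docs_fts USING fts5(title, body);

    INSERT INTO docs_fts (title, body)
    VALUES ('Postgres FTS', 'Full-text search without an external engine');

    SELECT title, bm25(docs_fts) AS score
    FROM docs_fts
    WHERE docs_fts MATCH 'search'
    ORDER BY rank
    LIMIT 10;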
For text search we use the "rum" extension from postgrespro. Searching terabytes of PDFs takes < 1 sec. A talk here:

    https://github.com/jmscott/talk/blob/master/pgday-austin-20161112.pdf
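For anyone curious what that looks like, a rough sketch (assuming the extension is installed and a table with a stored tsvector column; schema is hypothetical): rum's distance operator lets the index hand back results already ordered by rank.

    CREATE EXTENSION IF NOT EXISTS rum;

    -- Assumes docs.tsv is a stored tsvector column.
    CREATE INDEX docs_tsv_rum_idx ON docs USING rum (tsv rum_tsvector_ops);

    -- The <=> operator orders by rank "distance" directly from the index.
    SELECT id, title
    FROM docs
    WHERE tsv @@ to_tsquery('english', 'invoice & overdue')
    ORDER BY tsv <=> to_tsquery('english', 'invoice & overdue')
    LIMIT 10;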
Off topic, but this kind of content marketing is excellent for any startup trying to get its name out by comparison with a well-known competitor, as is the case here with ParadeDB.