I checked the benchmarks and was surprised to see that native search is (a) so slow (seconds), and (b) demonstrating O(N) behavior – with indexing, it should not happen at all.<p>Indeed, looking at the benchmark source code (thanks for providing it!), it completely lacks index for the native case, leading to a false statement the that native full-text search indexes Postgres provides (usually GIN indexes on tsvector columns) are slow.<p><a href="https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e9736bb3e8f17dcf659c0a1/benchmarks/benchmark-tsquery.sh#L77">https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e...</a> – here the tsvector is being built. But this is not an index. You need CREATE INDEX ... USING gin(search_vector);<p>This mistake could be avoided if bencharks included query plans collected with EXPLAIN (ANALYZE, BUFFERS). It would quickly become clear that for the "native" case, we're dealing with SeqScan, not IndexScan.<p>GINs are very fast. They are designed to be very fast for search – but they have a problem with slower UPDATEs in some cases.<p>Another point, fuzzy search also exists, via pg_trgm. Of course, dealing with these things require understanding, tuning, and usually a "lego game" to be played – building products out of the existing (or new) "bricks" totally makes sense to me.
Blog post author and one of the pg_bm25 contributors here. Super excited to see the interest in pg_bm25!<p>pg_bm25 is our first step in building an Elasticsearch alternative on Postgres. We built it as a result of working on hybrid search in Postgres and becoming frustrated with Postgres' sparse feature set when it comes to full text search.<p>To address a few of the discussion points, today pg_bm25 can be installed on self-hosted Postgres instances. Managed Postgres providers like RDS are pretty restrictive when it comes to the Postgres extension ecosystem, which is why we're currently working on a managed Postgres database called ParadeDB which comes with pg_bm25 preinstalled. It'll be available in private beta next week and there's a waitlist on our website (<a href="https://www.paradedb.com/">https://www.paradedb.com/</a>).
With an AGPL license, does that make it unlikely to be included in hosted environments like RDS?<p>My understanding of the spirit of the license is that it should be fine as long as modifications are made available. Anyone know of any existing extensions in RDS that are AGPL?
pgrx is one of the greatest enabling innovations in the PG ecosystem in a long time.<p>Awesome to see so many high quality extensions come out of it.<p><a href="https://github.com/pgcentralfoundation/pgrx">https://github.com/pgcentralfoundation/pgrx</a>
Hey guys. Congratulations - this is an exciting development. Can you show some benchmarks around showing the count of matches -- `select count(<i>) from table where text match is there`?<p>This was the top reason that made us (Segmed.ai) give up on PostgreSQL FTS -- our folks require a very exact count of matches for medical conditions that are present in 20M reports. And doing COUNT(</i>) in PostgreSQL was crazy, crazy slow. If your extension could do simple len(invertedindex[word]) that would already be a great improvement.<p>ELK has it immediately, but at a cost of being one more thing to maintain, and the whole Logstash thing is clunky. I'd love to use FTS inside of PostgreSQL.
What kind of "consistency" do bm25 indexes offer? e.g. I think ElasticSearch is eventually consistent and is constantly indexing in the background and classic Postgres GIN indexes have configuration like `gin_pending_list_limit` and `fastupdate` functionality to avoid slowdowns on insertions (and then you get slowdowns when an insert hits the threshold and triggers the catch-up indexing).
Seems really really cool. Is this a full DB, as in they have to take PG source, put in tantivy and their sauce, compile, and distribute? Or is this an extension? If it's the latter, what's the point of putting DB at the end of the name?
Does this also cover some kind of facetted search? (Counting the different colored and sized t-shirt) in an efficient way? As that is also a large part that elastic can do but PostgreSQL isn't very good at.
Interesting that you guys are the same people behind Whist. I once interviewed there at your behest, and never heard back. It seems like that venture fizzled out?
Is it possible to use this for hybrid search in combination with pg_embedding? My understanding is that hybrid search currently requires syncing with Postgres
This is very exciting. BM25 in Postgres will enable really nice search experiences to be built in projects where Elasticsearch is just too much complexity.
looks like a cool project <a href="https://github.com/paradedb/paradedb">https://github.com/paradedb/paradedb</a>