Pg_bm25: Elastic-Quality Full Text Search Inside Postgres

206 pointsby billwashereover 1 year ago

19 comments

I checked the benchmarks and was surprised to see that native search is (a) so slow (seconds), and (b) demonstrating O(N) behavior – with indexing, it should not happen at all.Indeed, looking at the benchmark source code (thanks for providing it!), it completely lacks index for the native case, leading to a false statement the that native full-text search indexes Postgres provides (usually GIN indexes on tsvector columns) are slow.<a href="https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e9736bb3e8f17dcf659c0a1/benchmarks/benchmark-tsquery.sh#L77">https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e...</a> – here the tsvector is being built. But this is not an index. You need CREATE INDEX ... USING gin(search_vector);This mistake could be avoided if bencharks included query plans collected with EXPLAIN (ANALYZE, BUFFERS). It would quickly become clear that for the "native" case, we're dealing with SeqScan, not IndexScan.GINs are very fast. They are designed to be very fast for search – but they have a problem with slower UPDATEs in some cases.Another point, fuzzy search also exists, via pg_trgm. Of course, dealing with these things require understanding, tuning, and usually a "lego game" to be played – building products out of the existing (or new) "bricks" totally makes sense to me.

评论 #37812370 未加载

评论 #37815218 未加载

评论 #37812378 未加载

评论 #37814032 未加载

retakemingover 1 year ago

Blog post author and one of the pg_bm25 contributors here. Super excited to see the interest in pg_bm25!pg_bm25 is our first step in building an Elasticsearch alternative on Postgres. We built it as a result of working on hybrid search in Postgres and becoming frustrated with Postgres' sparse feature set when it comes to full text search.To address a few of the discussion points, today pg_bm25 can be installed on self-hosted Postgres instances. Managed Postgres providers like RDS are pretty restrictive when it comes to the Postgres extension ecosystem, which is why we're currently working on a managed Postgres database called ParadeDB which comes with pg_bm25 preinstalled. It'll be available in private beta next week and there's a waitlist on our website (<a href="https://www.paradedb.com/">https://www.paradedb.com/</a>).

评论 #37813389 未加载

评论 #37817697 未加载

phamiltonover 1 year ago

With an AGPL license, does that make it unlikely to be included in hosted environments like RDS?My understanding of the spirit of the license is that it should be fine as long as modifications are made available. Anyone know of any existing extensions in RDS that are AGPL?

评论 #37811925 未加载

评论 #37811793 未加载

评论 #37811776 未加载

hardwaresoftonover 1 year ago

pgrx is one of the greatest enabling innovations in the PG ecosystem in a long time.Awesome to see so many high quality extensions come out of it.<a href="https://github.com/pgcentralfoundation/pgrx">https://github.com/pgcentralfoundation/pgrx</a>

评论 #37812429 未加载

评论 #37815062 未加载

wkoszekover 1 year ago

Hey guys. Congratulations - this is an exciting development. Can you show some benchmarks around showing the count of matches -- `select count() from table where text match is there`?This was the top reason that made us (Segmed.ai) give up on PostgreSQL FTS -- our folks require a very exact count of matches for medical conditions that are present in 20M reports. And doing COUNT() in PostgreSQL was crazy, crazy slow. If your extension could do simple len(invertedindex[word]) that would already be a great improvement.ELK has it immediately, but at a cost of being one more thing to maintain, and the whole Logstash thing is clunky. I'd love to use FTS inside of PostgreSQL.

评论 #37813868 未加载

评论 #37814309 未加载

machtyover 1 year ago

What kind of "consistency" do bm25 indexes offer? e.g. I think ElasticSearch is eventually consistent and is constantly indexing in the background and classic Postgres GIN indexes have configuration like `gin_pending_list_limit` and `fastupdate` functionality to avoid slowdowns on insertions (and then you get slowdowns when an insert hits the threshold and triggers the catch-up indexing).

评论 #37864089 未加载

mkleczekover 1 year ago

This is really exciting and I hope to try it out at my company ASAP.

iamdanieljohnsover 1 year ago

Seems really really cool. Is this a full DB, as in they have to take PG source, put in tantivy and their sauce, compile, and distribute? Or is this an extension? If it's the latter, what's the point of putting DB at the end of the name?

评论 #37815265 未加载

ckokover 1 year ago

Does this also cover some kind of facetted search? (Counting the different colored and sized t-shirt) in an efficient way? As that is also a large part that elastic can do but PostgreSQL isn't very good at.

antmanover 1 year ago

An important step, could be a good combination with pg_vector if they are fast enough

评论 #37810808 未加载

tristan957over 1 year ago

Interesting that you guys are the same people behind Whist. I once interviewed there at your behest, and never heard back. It seems like that venture fizzled out?

rawshover 1 year ago

Is it possible to use this for hybrid search in combination with pg_embedding? My understanding is that hybrid search currently requires syncing with Postgres

评论 #37813504 未加载

anon373839over 1 year ago

This is very exciting. BM25 in Postgres will enable really nice search experiences to be built in projects where Elasticsearch is just too much complexity.

estover 1 year ago

looks like a cool project <a href="https://github.com/paradedb/paradedb">https://github.com/paradedb/paradedb</a>

aiunboxedover 1 year ago

I wonder how do legacy search players like elastic / solr compete against the new age startups combining semantic and regular search ?

评论 #37811716 未加载

评论 #37811302 未加载

评论 #37811398 未加载

评论 #37814732 未加载

评论 #37812497 未加载

评论 #37811656 未加载

canadiantimover 1 year ago

ParadeDB and the work they’re doing with this extension is incredibly exciting. Love to see it.

eclectic29over 1 year ago

Is BM25 still used by "modern" search engines? I wasn't aware.

mugivarra69over 1 year ago

is this better than lucene

评论 #37812906 未加载

stopmanover 1 year ago

Excited to give this a try.