Full text search over Postgres: Elasticsearch vs. alternatives

319 点作者 philippemnoel10 个月前

19 条评论

simonw10 个月前

The BM25 thing is actually a really big deal.BM25 is similar to TF/IDF. In both cases, the key idea is to consider statistics of the overall corpus as part of relevance calculations. If the user searches for "charities in new orleans" in a corpus where "new orleans" is only represented in a few documents, those should clearly rank highly. If the corpus has "new orleans" in almost every document then the term "charity" is more important.PostgreSQL FTS cannot do this, because it doesn't maintain statistics for word frequencies across the entire corpus. This severely limits what it can implement in terms of relevance scoring - each result is scored based purely on if the search terms are present or not.For comparison, SQLite FTS (which a lot of people are unaware of) actually does implement full index statistics, and SQLite FTS5 implements BM25 out of the box.

评论 #41174792 未加载

评论 #41174469 未加载

评论 #41175176 未加载

评论 #41178376 未加载

评论 #41177339 未加载

评论 #41174677 未加载

评论 #41181414 未加载

ashconnor10 个月前

Facets are possible in Postgres but it looks complex:<a href="https://web.archive.org/web/20200815141031/https://roamanalytics.com/2019/04/16/faceted-search-with-postgres-using-tsvector/" rel="nofollow">https://web.archive.org/web/20200815141031/https://roamanaly...</a>pg_search [0] not to be confused with the ruby gem of the same name [1]:[0] - <a href="https://github.com/paradedb/paradedb/tree/dev/pg_search#overview">https://github.com/paradedb/paradedb/tree/dev/pg_search#over...</a>[1] - <a href="https://github.com/Casecommons/pg_search">https://github.com/Casecommons/pg_search</a>

评论 #41174004 未加载

评论 #41174801 未加载

jillesvangurp10 个月前

If you care about search quality, postgres ends up being fairly limited. I've done stuff with it; it's alright for really simple stuff. But it's the wrong tool for the job generally.If the order of the search results matters to you, you might want something that gives you some more tools to control that. And if you are not measuring search quality to begin with, you probably don't care enough to even know that you are missing the tools to do a better job.I consult clients on this stuff professionally and I've seen companies do all sorts of silly shit. Mostly it's because they simply lack the in house expertise which is usually how they end up talking to me.I've actually had to sit clients down and explain them their own business model. Usually I come in for some technical problem and then end up talking to product managers or senior managers about stuff like this because usually the real problem is at that level. The technical issues are just a symptom.Here's a discussion I had with a client fairly recently (paraphrasing/exaggerating, obviously):"Me: So your business model is that your users find shit on your web site (it has a very prominent search box at the top) and then some transaction happens that causes you to make money? Customer: yes. Me: so you make more money if your search works better and users find stuff they want. Customer: yes, we want to make more money! Me: congratulations, you are a search company! Customer: LOL whut?! Me: So, why aren't you doing the kinds of things that other search companies do to ensure you maximize profit? Like measuring how good your search is or generally giving a shit whether users can actually find what they are looking for. Customer: uhhhhh ???? Me: where's your search team? Customer: oh we don't have one, you do it! Me: how did you end up with what you currently have. Customer: oh that guy (some poor, overworked dev) over there picked solution X 3 years ago and we never gave it a second thought."Honestly, some companies can't be helped and this was an example of a company that was kind of hopelessly flailing around and doing very sub optimal things at all levels in the company. And wasting lots of time and money in the process. Not realizing your revenue and competitiveness are literally defined by your search quality is never a good sign. You take different decisions if you do.

评论 #41179341 未加载

samsk10 个月前

For internet user facing full-text search I would always prefer to use a separate tool and not a SQL database, because- the fulltext tool, can and should hold only 'active' data- as it has only active data, data size is usually much much smaller- as data size is smaller, it better fits in RAM- as data size is smaller, it can be probably run on poorer HW the full ACID db- as the indexed data are mostly read-only, the VM where it runs can be relatively easily cloned (never seen a corruption till now)- as FTS tools are usually schema-less, there is no outage during schema changes (compared to doing changes in ACID db)- as the indexed data are mostly read-only, the can be easily backup-ed- as the backups are smaller, restoring a backup can be very fast- and there is no such thing as database upgrade outage, you just spin a new version, feed it with new data and than change the backends- functionality and extensibilityThere is probably more, but if one doesn't needs to do a fulltext search on whole database (and you usually don't), than its IMHO better to use separate tool, that doesn't comes with all the ACID constraints. Probably only downside is that you need to format data for the FTS and index them, but if you want run a serious full-text search, you will have to take almost the same steps in the database.On a 15y old side project, I use SOLR for full-text search, serving 20-30k/request per day on a cheap VM, and PostgreSQL is used as primary data source. The PostgreSQL has had several longer outages - during major upgrades, because of disk corruption, because of failed schema migrations, because of 'problems' between the chair and keyboard etc... During that outages the full-text search always worked - it didn't had most recent data, but most users probably never noticed.

评论 #41174772 未加载

评论 #41174815 未加载

radpanda10 个月前

> Because Elasticsearch is not a reliable data store, organizations that use Postgres typically extract, transform, and load (ETL) data from Postgres to ElasticsearchI’ll admit haven’t kept up with this but is it still the case that Elasticsearch is “not a reliable data store”?I remember there used to be a line in the Elasticsearch docs saying that Elasticseach shouldn’t be your primary data store or something to that effect. At some point they removed that verbiage, seemingly indicating more confidence in their reliability but I still hear people sticking with the previous guidance.

评论 #41174354 未加载

评论 #41174846 未加载

评论 #41175728 未加载

评论 #41173931 未加载

评论 #41177822 未加载

评论 #41178597 未加载

sgift10 个月前

Solr. The correct answer is Solr. All the search features of Elasticsearch (both are built on Lucene, which provides the actual search implementation) without the problems that Elastic (the company) or Elasticsearch (the product) brings with it. 99% of companies using Elasticsearch would be far better of with Solr.

评论 #41179866 未加载

评论 #41180719 未加载

amai10 个月前

Alternatives to both are<a href="https://www.meilisearch.com/" rel="nofollow">https://www.meilisearch.com/</a><a href="https://typesense.org/" rel="nofollow">https://typesense.org/</a>and maybe<a href="https://github.com/Sygil-Dev/whoosh-reloaded">https://github.com/Sygil-Dev/whoosh-reloaded</a>

评论 #41177174 未加载

评论 #41178906 未加载

评论 #41210447 未加载

评论 #41180018 未加载

glintik10 个月前

Postgres has very limited features related to FTS, so it is useful only for very simple cases or for very limited budget. Elasticsearch is stable enough(but good devops and devs needed) and has numerous of features. Anyway Elasticsearch is not a "plug and play" solution and requires much work to be done for good search quality. Why I know this all? I'm running ecommerce search startup, based on Elasticsearch as primary storage.

alanwli10 个月前

Always great to see Postgres-based alternatives.One clarification question - the blog post lists "lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging" as the bad for ElasticSearch. What is pg_bm25's consistency model? It had been mentioned previously as offering "weak consistency" [0], which I interpret to have the same problems with transactions, MVCC, etc?[0]: <a href="https://news.ycombinator.com/item?id=37864089">https://news.ycombinator.com/item?id=37864089</a>

评论 #41177666 未加载

dewey10 个月前

I've had good experiences with both Postgres + ES and also Postgres + Meilisearch. The latter one I only used for small side projects but it was very easy to get running (With a RoR app).

coding12310 个月前

The analytical part is what usually makes the Elasticsearch (or opensearch) the choice - it's not that you have to search just the text, its that with that text you need to show 5 different categories some of the top values and the number of occurrences and permutations thereof if the user decides to clicky clicky and filter it more.

评论 #41178022 未加载

sideway10 个月前

Semi-related: I guess some companies have already started augmenting their search with the use of LLMs. If you've worked on a similar project, what is your (very) high-level architecture? Did you see a noticeable difference in relevance and query time?

评论 #41180011 未加载

killme200810 个月前

I believe the primary issue with using full-text search in PostgreSQL is its performance and scalability limitations.Elasticsearch/OpenSearch offers more than just search functionality; they come with extensive ecosystems, including ELK for logging and processors for ETL pipelines. These platforms are powerful and provide out-of-the-box solutions for developers. However, as the article mentioned, scaling them up or out can be costly.ParadeDB looks like is based on tantivy(<a href="https://github.com/quickwit-oss/tantivy">https://github.com/quickwit-oss/tantivy</a>) which is an impressive project. We leverage it to implement full-text indexing for GreptimeDB too (<a href="https://github.com/GreptimeTeam/greptimedb">https://github.com/GreptimeTeam/greptimedb</a>). Nevertheless, building full-text indexes with Tantivy is still resource-intensive. This is why Grafana Loki, which only indexes tags instead of building full-text indexes, is popular in the observability space. In GreptimeDB, we offer users the flexibility to choose whether to build full-text indexes for text fields, while always creating inverted indexes for tags.

demilich10 个月前

<a href="https://github.com/infiniflow/infinity">https://github.com/infiniflow/infinity</a>, dense vector + sparse vector + fulltext search(BM25) + late interact reranker(Colbert)

j4510 个月前

Solr still seems to be an option

评论 #41176884 未加载

评论 #41176804 未加载

Sytten10 个月前

AWS RDS support is key IMO, I have done a lot of consulting on infrastructure and most clients I had don't want to have third party services and don't want to manage their DBs.

评论 #41185190 未加载

haolez10 个月前

Are there success cases of using ElasticSearch/OpenSearch as the primary data store in production? Just curious.

评论 #41178026 未加载

izietto10 个月前

Is this distributed as a PostgreSQL extension as well?

评论 #41180718 未加载

corytheboyd10 个月前

> We’ve talked to many companies who have tried and regretted their decision to use Elasticsearch as their primary data storeoh no…