Nixiesearch: Running Lucene over S3, and why we're building a new search engine

128 points by shutty 8 months ago

21 comments

Both Elastic and OpenSearch also have S3-based stateless versions of their search engines in the works. The Elastic one is currently available in early access. It would be interesting to see how this one improves on both approaches.

With all the licensing complexities around Elastic, more choice is not necessarily bad.

The tradeoff with using S3 is indexing latency (the time between a write being accepted and becoming visible via search) vs. easy scaling. The default refresh interval (the time the search engine waits before making changes to an index visible to search) is 1 second. That means it takes up to 1 second before indices get updated with recently added data. A common performance tweak is to increase this to 5 or more seconds. That reduces the number of writes and can improve write throughput, which is helpful when you are writing lots of data.

If you need low latency (anything where users might want to "read" their own writes), clustered approaches are more flexible. If you can afford to wait a few seconds, using S3 to store stuff becomes more feasible.

Lucene internally stores documents in segments. Segments are append-only, and there tend to be cleanup activities related to rewriting and merging segments to, e.g., get rid of deleted documents or deal with fragmentation. Once written, having some jobs to merge segments in the background isn't that hard. My guess is that with S3, the trick is to gather up whatever amount of writes, store them as one segment, and put that in S3.

S3 is not a proper file system, and file operations are relatively expensive (compared to a file system) because they are essentially REST API calls. So this favors use cases where you write segments in bulk and never or rarely update or delete individual things that you write, because that would require updating a segment in S3, which means deleting and rewriting it and then somehow notifying other nodes that they need to re-read that segment.

For both Elasticsearch and OpenSearch, log data or other time series data fits this very well because you typically don't have to deal with deletes/updates.
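The batching idea guessed at above (gather writes, then store them as one segment) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not how Nixiesearch, Elasticsearch, or OpenSearch actually implement it: the `BatchingSegmentWriter` class, the `visible_docs` helper, the JSON segment format, and the plain dict standing in for an S3 bucket are all hypothetical.

```python
import json
import time


class BatchingSegmentWriter:
    """Buffers documents and flushes them as one immutable segment object.

    Hypothetical sketch: `store` is a plain dict standing in for an S3 bucket.
    """

    def __init__(self, store, refresh_interval_s=5.0, clock=time.monotonic):
        self.store = store
        self.refresh_interval_s = refresh_interval_s
        self.clock = clock
        self.buffer = []
        self.next_segment_id = 0
        self.last_flush = clock()

    def add(self, doc):
        """Accept a write; it becomes visible only after the next flush."""
        self.buffer.append(doc)
        if self.clock() - self.last_flush >= self.refresh_interval_s:
            self.flush()

    def flush(self):
        """Write all buffered docs as a single new segment (one PUT-like call)."""
        self.last_flush = self.clock()
        if not self.buffer:
            return None
        key = f"segments/{self.next_segment_id:08d}.json"
        self.store[key] = json.dumps(self.buffer)  # one object per segment
        self.next_segment_id += 1
        self.buffer = []
        return key


def visible_docs(store):
    """Readers only see documents from segments already in the store."""
    docs = []
    for key in sorted(store):
        docs.extend(json.loads(store[key]))
    return docs
```

A longer `refresh_interval_s` means fewer, larger segment objects (fewer write requests) at the price of a longer delay before a write shows up in `visible_docs`, which is the latency tradeoff described in the comment.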


mdaniel 8 months ago

> Nixiesearch uses an S3-compatible block storage (like AWS S3, Google GCS and Azure Blob Storage)

Hair-splitting: I don't believe Blob Storage is S3 compatible, so one may want to consider rewording to distinguish between whether it really, no kidding, needs "S3 compatible" or it's a euphemism for "key value blob storage".

I'm fully cognizant of the 2017 nature of this, but even they are all "use Minio" https://opensource.microsoft.com/blog/2017/11/09/s3cmd-amazon-s3-compatible-apps-azure-storage/ which I guess made a lot more sense before its license change. There's also a more recent question from 2023 (by an alleged Microsoft Employee!) with a very similar "use this shim" answer: https://learn.microsoft.com/en-us/answers/questions/1183760/s3-api-support-over-azure-blob-storage


oersted 8 months ago

Check out Quickwit; it is briefly mentioned but, I think, mistakenly dismissed. They have been working on a similar concept for a few years and the results are excellent. It’s in no way mainly for logs as they claim; it is a general-purpose cloud-native search engine like the one they suggest, and very well engineered.

It is based on Tantivy, a Lucene alternative in Rust. I have extensive hands-on experience with both and I highly recommend Tantivy. It’s just superior in every way now, such a pleasure to use, an ideal example of what Rust was designed for.


gyre007 8 months ago

It took us almost 2 decades, but truly cloud-native architectures are finally becoming a reality. Warp and Turbopuffer are some of the many other examples.


mikeocool 8 months ago

I love all of the software coming out recently backed by simple object storage.

As someone who spent the last decade and a half getting alerts from RDBMSes, I’m basically to the point that if you think your system requires more than object storage for state management, I don’t want to be involved.

My last company looked at rolling out Elastic/OpenSearch to alleviate certain loads from our DB, but it became clear it was just going to be a second monstrously complicated system that was going to require a lot of care and feeding, and we were probably better off spending the time trying to squeeze some additional performance out of our DB.


hipadev23 8 months ago

I know block storage backends are all the rage, but this is about the most capital-intensive thing you can do on the major cloud providers. Storage and reads are cheap, but writes and list operations are insanely expensive.

Once you hook these backends up to real-time streaming updates, transactions, heavy indexing, or immutable backends that cause constant churn (Hive/Hudi/Iceberg/Delta Lake), you're in for a bad time financially.

mhitza 8 months ago

I used offline indexing with Solr back in 2010-2012, because the latency between the Solr server and the MySQL DB (indexing was done via the data import handler) was causing the indexer to take hours instead of under 1 hour (same server vs. servers in the same datacenter).

In many ways Solr has come a long way since, and I'm curious to see how well they can make a similar system perform in the cloud environment.

warangal 8 months ago

I myself have been working on a personal search engine for some time, and one problem I faced was having an effective fuzzy search for all the diverse filenames/directories. All the approaches I could find were based on Levenshtein distance, which would have meant storing the original strings/text content in the index, and would be neither practical for comparing larger strings nor generic enough to handle all knowledge domains. This led me to start looking at locality-sensitive hashing (LSH) approaches to measure the difference between any two strings in constant time. After some work I finally managed to complete an experimental fuzzy search engine (keyword search is just a special case!).

In my analysis of 1 million Hacker News stories, it worked much better than Algolia search while running on a single core! More details are provided in this post: https://eagledot.xyz/malhar.md.html . I tried to submit it here to gather more feedback, but it didn't work, I guess!
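For readers unfamiliar with LSH-style string comparison, one well-known member of the family is MinHash over character n-grams. The sketch below is only an illustration of that general technique and is not claimed to be the method used in the linked post; the function names, the trigram size, and the choice of 64 hash functions are assumptions made for the example.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of a string."""
    text = text.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _hash(seed, gram):
    """One member of a family of hash functions, selected by `seed`."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=64, n=3):
    """Fixed-length signature: the minimum hash of the n-grams per seed."""
    grams = char_ngrams(text, n)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of n-gram Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Because every signature has the same length, comparing two of them takes the same time regardless of how long the original strings were, and the index only needs to keep the signatures rather than the original text.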


marginalia_nu 8 months ago

This would have been a lot easier to read without all the memes and attempts to inject humor into the writing. It's frustrating because it's an otherwise interesting topic :-/


whalesalad 8 months ago

I recently got back into search after not touching ES since like 2012-2013. I forgot how much of a fucking nightmare it is to work with and query. Love to see innovation in this space.


novoreorx 8 months ago

It seems that some of the goals and functionalities of Nixiesearch overlap with those of Turbopuffer [1], though the latter focuses only on vector search. I also agree that a search engine should be stateless and affordable for everyone to deploy.

[1]: https://turbopuffer.com/blog/turbopuffer

mannyv 8 months ago

I forgot that a reindex on Solr/Lucene blows away the index. Now I remember how much of a nightmare that was, because you couldn't find anything until it was done, which usually took a few hours when things were HDD-based.

Just started a search project, and this one will be on the list for sure.

manx 8 months ago

I thought about creating a search engine using https://github.com/phiresky/sql.js-httpvfs, Common Crawl and Cloudflare R2. But I never found the time to start...


ko_pivot 8 months ago

I’m a fan of all these projects that are leveraging S3 to implement high availability / high scalability for traditionally sensitive stateful workloads.

Local caching is a key element of such architectures; otherwise S3 is too slow and expensive to query.
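The local caching point can be illustrated with a small read-through LRU cache placed in front of an object store. This is a hedged sketch, not any particular project's design: `ReadThroughCache` and its `fetch` callback (standing in for a GetObject-style call) are hypothetical names, and a real system would more likely cache on local disk and bound the cache by bytes rather than entry count.

```python
from collections import OrderedDict


class ReadThroughCache:
    """LRU cache that calls `fetch(key)` only on a miss.

    `fetch` is a hypothetical stand-in for a remote object-store read.
    """

    def __init__(self, fetch, capacity=128):
        self.fetch = fetch
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.fetch(key)  # the slow, billed remote request
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value
```

Since Lucene segments are not modified once written (as noted in an earlier comment), a cached segment under a given key never goes stale, so this kind of cache needs eviction but no invalidation.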


huntaub 8 months ago

This is a super cool project, and I think that we will continue to see more and more applications move towards an "on S3" stateless architecture. That's part of the reason why we are building Regatta [1]. We are trying to enable folks who are running software that needs file system semantics (like Lucene) to get super-fast NVMe-like latencies on data that's really in S3. While this is awesome, I worry about all of the applications which don't have someone to rewrite a bunch of layers to work on S3. That's where we come in.

[1] https://regattastorage.com

tomhamer 8 months ago

I might be missing something, but how is this different from Amazon OpenSearch with UltraWarm storage? I think Amazon launched this about 4 years ago, right?


parhamn 8 months ago

Stateless S3 apps have much more appeal given the existence of Cloudflare R2 -- bandwidth is free and GetObject is $0.36 per million requests.
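To put the quoted request price in perspective, here is a small back-of-the-envelope helper. Only the $0.36 per million GetObject figure comes from the comment above; the function name and the example request volume are hypothetical, and actual pricing should be checked against the provider's current price list.

```python
def get_request_cost_usd(num_requests, price_per_million_usd=0.36):
    """Cost of `num_requests` read requests at a flat per-million price."""
    return num_requests / 1_000_000 * price_per_million_usd


# Hypothetical workload: 50 million GetObject calls in a month.
monthly_cost = get_request_cost_usd(50_000_000)
print(f"${monthly_cost:.2f}")
```

With free egress, request count is the main variable cost on the read path in this model, which is why a local cache that absorbs repeated reads changes the bill directly.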

drastic_fred 8 months ago

In a world where recommendations have outpaced full-text search (95%/5%), cost reduction is essential.

ctxcode 8 months ago

Sounds like this is going to cost a lot of money (more than it should).

stroupwaffle 8 months ago

There’s no such thing as stateless, and there’s no such thing as serverless.

The universe is a stateful organism in constant flux.

Put another way: brushing-it-under-the-rug as a service.


cynicalsecurity 8 months ago

This is a great way to waste investors' money.