Having worked with Simon, I can say he knows his sh*t. We talked a lot about what the ideal search stack would look like when we worked together at Shopify on search (him more infra, me more ML+relevance). I discussed how I just want a thing in the cloud to provide my retrieval arms, let me express ranking in a fluent "py-data"-first way, and get out of my way.

My ideal is that turbopuffer ultimately is like a Polars dataframe where all my ranking is expressed in my search API. I could just lazily express some lexical or embedding similarity, boost with various attributes (recency, popularity, etc.) to get a first pass, all with plain dataframe math. Then compute features for a reranking model I run on my side, again dataframe math, and it "just works": runs all of this as some kind of query execution DAG and stays out of my way.
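To make that concrete, here is roughly the first-pass ranking I mean, as pure Polars dataframe math. The score columns (bm25_score, embedding_sim, days_old, popularity) and the blend weights are made up for illustration; none of this is an actual turbopuffer API.

```python
import polars as pl

# Hypothetical candidate set: each row is a document returned by a lexical
# and/or vector retrieval arm, with raw scores already attached.
candidates = pl.LazyFrame({
    "doc_id": [1, 2, 3],
    "bm25_score": [12.3, 8.1, 15.0],      # lexical arm
    "embedding_sim": [0.82, 0.91, 0.70],  # vector arm (cosine similarity)
    "days_old": [3, 40, 1],
    "popularity": [1000, 50, 300],
})

# First-pass ranking is just dataframe math: blend the arms, boost by
# popularity, decay by age, then sort and truncate.
first_pass = (
    candidates
    .with_columns(
        (
            0.6 * pl.col("embedding_sim")
            + 0.4 * (pl.col("bm25_score") / pl.col("bm25_score").max())
            + 0.1 * (pl.col("popularity").log1p() / 10)
            - 0.05 * (pl.col("days_old") / 30)
        ).alias("score")
    )
    .sort("score", descending=True)
    .head(100)
)

print(first_pass.collect())
```

The lazy frame is the point: the whole ranking expression is a query plan the engine could, in principle, push down and execute as a DAG instead of materializing intermediate results on my side.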
Unrelated to the core topic, I really enjoy the aesthetic of their website. Another similar one is from Fixie.ai (also, interestingly, one of their customers).
> $3600.00/TB/month

It doesn't have to be that way.

At Hetzner I pay $200/TB/month for RAM. That's 18x cheaper.

Sometimes you can reach the goal faster and with less complexity by removing the part with the ~18x markup.
> In 2022, production-grade vector databases were relying on in-memory storage

This is irking me. pgvector has existed since before that, doesn't require in-memory storage, and can definitely handle vector search over 100M+ documents in a decently performant manner. Did they have a particular requirement somewhere?
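For reference, the standard pgvector setup is entirely disk-backed: the table and its IVFFlat index (or HNSW in newer releases) live on ordinary Postgres storage, so nothing has to fit in RAM. A minimal sketch, with made-up database, table, and column names:

```python
import psycopg  # psycopg 3; connection string and schema are illustrative

with psycopg.connect("dbname=search") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id bigserial PRIMARY KEY,
            body text,
            embedding vector(768)
        )
    """)
    # The ANN index is a regular Postgres index on disk -- no requirement
    # that the whole thing be resident in memory.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS docs_embedding_idx "
        "ON docs USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000)"
    )
    # Nearest-neighbour lookup: <=> is pgvector's cosine-distance operator.
    query_vec = "[" + ",".join(["0.1"] * 768) + "]"  # stand-in query embedding
    rows = conn.execute(
        "SELECT id, body FROM docs ORDER BY embedding <=> %s::vector LIMIT 10",
        (query_vec,),
    ).fetchall()
```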
Sounds like a source-unavailable version of Quickwit? https://quickwit.io/
Is there a good general-purpose solution where I can store a large read-only database in S3 or something and do lookups directly on it?

DuckDB can open parquet files over HTTP and query them, but I found it triggers a lot of small requests, reading from a bunch of places in the files. I mean a lot.

I mostly need key/value lookups and could potentially store each key in a separate object in S3, but for a couple hundred million objects... it would be a lot more manageable to have a single file and maybe a cacheable index.
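For context, the DuckDB pattern I mean looks roughly like this (bucket, file, and column names are made up, and S3 credential setup is omitted). Even with the file sorted by key so row-group statistics prune most of it, each lookup still seems to need the parquet footer plus at least one row group, which is where the flood of small range requests comes from:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Hypothetical layout: one big parquet file sorted by `key`, so row-group
# min/max statistics let DuckDB skip most of the file on a point lookup.
row = con.execute(
    """
    SELECT value
    FROM read_parquet('s3://my-bucket/kv.parquet')
    WHERE key = ?
    """,
    ["some-key"],
).fetchone()
print(row)
```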
Is it feasible to try to build this kind of approach (hot SSD cache nodes sitting in front of object storage) with prior open-source art (Lucene)? Or are the search indexes themselves also proprietary in this solution?

Having witnessed some very large Elasticsearch production deployments, being able to throw everything into S3 would be *incredible*. The applicability here isn't only for vector search.
A correction to the article. It mentions:

    Warehouse    BigQuery, Snowflake, Clickhouse    ≥1s    Minutes

For ClickHouse, it should be: read latency ≤ 100 ms, write latency ≤ 1 s.

Logging, real-time analytics, and RAG workloads are also a good fit for ClickHouse.
This looks super interesting. I'm not that familiar with vector databases. I thought they were mostly something used for RAG and other AI-related stuff.

Seems like a topic I need to delve into a bit more.
Slightly relevant - do people really want article recommendations? I don't think I've ever read an article and wanted a recommendation. Even with this one - I sort of read it and that's it; no feeling of wanting recommendations.

Am I alone in this?

In any case this seems like a pretty interesting approach. Reminds me of WarpStream, which does something similar with S3 to replace Kafka.
Those are some woefully disappointing and incorrect metrics you've got for ClickHouse there (read and write latency are both sub-second, and the storage medium would be "Memory + Replicated SSDs"), but I understand what you're going for and why you categorized it where you did.