Why HNSW is not the answer and disk-based alternatives might be more practical

138 点作者 kevlened5 个月前

14 条评论

SIMD-accelerated search over binary vectors is very fast and you can fit a lot in memory. Then rerank over f32 from disk.I tried a few alternatives and found that SQLite + usearch extension wins for my use cases (< 1 million records), as measured by latency and simplicity. I believe this approach can scale up to hundreds of millions of records.

评论 #42499647 未加载

generall5 个月前

IVF, unfortunately, is barely compatible with filtered search. It have to rely on post-filtering and retrieve more and more candidates if the result set is not big enough. If the query is in some form correlated with the filter, this approach quickly degrades to brute-force.Surprised that the article doesn't mention filtering use-case at all.

评论 #42499471 未加载

tveita5 个月前

> For example, the typical distance computation complexity between two D-dimensional vectors is O(D^2), but compressing floats into bits reduces this by a factor of 1024 (32x32).This isn't right, cosine distance between two vectors is definitely O(D)…Of course replacing float multiplications with xor+popcount is a nice speedup in computation, but assuming you're memory bandwidth limited, speedup should be linear.

评论 #42499512 未加载

评论 #42498772 未加载

antirez5 个月前

Quantization can be applied exactly in the same way in HNSW. I'm using quantization in the implementation of Redis vector sets and it works very well. I have very big issues with the other points as well, but I prefer to reply to these concerns with the implementation I hope to have into Redis in a short time (early 2025).About insertion / deletion cost. Sure, they are costly, but if instead of grabbing one of the available implementations you start from scratch, extend the paper in sensible ways, and experiment with it, I think it is possible to do much better than one could initially believe.

PaulHoule5 个月前

I've been doing vector search since 2002 or so and it's amazing how far you can get keeping vectors in RAM and using primitive search algorithms, enough that I'm afraid the market for vector databases is 1% of what VC's think it is. (e.g. full scan was good enough for a search engine of international patents and non-patent literature)

评论 #42497345 未加载

评论 #42496990 未加载

评论 #42496907 未加载

评论 #42508476 未加载

ayende5 个月前

The problem with IVF is that you need to find the right centroids. And that doesn't work well if your data grow and mutate over time.Splitting a centroid is a pretty complex issue.As are clustering in an area. For example, let's assume that you hold StackOverflow questions & answers. Now you have a massive amount of additional data (> 25% of the existing dataset) that talks about Rust.You either need to re-calculate the centroids globally, or find a good way to split.The posting list are easy to use, but if you are unbalanced, it gets really bad.

评论 #42499546 未加载

nostrebored5 个月前

Nit: the drawback of “not working well in disk based systems” isn’t a drawback unless you’re already using disk based systems.The difference in recall is also significant — what you really get with HNSW is a system made to give good cost:approximation quality. These IVFPQ based systems are ones I’ve seen people rip and replace if the use case is high value.I really don’t understand the push to make pg do everything. It wasn’t designed for search, and trying to shove these features into the platform feels like some misguided cost optimization that puts all of your data infrastructure on the same critical path.

评论 #42499657 未加载

binarymax5 个月前

HNSW became the defacto default specifically because you don’t need to precalculate the index and it updates as writes come in.This is a great article, and all the points are there, but the truth is that most teams running million scale vectors don’t want the operational complexity of quantizing offline in some frequency. They’ll gladly outsource the costs to paying for RAM instead of some IVFPQ calculation.However if there were a solution that “just worked” to handle PQ for shards in near real time for updates that also had sane filtering, that would be really nice.

评论 #42497428 未加载

评论 #42499576 未加载

评论 #42497243 未加载

adeptima5 个月前

Highly recommend to prompt the following in LLM if you are trying to grasp Hybrid(Keyword, Filter + Vector) search or HNSW vs IVF flavour topicsGraph-Based (HNSW) vs Cluster-Based (IVF)Memory Usage: Higher for HNSW due to graph storage requirements vs Lowerfor IVF; stores cluster centroids and subsets of vectorsAccuracy: High recall and precision, close to exact search vs Recall depends on the number of clusters and visited partitionsBuild Time: Slower for HNSW, Faster for IVFBest for: Real-time, high-accuracy tasks requiring dynamic updates vs Large, static datasets with strict memory constraintsNotable Graph-Based (HNSW) Variants - Faiss HNSW or PANNG (Proximity and Navigable Neighbor Graph)Cluster-Based (IVF) Variants - IVF-PQ for memory-constrained environments, Multi-Index IVF partitions, IVF-GPU

ahevenor5 个月前

HNSW can be used for large indexes when combined with an effective caching strategy. Aerospike uses a distributed cache with query steering approach which allows you to have many indexes, including large ones and load them into memory as your application needs them.

评论 #42502172 未加载

cratermoon5 个月前

"Its reliance on frequent random access patterns makes it incompatible with the sequential access nature of disk storage."Is this true for SSD storage, or does it only apply to spinning metal platters?

评论 #42499607 未加载

srajabi5 个月前

What about DiskANN? <a href="https://milvus.io/docs/disk_index.md" rel="nofollow">https://milvus.io/docs/disk_index.md</a>

rekoros5 个月前

Yep, I believe itHNSW has been great for relatively small datasets (a website with 30K pages), but it’s a dangerous stretch for anything bigger

评论 #42496957 未加载

评论 #42496798 未加载

tw045 个月前

> Its reliance on frequent random access patterns makes it incompatible with the sequential access nature of disk storage.So use SSD? Are people seriously still trying to use spinning disk for database workloads in 2024?

评论 #42497984 未加载

评论 #42499608 未加载

评论 #42498633 未加载

评论 #42497497 未加载