4 comments

jasonjmcghee, over 1 year ago

Really appreciate this post. I've been thinking a lot about HNSW indices and researching / playing with them.

Adding parallelization to the mix could really help with a thought experiment I've been trying to tackle: if you wanted to embed the entire internet in a way that's feasibly hostable and updatable, how do you do it?

Well, it sure can't live in RAM. And as the index gets large, insertion gets very slow.

Let's leave the RAM bit aside for a bit.

So what if we cluster and have one index per cluster? Well, it turns out features often belong to multiple clusters, not one.

OK, so we soft cluster, but it turns out choosing the number of clusters is also very hard. So maybe we use HDBSCAN.

Well, that is very slow at scale too.

I talked about this on Twitter and Leland McInnes responded suggesting UMAP (he's the lead author of both the HDBSCAN and UMAP papers; I was a bit starstruck).

So now we reduce the dimensionality of the embeddings with UMAP to, say, 5 dimensions in order to create soft clusters with HDBSCAN, and create one HNSW index per cluster using the full / non-reduced embeddings (a point counts as a member of a cluster if that cluster is one of its k most probable clusters).

And the problem is starting to get more tractable. Search requires the UMAP step and calling predict proba on HDBSCAN to find which k clusters / HNSW indices to search.

Now the problem is updating... What if we add a bunch of documents and the clusters are effectively different? Seems like either you need to start with a representative sample so you don't need to rebalance, or come up with a reclustering step. The MST of HDBSCAN might simplify this.

So the RAM thing: yeah. Well, now that we have a bunch of individual cluster-based indices, we only need to load the ones the current search requires.

And we might not even need to do that. I built an approach I called "portable hnsw" that actually served the indices as Parquet files, which support range requests, so you don't even need to load the whole thing into memory. (Unless you want to update the index.)

Really interested in your thoughts.
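The routing idea in the comment above (store each point in its k most likely clusters, then search only the k indices the query most likely belongs to) can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the commenter's implementation: nearest-centroid ranking stands in for UMAP + HDBSCAN soft membership, a brute-force scan stands in for each per-cluster HNSW index, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_clusters(vector, centroids, k):
    """Ids of the k nearest centroids.

    ASSUMPTION: a stand-in for "the k most probable soft clusters";
    the comment proposes UMAP + HDBSCAN for this step instead.
    """
    ranked = sorted(range(len(centroids)),
                    key=lambda c: squared_distance(vector, centroids[c]))
    return ranked[:k]


def build_indices(vectors, centroids, k):
    """Map cluster id -> list of (doc_id, vector).

    Each document is stored in the indices of its k most likely clusters,
    always using the full (non-reduced) vector.
    """
    indices = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for c in top_k_clusters(vec, centroids, k):
            indices[c].append((doc_id, vec))
    return indices


def search(query, centroids, indices, k, n_results):
    """Search only the k cluster indices the query routes to.

    ASSUMPTION: the linear scan is a placeholder for an HNSW lookup.
    """
    best = {}
    for c in top_k_clusters(query, centroids, k):
        for doc_id, vec in indices.get(c, []):
            best[doc_id] = squared_distance(query, vec)  # dedupes shared docs
    return sorted(best, key=best.get)[:n_results]


# Toy data, made up for illustration only.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
vectors = [(0.1, 0.2), (9.8, 0.1), (0.3, 9.9), (6.0, 0.0), (0.5, 0.4)]

indices = build_indices(vectors, centroids, k=2)
result = search((1.0, 0.0), centroids, indices, k=2, n_results=3)
print(result)  # [4, 0, 3]
```

Because a document lives in several indices, the search step has to deduplicate hits across clusters (here via the `best` dict). Only the routed indices are touched, which is what would let the remaining ones stay on disk.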

justinclift, over 1 year ago

This seems to be related; it's an implementation for PostgreSQL: https://news.ycombinator.com/item?id=38844945

weeksie, over 1 year ago

Nice one, Gavin!

speps, over 1 year ago

Typo both on HN and TFA. It's "hierarchical". Reminds me of when there's a typo in a codebase and the autocompletion just replicates it everywhere.

Parallelizing HNSW (Hierarchical Navigable Small World) graphs
