Parallelizing HNSW (Hierarchical Navigable Small World) graphs

49 points, by LukeEF, over 1 year ago

4 comments

jasonjmcghee, over 1 year ago
Really appreciate this post. I've been thinking a lot about HNSW and researching / playing with them.

Adding parallelization to the mix could really help with a thought experiment I've been trying to tackle: if you wanted to embed the entire internet in a way that's feasibly hostable and updatable, how would you do it?

Well, it sure can't live in RAM. And as the index gets large, insertion gets very slow.

Let's leave the RAM bit aside for a moment.

So what if we cluster and have one index per cluster? It turns out features often belong to multiple clusters, not just one.

OK, so we soft-cluster. But choosing the number of clusters is also very hard, so maybe we use HDBSCAN.

Well, that is very slow at scale too.

I talked about this on Twitter and Leland McInnes responded suggesting UMAP (he's the lead author of both the HDBSCAN and UMAP papers; I was a bit starstruck).

So now we reduce the dimensionality of the embeddings with UMAP to, say, 5 dimensions, create soft clusters with HDBSCAN, and build one HNSW index per cluster using the full / non-reduced embeddings (a point is a member of each of the k clusters it most probably belongs to).

The problem is starting to get more tractable. Search requires the UMAP step and calling predict_proba on HDBSCAN to find which k clusters / HNSW indices to search.

Now the problem is updating... What if we add a bunch of documents and the clusters are effectively different? It seems like you either need to start with a representative sample so you don't need to rebalance, or come up with a reclustering step. The MST from HDBSCAN might simplify this.

As for the RAM thing: now that we have a bunch of individual cluster-based indices, we only need to load the ones the current search requires.

And we might not even need to do that. I built an approach I called "portable hnsw" that serves the indices as parquet files, which support range requests, so you don't even need to load the whole thing into memory (unless you want to update the index).

Really interested in your thoughts.
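[Editor's note: a rough Python sketch of the pipeline the comment above describes, in case it helps make the idea concrete: UMAP down to ~5 dimensions, HDBSCAN soft clustering, one hnswlib index per cluster built over the full embeddings, and queries routed to only a few indices. The umap-learn / hdbscan / hnswlib calls are real library APIs, but the overall structure, parameter values, and helper names (build, search, TOP_K_CLUSTERS) are illustrative guesses at the approach, not code from the post; hdbscan.membership_vector stands in for the "predict_proba" step mentioned above.]

```python
import numpy as np
import umap      # umap-learn
import hdbscan
import hnswlib

TOP_K_CLUSTERS = 2  # each vector lives in, and each query searches, its 2 most probable clusters


def build(embeddings: np.ndarray):
    """Reduce, soft-cluster, then build one HNSW index per cluster over the full embeddings."""
    dim = embeddings.shape[1]

    # 1. UMAP down to ~5 dimensions so density-based clustering stays tractable.
    reducer = umap.UMAP(n_components=5)
    reduced = reducer.fit_transform(embeddings)

    # 2. Soft clusters via HDBSCAN; prediction_data=True lets us score unseen queries later.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=50, prediction_data=True).fit(reduced)
    memberships = hdbscan.all_points_membership_vectors(clusterer)  # shape (n, n_clusters)
    top_clusters = np.argsort(-memberships, axis=1)[:, :TOP_K_CLUSTERS]

    # 3. One HNSW index per cluster, populated with the original high-dimensional vectors.
    indices = []
    for c in range(memberships.shape[1]):
        idx = hnswlib.Index(space="cosine", dim=dim)
        idx.init_index(max_elements=len(embeddings), ef_construction=200, M=16)
        members = np.where((top_clusters == c).any(axis=1))[0]
        if len(members):
            idx.add_items(embeddings[members], members)
        indices.append(idx)
    return reducer, clusterer, indices


def search(query: np.ndarray, reducer, clusterer, indices, k: int = 10):
    """Route the query to its most probable clusters and search only those indices."""
    reduced_q = reducer.transform(query.reshape(1, -1))
    probs = hdbscan.membership_vector(clusterer, reduced_q)[0]  # the "predict proba" step
    hits = []
    for c in np.argsort(-probs)[:TOP_K_CLUSTERS]:
        kk = min(k, indices[c].get_current_count())
        if kk == 0:
            continue
        labels, dists = indices[c].knn_query(query.reshape(1, -1), k=kk)
        hits.extend(zip(labels[0], dists[0]))
    return sorted(hits, key=lambda h: h[1])[:k]
```

The parquet-backed "portable hnsw" part isn't shown; the relevant point is just that each cluster's index is an independent artifact, so it can be loaded (or range-requested) on demand rather than kept in RAM all at once.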
justinclift, over 1 year ago
This seems to be related, as in being an implementation for PostgreSQL: https://news.ycombinator.com/item?id=38844945
weeksie, over 1 year ago
Nice one, Gavin!
speps, over 1 year ago
Typo both on HN and TFA. It's "hierarchical".

Reminds me of when there's a typo in a codebase and the autocomplete just replicates it everywhere.