If someone is looking for the referenced paper (<i>Accelerating Large-Scale Inference with Anisotropic Vector Quantization</i>), it can be found here: <a href="https://arxiv.org/abs/1908.10396" rel="nofollow">https://arxiv.org/abs/1908.10396</a><p>(It's also linked in the repository's docs/algorithms.md, to which there's a link at the bottom of the main page.)
Right now ANN is a huge (and fascinating) area of active data structures/algorithms research.<p>I do sort of wonder if we will reach a point where, instead of pure recall-per-speed, we'll have a family of algorithms with different trade-offs for different use cases, where we judge a given ANN approach on metrics from that domain instead of just its ability to reproduce a nearest-neighbor result set.<p>For instance, one thing I see assumed is that good recall@N also means the ranking of that top N is good. I'd rather not have to rerank that top N with exact distances myself if I can avoid it (a sketch of what I mean is below).<p>And are there specific vector spaces where one ANN method is preferred? Or will there be some universal approach that just works for everything?<p>I realize it's too early to tell, but these questions always percolate in my mind when we hit a new benchmark in recall at a given speed, especially since I still see people doing more naive things that work perfectly fine for their use case (like KD-trees, random projections, or LSH).
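<p>To make the reranking point concrete, here's a minimal sketch of what I mean (plain NumPy; the index.search call in the usage comment is hypothetical, standing in for whatever ANN library is in play): take the approximate top-N and reorder it by exact distance against the original vectors.
<pre><code>
import numpy as np

def exact_rerank(query, candidate_ids, vectors):
    """Reorder an approximate top-N by exact L2 distance.

    query:         (d,) query vector
    candidate_ids: ids returned by some ANN index (order not trusted)
    vectors:       (n, d) matrix of the original, unquantized vectors
    """
    dists = np.linalg.norm(vectors[candidate_ids] - query, axis=1)
    return [candidate_ids[i] for i in np.argsort(dists)]

# Hypothetical usage with any index exposing a top-N search:
#   approx_ids = index.search(query, n=100)  # fast, ranking not guaranteed
#   top10 = exact_rerank(query, approx_ids, vectors)[:10]
</code></pre>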
Ok, so I am just trying to understand the basic concepts in the paper and put them in my own words:<p>It seems that the primary idea is that quantization precision matters more where there is a high density of neighbors.<p>I.e. at the edges the quantized cells (buckets) can be large, since few items live there, but in high-density areas the buckets should be much smaller, so as to get as even a distribution of objects per bucket as possible.<p>Therefore, a quantization loss function should not be evaluated as a plain sum of squared errors (which treats every region of the vector space the same), but should instead take the density of the vector space into account and use it to weight the errors in different regions (a toy sketch of what I mean is below).<p>To me it seems analogous to a hash set, where the goal is an even distribution (about the same number of items in every bucket).<p>We want to quantize the space so that every bucket holds about the same number of items.
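<p>A toy illustration of the contrast I'm describing (just my own reading; I'm not claiming this is the paper's actual loss): compare a plain mean-squared quantization error with one that weights each point's error by a crude estimate of local density.
<pre><code>
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                   # toy dataset
C = X[rng.choice(len(X), 32, replace=False)]     # 32 codebook centers, random init

# Assign each point to its nearest center; err is the per-point quantization error.
err = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)

# Plain objective: every point counts the same.
plain_loss = err.mean()

# Density-weighted objective: errors in crowded regions cost more than errors
# out at the sparse edges. Density is estimated from the distance to the
# (roughly) k-th nearest neighbor.
k = 20
pairwise = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
knn_dist = np.sort(pairwise, axis=1)[:, k]
density = 1.0 / (knn_dist + 1e-9)
weighted_loss = (density * err).sum() / density.sum()
</code></pre>
Under a weighting like this, a codebook that spends its precision on dense regions scores better than one that treats all regions equally, even if their plain MSE is identical.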
Wow, this looks impressively fast at very reasonable recall rates on the ANN benchmarks. It seems to leave faiss and nmslib in the dust. Pulling up the arXiv paper as we speak to figure out what these guys are doing to achieve such results.
I'm surprised that I don't see DBSCAN, HDBSCAN, Spectral, etc. I don't even <i>recognize</i> these methods. Am I missing something or have the methods I'm familiar with become obsolete that fast?
Would this algorithm suit the case of wanting to find neighbors within a set radius of a point? Does anyone know of an approximate method for doing this?
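<p>For context, the naive workaround I have in mind is over-fetching top-k from an ANN index and then filtering by exact distance; a rough sketch (the index.search method here is hypothetical, standing in for whichever library is used):
<pre><code>
import numpy as np

def approx_radius_query(index, vectors, query, radius, overfetch=200):
    """Approximate radius search: over-fetched top-k, then exact distance filter.

    index:   any ANN index with a hypothetical search(query, k) -> ids method
    vectors: (n, d) original vectors, used for the exact distance check
    """
    ids = index.search(query, k=overfetch)               # approximate candidates
    dists = np.linalg.norm(vectors[ids] - query, axis=1)
    return [i for i, d in zip(ids, dists) if d <= radius]
</code></pre>
The obvious downside is that <i>overfetch</i> caps how many neighbors you can ever return, so I'd be curious whether there's something more principled for radius queries specifically.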