> JVector, the library that powers DataStax Astra vector search, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. This means that the edge lists need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop.

It's interesting to note that JVector accomplishes this differently from how DiskANN described doing it. My understanding (based on the links below, but I didn't read the full diff in #244) is that JVector incrementally compresses the vectors it is using to construct the index, whereas DiskANN described partitioning the vectors into subsets small enough that indexes can be built in memory using uncompressed vectors, building those indexes independently, and then merging the results into one larger index.

OP, have you done any quality comparisons between an index built with JVector using the PQ approach (small-RAM machine) vs. an index built with JVector using the raw vectors during construction (big-RAM machine)? I'd be curious to understand this technique's impact on the final search results.

I'd also be interested to know if any other vector stores support building indexes in limited memory using the partition-then-merge approach described by DiskANN.

Finally, it's been a while since I looked at this stuff, so if I mis-wrote or mis-understood please correct me!

- DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520

- Anisotropic Vector Quantization (PQ compression): https://arxiv.org/abs/1908.10396

- JVector #168: How to support building larger-than-memory indexes: https://github.com/jbellis/jvector/issues/168

- JVector #244: Build indexes using compressed vectors: https://github.com/jbellis/jvector/pull/244
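P.S. To make the PQ side of this concrete, here's a toy sketch of product quantization — my own illustration, not JVector's actual API or code — showing how construction-time candidate scoring can run against small per-subspace codebooks plus one byte per subspace, instead of the full-resolution vectors:

```java
// Toy product-quantization sketch (NOT JVector's API): keep only compressed
// codes + codebooks in memory and score neighbor candidates against them,
// so the raw vectors never need to be resident during graph construction.
import java.util.Random;

public class PqSketch {
    static final int DIM = 8, SUBSPACES = 4, SUBDIM = DIM / SUBSPACES, CENTROIDS = 256;
    // codebooks[s][c] = centroid c for subspace s (trained with k-means in practice)
    static float[][][] codebooks = new float[SUBSPACES][CENTROIDS][SUBDIM];

    // Compress a full vector to one byte per subspace by picking the nearest centroid.
    static byte[] encode(float[] v) {
        byte[] code = new byte[SUBSPACES];
        for (int s = 0; s < SUBSPACES; s++) {
            int best = 0; float bestDist = Float.MAX_VALUE;
            for (int c = 0; c < CENTROIDS; c++) {
                float d = 0;
                for (int j = 0; j < SUBDIM; j++) {
                    float diff = v[s * SUBDIM + j] - codebooks[s][c][j];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            code[s] = (byte) best;
        }
        return code;
    }

    // Approximate squared distance between a full-resolution query and a compressed
    // vector. Real implementations precompute per-subspace distance tables for the
    // query instead of recomputing centroid distances like this.
    static float approxDistance(float[] query, byte[] code) {
        float d = 0;
        for (int s = 0; s < SUBSPACES; s++) {
            float[] centroid = codebooks[s][code[s] & 0xFF];
            for (int j = 0; j < SUBDIM; j++) {
                float diff = query[s * SUBDIM + j] - centroid[j];
                d += diff * diff;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (float[][] sub : codebooks)
            for (float[] c : sub)
                for (int j = 0; j < SUBDIM; j++) c[j] = rnd.nextFloat();
        float[] v = new float[DIM];
        for (int j = 0; j < DIM; j++) v[j] = rnd.nextFloat();
        byte[] code = encode(v);  // 8 floats (32 bytes) -> 4 bytes of codes
        System.out.println("approx dist to itself: " + approxDistance(v, code));
    }
}
```

In this toy setup each vector shrinks from 32 bytes of floats to 4 bytes of codes, which is the kind of headroom that lets the uncompressed vectors stay out of memory while the graph is being built.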
Maybe I'm missing something, but I've created vector embeddings for all of English Wikipedia about a dozen times, and it costs maybe $10 of compute on Colab, not $5000.
I made a side project that uses Wikipedia recently too, and found out that there are database dumps available for download: https://en.wikipedia.org/wiki/Wikipedia:Database_download
You can demo this here: https://jvectordemo.com:8443/

GH project: https://github.com/jbellis/jvector
"The obstacle is that until now, off-the-shelf vector databases could not index a dataset larger than memory, because both the full-resolution vectors and the index (edge list) needed to be kept in memory during index construction. Larger datasets could be split into segments, but this means that at query time they need to search each segment separately, then combine the results, turning an O(log N) search per segment into O(N) overall."<p>How is a log N search over S segments O(N)?
The source files appear to include pages from all namespaces, which is good, because a lot of the value of Wikipedia articles is held in the talk page discussions, and these sometimes get stripped from projects that use Wikipedia dumps.
What are the good solutions in this space? Vector databases, I mean. Mostly for semantic search across various texts.

I have a few projects I'd like to work on. For typical web projects, I have a "go to" stack, and I'd like to add something sensible for vector-based search to it.
This is a giant dataset of 536GB of embeddings. I wonder how much compression is possible by training or fine-tuning a transformer model directly using these embeddings, i.e., no tokenization/decoding steps? Could a 7B or 14B model "memorize" Wikipedia?
How do embeddings created by state of the art open source models compare to the free embeddings mentioned in the article? Would they actually cost 5k to create given a reasonable local GPU setup?
It would be interesting if you tried integrating the Cohere Reranker into this. It should be fairly easy, and could lead to quite a bit of performance gain.
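Roughly, that would mean taking the top-k hits from the vector index and passing them through Cohere's rerank endpoint. A minimal Java sketch of the shape of that call — the endpoint path, model name, and request fields are from memory, and COHERE_API_KEY is my own placeholder, so check Cohere's current docs before relying on it:

```java
// Sketch of re-ranking vector-search hits with Cohere's rerank endpoint.
// Endpoint/model/field names are from memory; real code would also build
// the JSON with a proper library instead of string formatting.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class RerankSketch {
    public static void main(String[] args) throws Exception {
        String query = "who invented the telephone";
        // Imagine these came back from the JVector search over the Wikipedia embeddings.
        List<String> candidates = List.of(
                "Alexander Graham Bell was credited with patenting the first practical telephone.",
                "The telephone exchange was developed in the late 19th century.",
                "Bell Labs was a research organization founded in 1925.");

        String docsJson = String.join("\",\"", candidates);
        String body = """
                {"model":"rerank-english-v3.0","query":"%s","top_n":2,"documents":["%s"]}"""
                .formatted(query, docsJson);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.cohere.ai/v1/rerank"))
                .header("Authorization", "Bearer " + System.getenv("COHERE_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // Response contains relevance scores plus indexes into `documents`;
        // use them to reorder the original hits before returning results.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```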
> Enough RAM to run a JVM with 36GB of heap space

Are there laptops like that? Maybe an upgraded MacBook, but I have been looking for Windows/Linux laptops and they generally top out at 32GB. I checked Lenovo's website and everything with 64GB and up is not called a laptop but a "mobile workstation".
> Disable swap before building the index. Linux will aggressively try to cache the index being constructed to the point of swapping out parts of the JVM heap, which is obviously counterproductive. In my test, building with swap enabled was almost twice as slow as with it off.

This is an indication to me that something has gone very wrong in your code base.
Why is the author listing himself as DataStax CTO?

He isn't, according to Wikipedia, my friend who works there, and their company website: https://www.datastax.com/our-people

That's kind of weird.