> JVector, the library that powers DataStax Astra vector search, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. This means that the edge lists need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop.

It's interesting to note that JVector accomplishes this differently from how DiskANN described doing it. My understanding (based on the links below, but I didn't read the full diff in #244) is that JVector incrementally compresses the vectors it is using to construct the index, whereas DiskANN described partitioning the vectors into subsets small enough that indexes can be built in memory using uncompressed vectors, building those indexes independently, and then merging the results into one larger index.

OP, have you done any quality comparisons between an index built with JVector using the PQ approach (small-RAM machine) vs. an index built with JVector using the raw vectors during construction (big-RAM machine)? I'd be curious to understand this technique's impact on the final search results.

I'd also be interested to know if any other vector stores support building indexes in limited memory using the partition-then-merge approach described by DiskANN.

Finally, it's been a while since I looked at this stuff, so if I mis-wrote or mis-understood please correct me!

- DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520

- Anisotropic Vector Quantization (PQ compression): https://arxiv.org/abs/1908.10396

- JVector #168: How to support building larger-than-memory indexes: https://github.com/jbellis/jvector/issues/168

- JVector #244: Build indexes using compressed vectors: https://github.com/jbellis/jvector/pull/244
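P.S. To make the PQ side of this concrete, here's a toy sketch of product quantization — my own illustration, not JVector's actual API or code — showing how construction-time candidate scoring can run against small per-subspace codebooks plus one byte per subspace, instead of the full-resolution vectors:

```java
// Toy product-quantization sketch (NOT JVector's API): keep only compressed
// codes + codebooks in memory and score neighbor candidates against them,
// so the raw vectors never need to be resident during graph construction.
import java.util.Random;

public class PqSketch {
    static final int DIM = 8, SUBSPACES = 4, SUBDIM = DIM / SUBSPACES, CENTROIDS = 256;
    // codebooks[s][c] = centroid c for subspace s (trained with k-means in practice)
    static float[][][] codebooks = new float[SUBSPACES][CENTROIDS][SUBDIM];

    // Compress a full vector to one byte per subspace by picking the nearest centroid.
    static byte[] encode(float[] v) {
        byte[] code = new byte[SUBSPACES];
        for (int s = 0; s < SUBSPACES; s++) {
            int best = 0; float bestDist = Float.MAX_VALUE;
            for (int c = 0; c < CENTROIDS; c++) {
                float d = 0;
                for (int j = 0; j < SUBDIM; j++) {
                    float diff = v[s * SUBDIM + j] - codebooks[s][c][j];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            code[s] = (byte) best;
        }
        return code;
    }

    // Approximate squared distance between a full-resolution query and a compressed
    // vector. Real implementations precompute per-subspace distance tables for the
    // query instead of recomputing centroid distances like this.
    static float approxDistance(float[] query, byte[] code) {
        float d = 0;
        for (int s = 0; s < SUBSPACES; s++) {
            float[] centroid = codebooks[s][code[s] & 0xFF];
            for (int j = 0; j < SUBDIM; j++) {
                float diff = query[s * SUBDIM + j] - centroid[j];
                d += diff * diff;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (float[][] sub : codebooks)
            for (float[] c : sub)
                for (int j = 0; j < SUBDIM; j++) c[j] = rnd.nextFloat();
        float[] v = new float[DIM];
        for (int j = 0; j < DIM; j++) v[j] = rnd.nextFloat();
        byte[] code = encode(v);  // 8 floats (32 bytes) -> 4 bytes of codes
        System.out.println("approx dist to itself: " + approxDistance(v, code));
    }
}
```

In this toy setup each vector shrinks from 32 bytes of floats to 4 bytes of codes, which is the kind of headroom that lets the uncompressed vectors stay out of memory while the graph is being built.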
Maybe I'm missing something, but I've created vector embeddings for all of English Wikipedia about a dozen times, and it costs maybe $10 of compute on Colab, not $5000.
I made a side project that uses Wikipedia recently too, and found out that there are database dumps available for download: https://en.wikipedia.org/wiki/Wikipedia:Database_download
You can demo this here: https://jvectordemo.com:8443/

GH project: https://github.com/jbellis/jvector
"The obstacle is that until now, off-the-shelf vector databases could not index a dataset larger than memory, because both the full-resolution vectors and the index (edge list) needed to be kept in memory during index construction. Larger datasets could be split into segments, but this means that at query time they need to search each segment separately, then combine the results, turning an O(log N) search per segment into O(N) overall."<p>How is a log N search over S segments O(N)?
The source files appear to include pages from all namespaces, which is good, because a lot of the value of Wikipedia articles is held in the talk page discussions, and these sometimes get stripped from projects that use Wikipedia dumps.
What are the good solutions in this space? Vector databases, I mean. Mostly for semantic search across various texts.

I have a few projects I'd like to work on. For typical web projects, I have a "go to" stack, and I'd like to add something sensible for vector-based search to it.
This is a giant dataset of 536GB of embeddings. I wonder how much compression is possible by training or fine-tuning a transformer model directly using these embeddings, i.e., no tokenization/decoding steps? Could a 7B or 14B model "memorize" Wikipedia?
How do embeddings created by state of the art open source models compare to the free embeddings mentioned in the article? Would they actually cost 5k to create given a reasonable local GPU setup?
It would be interesting if you tried integrating the Cohere Reranker into this. It should be fairly easy, and could lead to quite a bit of performance gain.
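Roughly, that would mean taking the top-k hits from the vector index and passing them through Cohere's rerank endpoint. A minimal Java sketch of the shape of that call — the endpoint path, model name, and request fields are from memory, and COHERE_API_KEY is my own placeholder, so check Cohere's current docs before relying on it:

```java
// Sketch of re-ranking vector-search hits with Cohere's rerank endpoint.
// Endpoint/model/field names are from memory; real code would also build
// the JSON with a proper library instead of string formatting.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class RerankSketch {
    public static void main(String[] args) throws Exception {
        String query = "who invented the telephone";
        // Imagine these came back from the JVector search over the Wikipedia embeddings.
        List<String> candidates = List.of(
                "Alexander Graham Bell was credited with patenting the first practical telephone.",
                "The telephone exchange was developed in the late 19th century.",
                "Bell Labs was a research organization founded in 1925.");

        String docsJson = String.join("\",\"", candidates);
        String body = """
                {"model":"rerank-english-v3.0","query":"%s","top_n":2,"documents":["%s"]}"""
                .formatted(query, docsJson);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.cohere.ai/v1/rerank"))
                .header("Authorization", "Bearer " + System.getenv("COHERE_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // Response contains relevance scores plus indexes into `documents`;
        // use them to reorder the original hits before returning results.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```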
> Enough RAM to run a JVM with 36GB of heap space

Are there laptops like that? Maybe an upgraded MacBook, but I have been looking for Windows/Linux laptops and they generally top out at 32GB. I checked Lenovo's website and everything with 64GB and up is not called a laptop but a "mobile workstation".
> Disable swap before building the index. Linux will aggressively try to cache the index being constructed to the point of swapping out parts of the JVM heap, which is obviously counterproductive. In my test, building with swap enabled was almost twice as slow as with it off.

This is an indication to me that something has gone very wrong in your code base.
Why is the author listing himself as DataStax CTO?

He isn't, according to Wikipedia, my friend who works there, and their company website: https://www.datastax.com/our-people

That's kind of weird.