Dear reader: your intuition is right: it's not "better" to reduce the information by a factor of 32.

The trick the article plays is to follow the binary pass with a KNN search, to compensate for the fact that binarization flattens similarity significantly.

TL;DR from the field:

- This is extremely helpful for doing a first pass over a _ton_ of documents in a resource-constrained environment.

- It is extremely _unhelpful_ unless you're retrieving ~10x the documents you actually want via the binary vectors, then re-ranking that candidate set with the FP32 vectors to order what remains. (A sketch of this two-stage flow is at the end of this comment.)

- In general, you're unlikely to need the technique unless you're A) on the edge, i.e. on consumer devices from 3 years ago, or B) holding tens of millions of vectors on a server. All this stuff sounds really fancy, but when you implement it from scratch, you quickly learn "oh, it's 384 numbers I've gotta multiply together."

Source: I do embeddings, locally, to do retrieval for RAG. I "discovered" this about a year ago, and it deeply pains me to see anything that will misinform a lot of people.

Free bonus I haven't seen mentioned elsewhere yet*: you can take the average of the N chunk embeddings forming a document to decide whether you should look at the N embeddings individually (rough sketch also at the end). This oversmooths too, e.g. my original test document was the GPT-4 sparks paper, and the variety of subjects mentioned plus its length (100+ pages) meant the mean was oversmoothed when I searched for the particular example I wanted it to retrieve (the unicorn SVG).

* Edited to clarify given a reply. Also, my reaction to reading it ("that dude rocks!!") made me want to go off a little bit: if you're not in AI/ML, don't be intimidated by it when it's wrapped in layers of obscure vocabulary. Once you have the time to go do it, things you would have thought were "stupid" turn out to be just fine. And you find like-minded souls. It's exhilarating.
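For concreteness, here's a minimal numpy sketch of the two-stage flow I mean, under my own assumptions (all names, the 384-dim embeddings, and the 10x oversampling factor are illustrative, not from the article):

    import numpy as np

    def binarize(vecs):
        # Sign-quantize FP32 embeddings to bits, packed 8 per byte (32x smaller).
        return np.packbits(vecs > 0, axis=-1)

    def hamming_search(query_bits, doc_bits, k):
        # Popcount of XOR = Hamming distance; smaller = more similar.
        dists = np.unpackbits(query_bits ^ doc_bits, axis=-1).sum(axis=-1)
        return np.argsort(dists)[:k]

    def two_stage_search(query_fp32, docs_fp32, docs_bits, k=10, oversample=10):
        # Stage 1: cheap binary first pass, overfetching ~10x candidates.
        candidates = hamming_search(binarize(query_fp32), docs_bits, k * oversample)
        # Stage 2: exact FP32 re-ranking of the small surviving set.
        scores = docs_fp32[candidates] @ query_fp32
        return candidates[np.argsort(-scores)[:k]]

    # Toy usage: 100k random unit vectors, query near doc 42.
    docs = np.random.randn(100_000, 384).astype(np.float32)
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    docs_bits = binarize(docs)
    query = docs[42] + 0.1 * np.random.randn(384).astype(np.float32)
    print(two_stage_search(query, docs, docs_bits, k=5))

The point of stage 1 is that XOR + popcount over packed bits is dirt cheap and cache-friendly; FP32 dot products only ever touch the ~10x candidate set.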
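And a sketch of the "free bonus" as I use it (again illustrative, not a library API): mean-pool each document's chunk embeddings, search the doc-level means first, and only score individual chunks of the documents that survive the gate.

    import numpy as np

    def doc_gate_search(query, chunks_per_doc, top_docs=5, top_chunks=3):
        """chunks_per_doc: list of (num_chunks_i, dim) arrays, one per document.
        Uses the mean of each document's N chunk embeddings as a cheap gate."""
        # Mean-pool the N chunk embeddings per document. Note this can
        # oversmooth long, varied documents -- see the sparks-paper caveat above.
        doc_means = np.stack([c.mean(axis=0) for c in chunks_per_doc])
        doc_means /= np.linalg.norm(doc_means, axis=1, keepdims=True)
        best_docs = np.argsort(-(doc_means @ query))[:top_docs]
        # Only now look at the individual chunks of the surviving documents.
        results = []
        for d in best_docs:
            scores = chunks_per_doc[d] @ query
            for c in np.argsort(-scores)[:top_chunks]:
                results.append((float(scores[c]), int(d), int(c)))
        return sorted(results, reverse=True)  # (score, doc_idx, chunk_idx)

    # Toy usage: 3 documents with varying chunk counts, normalized query.
    rng = np.random.default_rng(0)
    docs = [rng.standard_normal((n, 384)) for n in (4, 12, 7)]
    q = rng.standard_normal(384)
    q /= np.linalg.norm(q)
    print(doc_gate_search(q, docs)[:3])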