Embeddings are the only aspect of modern AI I'm excited about because they're the only one that gives more power to humans instead of taking it away. They're the "bicycle for our minds" of Steve Jobs fame; intelligence amplification not intelligence replacement. IMO, the biggest improvement in computer usability in my lifetime was the introduction of fast and ubiquitous local search. I use Firefox's "Find in Page" feature probably 10 or more times per day. I use find and grep probably every day. When I read man pages or logs, I navigate by search. Git would be vastly less useful without git grep. Embeddings have the potential to solve the biggest weakness of search by giving us fuzzy search that's actually useful.
> Is it terrible for the environment?<p>> I don’t know. After the model has been created (trained), I’m pretty sure that generating embeddings is much less computationally intensive than generating text. But it also seems to be the case that embedding models are trained in similar ways as text generation models2, with all the energy usage that implies. I’ll update this section when I find out more.<p>Although I do care about the environment, this question is completely the wrong one if you ask me. There is an idea in public opinion (mainstream media?) that we should use less AI and that this would somehow solve our climate problems.<p>As a counterexample, let's go to the extreme. Let's ban Google Maps because it takes computational resources from the phone. As a result more people will take wrong routes, and thus use more petrol. Say you use one gallon of petrol extra; that wastes about 34 kWh, the equivalent of running 34 powerful vacuum cleaners at full power for an hour. In contrast, say you downloaded your map; then the total "cost" is only the power used by the phone. A mobile phone has a battery of about 4 Ah (4 000 mAh), so 4 Ah * 4.2 V = 16.8 Wh, or about 0.017 kWh. This means that the phone is roughly 2 000 times as efficient! And then we didn't even consider the time saved for the human.<p>It's the same with running embeddings for doc generation. An Nvidia H100 consumes about 700 W, so roughly 0.7 kWh after an hour of running; call it 1 kWh. That should be enough for a bunch of embedding runs. If this then saves, for example, one workday including the driving back and forth to the office, then again the tradeoff is highly in favor of the compute.
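A quick sanity check of that arithmetic in Python (the figures are rough assumptions, not measurements):

```python
# Back-of-the-envelope check of the petrol-vs-phone energy comparison.
# Assumptions: ~34 kWh per gallon of petrol, a ~4 Ah phone battery
# at a nominal 4.2 V, fully drained by the navigation session.

PETROL_KWH_PER_GALLON = 34.0   # approximate energy content of one gallon
BATTERY_AH = 4.0               # ~4000 mAh phone battery
BATTERY_VOLTS = 4.2            # nominal cell voltage

battery_kwh = BATTERY_AH * BATTERY_VOLTS / 1000   # Ah * V = Wh, then -> kWh
ratio = PETROL_KWH_PER_GALLON / battery_kwh

print(f"Phone battery: {battery_kwh:.4f} kWh")    # 0.0168 kWh
print(f"Petrol wastes ~{ratio:,.0f}x more energy")
```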
That was a good post. Vector embeddings are in some sense a summary of a doc that's unique to it, similar to a hash code. It makes me think it would be cool if there were some universal standard for generating embeddings, but I guess they'll be different for each AI model, so they can't have the same kind of "permanence" hash codes have.<p>It definitely also seems like there should be lots of ways to utilize cosine similarity (or other closeness algorithms) in databases and other information-processing apps that we haven't really exploited yet. For example, you could almost build a new kind of job search service that matches job descriptions to candidates based on nothing but vector similarity between resume and job description. That's probably so obvious it's already being done.
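The core of that matching idea is just a cosine over two vectors. A minimal sketch in plain Python, with toy 3-d vectors standing in for real model embeddings (which have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" of a resume and two job descriptions.
resume = [0.9, 0.1, 0.3]
job_a  = [0.8, 0.2, 0.4]   # similar direction -> high score
job_b  = [0.1, 0.9, 0.1]   # different direction -> low score

print(cosine_similarity(resume, job_a) > cosine_similarity(resume, job_b))  # True
```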
This is a great post. I’ve also been having a lot of fun working with embeddings, with lots of those pages being documentation. We wrote up a quick post on how we're using them in prod, if you want to go from having an embedding to actually using it in a web app:<p><a href="https://www.ethicalads.io/blog/2024/04/using-embeddings-in-production-with-postgres-django-for-niche-ad-targeting/" rel="nofollow">https://www.ethicalads.io/blog/2024/04/using-embeddings-in-p...</a>
The thing that puzzles me about embeddings is that they're so untargeted: they represent everything about the input string.<p>Is there a method for dimensionality reduction of embeddings for different applications? Let's say I'm building a system to find similar tech support conversations, and I am only interested in the content of the discussion, not the tone of it.<p>How could I derive an embedding that represents only content and not tone?
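One approach I've seen sketched: estimate a "tone" direction (say, from the mean difference between embeddings of casual vs. formal phrasings of the same content) and project it out. A toy sketch of the projection step, with the tone direction as a hypothetical stand-in:

```python
import math

def subtract_direction(v: list[float], d: list[float]) -> list[float]:
    """Remove the component of v along direction d (projection removal)."""
    norm = math.sqrt(sum(x * x for x in d))
    unit = [x / norm for x in d]
    scale = sum(a * b for a, b in zip(v, unit))
    return [a - scale * b for a, b in zip(v, unit)]

# Hypothetical "tone" direction; in practice you'd estimate it from data.
tone_direction = [0.0, 1.0, 0.0]
embedding = [0.5, 0.8, 0.3]

content_only = subtract_direction(embedding, tone_direction)
print(content_only)   # [0.5, 0.0, 0.3] -- tone axis zeroed out
```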
<a href="https://technicalwriting.dev/data/embeddings.html#let-a-thousand-embeddings-bloom" rel="nofollow">https://technicalwriting.dev/data/embeddings.html#let-a-thou...</a><p>> As docs site owners, I wonder if we should start freely providing embeddings for our content to anyone who wants them, via REST APIs or well-known URIs. Who knows what kinds of cool stuff our communities can build with this extra type of data about our docs?<p>Interesting idea. You'd have to specify the exact embedding model used to generate an embedding, right? Is there a well-understood convention for such identification, like say model_name:model_version:model_hash? For technical docs (obviously a very broad field), is there an embedding model (or a small number of them) that is widely used or obviously suitable, such that a site owner could choose one and reasonably expect that embeddings published with it would be useful to others? (Naive questions; I am not embedded in the field.)
Great post indeed! I totally agree that embeddings are underrated. I feel like the "information retrieval/discovery" world is stuck using spears (i.e., term/keyword-based discovery) instead of embracing the modern tools (i.e., semantic-based discovery).<p>The other day I found myself trying to figure out some common themes across a bunch of comments I was looking at. I felt too lazy to go through all of them, so I turned my attention to the "Sentence Transformers" lib. I converted each comment into a vector embedding, applied k-means clustering on these embeddings, then gave each cluster to ChatGPT to summarize the corresponding comments. I have to admit, it was fun doing this and saved me lots of time!
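The clustering step is simple once you have one vector per comment. A minimal pure-Python k-means sketch, with toy 2-d vectors standing in for real sentence-transformer embeddings:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Tiny k-means for small embedding sets (squared Euclidean distance)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            best = min(range(k),
                       key=lambda i: sum((a - b) ** 2
                                         for a, b in zip(v, centroids[i])))
            clusters[best].append(v)
        # Recompute each centroid as the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return clusters

# Toy 2-d "embeddings"; in practice these come from an embedding model,
# one vector per comment.
embeddings = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
clusters = kmeans(embeddings, k=2)
print([len(c) for c in clusters])  # two clusters of two
```

Each resulting cluster's comments can then be handed to an LLM for summarization.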
Cool, first time I've seen one of my posts trend without me submitting it myself. Hopefully it's clear from the domain name and intro that I'm suggesting technical writers are underrating how useful embeddings can be in our work. I know ML practitioners do not underrate them.
Author of txtai (<a href="https://github.com/neuml/txtai">https://github.com/neuml/txtai</a>) here. I've been in the embeddings space since 2020, before the world of LLMs/GenAI.<p>In principle, I agree with much of the sentiment here. Embeddings can get you pretty far. If the goal is to find information and citations/links, you can accomplish most of that with a simple embeddings/vector search.<p>GenAI does have an upside in that it can distill and process those results into something more refined. One of the main production use cases is retrieval augmented generation (RAG). The "R" is usually a vector search but doesn't have to be.<p>As we see with things like ChatGPT search and Perplexity, there is a push towards using LLMs to summarize the results but also linking to the results to increase user confidence. Even Google Search now has that GenAI section at the top. In general, users just aren't going to accept LLM responses without source citations at this point. The question is whether the summary provides value or whether the citations really provide the most value. If it's the latter, then embeddings will get the job done.
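A minimal sketch of that "R" step: rank stored chunks by similarity to a query embedding and return the top-k with their source links. Toy vectors and hypothetical URLs here; a real system would embed the query and documents with a model first:

```python
import math

# Hypothetical index: source URL -> precomputed embedding (toy 2-d vectors).
docs = {
    "https://example.com/install": [0.9, 0.1],
    "https://example.com/config":  [0.2, 0.9],
    "https://example.com/faq":     [0.7, 0.3],
}

def top_k(query: list[float], k: int = 2) -> list[str]:
    """Return the k source URLs whose embeddings are closest to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    ranked = sorted(docs, key=lambda url: cos(query, docs[url]), reverse=True)
    return ranked[:k]

print(top_k([1.0, 0.0]))   # install and faq pages rank highest
```

In a citations-first workflow, this ranked list is the answer; in RAG, it becomes the context passed to the LLM.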
Doesn’t OpenAI's embedding model support 8191/8192 tokens? That aside, declaring a winner by token size is misleading. There are more important factors, like cross-language support and precision, for example.
If anything, I would consider embeddings a bit overrated, or at least it is safer to underrate them.<p>They're not the silver bullet many initially hoped for, and they're not a complete replacement for simpler methods like BM25. They only have very limited "semantic understanding" (and as people throw increasingly large chunks into embedding models, the meanings can get even fuzzier).<p>Overly high expectations let people believe that embeddings will retrieve exactly what they mean, and with larger top-k values and LLMs that are exceptionally good at rationalizing responses, it can be difficult to notice mismatches unless you examine the results closely.
Nice introduction, but I think that ranking the models purely by their input token limits is not a useful exercise. Looking at the MTEB leaderboard is better (although a lot of the models are probably overfitting to their test set).<p>This is a good time to shill my visualization of 5 million embeddings of HN posts, users and comments: <a href="https://tomthe.github.io/hackmap/" rel="nofollow">https://tomthe.github.io/hackmap/</a>
I was using embeddings to group articles by topic, and hit a specific issue. Say I had 10 articles about 3 topics, and the articles were either dry or very casual in tone.<p>I found clustering by topic was hard, because tone dimensions (whatever they were) seemed to dominate.<p>How can you pull apart the embeddings? Maybe use an LLM to extract a topic, and then cluster by extracted topic?<p>In the end I found it easier to just ask an LLM to group the articles by topic.
Great post!<p>One quick minor note is that the resulting embeddings for the same text string could be different, depending on what you specify the input type as for retrieval tasks (i.e. query or document) -- check out the `input_type` parameter here: <a href="https://docs.voyageai.com/reference/embeddings-api" rel="nofollow">https://docs.voyageai.com/reference/embeddings-api</a>.
Embeddings built from techniques like one-hot encoding, count vectorization, and tf-idf, fed into dimensionality-reduction methods like SVD and PCA, have been around for a long time and also provided the ability to compare any two pieces of text. Yes, neural networks and LLMs let the context of each word affect the whole document's embedding and capture more meaning, potentially even that pesky "semantic" sort; but fundamentally they are still a dimensionality-reduction technique.
This article really resonates with me - I've heard people (and vector database companies) describe transformer embeddings + vector databases as primarily a solution for "memory/context for your chatbot, to mitigate hallucinations", which seems like a really specific (and kinda dubious, in my experience) use case for a really general tool.<p>I've found all of the RAG applications I've tried to be pretty underwhelming, but semantic search itself (especially combined with full-text search) is very cool.
My hot take: embeddings are overrated. They are overfitted on word overlap, leading to both many false positives and false negatives. If you identify a specific problem with them ("I really want to match items like these, but it does not work"), it is almost impossible to fix them. I often see them being used inappropriately, by people who read about their magical properties, but didn't really care about evaluating their results.
I'm not sure why the voyage-3 models aren't on the MTEB leaderboard. The code for the leaderboard suggests they should be there: <a href="https://huggingface.co/spaces/mteb/leaderboard/commit/b7faae9e2db6d721cacc15cb923f29b8bb9115a4" rel="nofollow">https://huggingface.co/spaces/mteb/leaderboard/commit/b7faae...</a><p>But I don't see them when I filter the list for 'voyage'.
I wonder if this can be used to detect code similarity, e.g. between functions or files? Or are the existing models overly trained on written prose?
Embeddings are indeed great.
I have been using them a lot.<p>Even wrote about it at:
<a href="https://blog.dobror.com/2024/08/30/how-embeddings-make-your-email-client-better/" rel="nofollow">https://blog.dobror.com/2024/08/30/how-embeddings-make-your-...</a>
Embeddings are a new jump to universality, like the alphabet or numbers. <a href="https://thebeginningofinfinity.xyz/Jump%20to%20Universality" rel="nofollow">https://thebeginningofinfinity.xyz/Jump%20to%20Universality</a>
I have made several successful products in the past few years using primarily embeddings and cosine similarity. Can recommend. It’s amazingly effective (compared to what most people are using today anyway).
The title of the post says they are underrated, but it doesn't provide any real justification beyond saying they are good for X.<p>I am not denying their usefulness, but the title is misleading.
It's fun to try and guess what semantic concepts might be captured within individual dimensions / pairs of dimensions of the embeddings space.
This article shows the incorrect value for the OpenAI text-embedding-3-large Input Limit as 3072, which is actually its output dimension [1]. The correct value is 8191 [2].<p>Edit: This value has now been fixed in the article.<p>[1] <a href="https://platform.openai.com/docs/models/embeddings#embeddings" rel="nofollow">https://platform.openai.com/docs/models/embeddings#embedding...</a><p>[2] <a href="https://platform.openai.com/docs/guides/embeddings/#embedding-models" rel="nofollow">https://platform.openai.com/docs/guides/embeddings/#embeddin...</a><p>Also, what each model means by a token can be very different due to the use of different model-specific encodings, so ultimately one must compare the number of characters, not tokens.