Embeddings are the only aspect of modern AI I'm excited about because they're the only one that gives more power to humans instead of taking it away. They're the "bicycle for our minds" of Steve Jobs fame; intelligence amplification not intelligence replacement. IMO, the biggest improvement in computer usability in my lifetime was the introduction of fast and ubiquitous local search. I use Firefox's "Find in Page" feature probably 10 or more times per day. I use find and grep probably every day. When I read man pages or logs, I navigate by search. Git would be vastly less useful without git grep. Embeddings have the potential to solve the biggest weakness of search by giving us fuzzy search that's actually useful.
> Is it terrible for the environment?<p>> I don’t know. After the model has been created (trained), I’m pretty sure that generating embeddings is much less computationally intensive than generating text. But it also seems to be the case that embedding models are trained in similar ways as text generation models2, with all the energy usage that implies. I’ll update this section when I find out more.<p>Although I do care about the environment, this question is completely the wrong one if you ask me. There is an idea in public opinion (mainstream media?) that we should use less AI and that this would somehow solve our climate problems.<p>As a counterexample, let's go to the extreme. Let's ban Google Maps because it takes computational resources from the phone. As a result more people will take wrong routes, and thus use more petrol. Say you use one gallon of petrol extra; that wastes about 34 kWh, the equivalent of running 34 powerful vacuum cleaners at full power for an hour. In contrast, say you downloaded your map; then the total "cost" is only the power used by the phone. A mobile phone has a battery of about 4 Ah (4 000 mAh), so 4 Ah * 4.2 V = 16.8 Wh, or about 0.017 kWh. This means that the phone is roughly 2 000 times as efficient! And then we didn't even consider the time saved for the human.<p>It's the same with running embeddings for doc generation. An Nvidia H100 consumes about 700 W, so roughly 0.7 kWh after an hour of running; call it 1 kWh. That should be enough for a bunch of embedding runs. If this then saves, for example, one workday including the driving back and forth to the office, then again the tradeoff is highly in favor of the compute.
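A quick sanity check of that arithmetic in Python (the figures are rough assumptions, not measurements):

```python
# Back-of-the-envelope check of the petrol-vs-phone energy comparison.
# Assumptions: ~34 kWh per gallon of petrol, a ~4 Ah phone battery
# at a nominal 4.2 V, fully drained by the navigation session.

PETROL_KWH_PER_GALLON = 34.0   # approximate energy content of one gallon
BATTERY_AH = 4.0               # ~4000 mAh phone battery
BATTERY_VOLTS = 4.2            # nominal cell voltage

battery_kwh = BATTERY_AH * BATTERY_VOLTS / 1000   # Ah * V = Wh, then -> kWh
ratio = PETROL_KWH_PER_GALLON / battery_kwh

print(f"Phone battery: {battery_kwh:.4f} kWh")    # 0.0168 kWh
print(f"Petrol wastes ~{ratio:,.0f}x more energy")
```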
That was a good post. Vector embeddings are in some sense a summary of a doc that's unique to it, similar to a hash code. It makes me think it would be cool if there were some universal standard for generating embeddings, but I guess they'll be different for each AI model, so they can't have the same kind of "permanence" hash codes have.<p>It definitely also seems like there should be lots of ways to utilize cosine similarity (or other closeness algorithms) in databases and other information-processing apps that we haven't really exploited yet. For example, you could almost build a new kind of job search service that matches job descriptions to candidates based on nothing but vector similarity between resume and job description. That's probably so obvious it's already being done.
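The core of that matching idea is just a cosine over two vectors. A minimal sketch in plain Python, with toy 3-d vectors standing in for real model embeddings (which have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" of a resume and two job descriptions.
resume = [0.9, 0.1, 0.3]
job_a  = [0.8, 0.2, 0.4]   # similar direction -> high score
job_b  = [0.1, 0.9, 0.1]   # different direction -> low score

print(cosine_similarity(resume, job_a) > cosine_similarity(resume, job_b))  # True
```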
This is a great post. I’ve also been having a lot of fun working with embeddings, with lots of those pages being documentation. We wrote up a quick post on how we're using them in prod, if you want to go from having an embedding to actually using it in a web app:<p><a href="https://www.ethicalads.io/blog/2024/04/using-embeddings-in-production-with-postgres-django-for-niche-ad-targeting/" rel="nofollow">https://www.ethicalads.io/blog/2024/04/using-embeddings-in-p...</a>
The thing that puzzles me about embeddings is that they're so untargeted: they represent everything about the input string.<p>Is there a method for dimensionality reduction of embeddings for different applications? Let's say I'm building a system to find similar tech support conversations, and I am only interested in the content of the discussion, not the tone of it.<p>How could I derive an embedding that represents only content and not tone?
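One approach I've seen sketched: estimate a "tone" direction (say, from the mean difference between embeddings of casual vs. formal phrasings of the same content) and project it out. A toy sketch of the projection step, with the tone direction as a hypothetical stand-in:

```python
import math

def subtract_direction(v: list[float], d: list[float]) -> list[float]:
    """Remove the component of v along direction d (projection removal)."""
    norm = math.sqrt(sum(x * x for x in d))
    unit = [x / norm for x in d]
    scale = sum(a * b for a, b in zip(v, unit))
    return [a - scale * b for a, b in zip(v, unit)]

# Hypothetical "tone" direction; in practice you'd estimate it from data.
tone_direction = [0.0, 1.0, 0.0]
embedding = [0.5, 0.8, 0.3]

content_only = subtract_direction(embedding, tone_direction)
print(content_only)   # [0.5, 0.0, 0.3] -- tone axis zeroed out
```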
<a href="https://technicalwriting.dev/data/embeddings.html#let-a-thousand-embeddings-bloom" rel="nofollow">https://technicalwriting.dev/data/embeddings.html#let-a-thou...</a><p>> As docs site owners, I wonder if we should start freely providing embeddings for our content to anyone who wants them, via REST APIs or well-known URIs. Who knows what kinds of cool stuff our communities can build with this extra type of data about our docs?<p>Interesting idea. You'd have to specify the exact embedding model used to generate an embedding, right? Is there a well-understood convention for such identification, like say model_name:model_version:model_hash? For technical docs (obviously a very broad field), is there an embedding model (or a small number of them) that is widely used or obviously suitable, such that a site owner could choose one and reasonably expect that embeddings published with it would be useful to others? (Naive questions; I am not embedded in the field.)
Great post indeed! I totally agree that embeddings are underrated. I feel like the "information retrieval/discovery" world is stuck using spears (i.e., term/keyword-based discovery) instead of embracing the modern tools (i.e., semantic-based discovery).<p>The other day I found myself trying to figure out some common themes across a bunch of comments I was looking at. I felt too lazy to go through all of them, so I turned my attention to the "Sentence Transformers" lib. I converted each comment into a vector embedding, applied k-means clustering on these embeddings, then gave each cluster to ChatGPT to summarize the corresponding comments. I have to admit, it was fun doing this and saved me lots of time!
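The clustering step is simple once you have one vector per comment. A minimal pure-Python k-means sketch, with toy 2-d vectors standing in for real sentence-transformer embeddings:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Tiny k-means for small embedding sets (squared Euclidean distance)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            best = min(range(k),
                       key=lambda i: sum((a - b) ** 2
                                         for a, b in zip(v, centroids[i])))
            clusters[best].append(v)
        # Recompute each centroid as the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return clusters

# Toy 2-d "embeddings"; in practice these come from an embedding model,
# one vector per comment.
embeddings = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
clusters = kmeans(embeddings, k=2)
print([len(c) for c in clusters])  # two clusters of two
```

Each resulting cluster's comments can then be handed to an LLM for summarization.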
Cool, first time I've seen one of my posts trend without me submitting it myself. Hopefully it's clear from the domain name and intro that I'm suggesting technical writers are underrating how useful embeddings can be in our work. I know ML practitioners do not underrate them.
Author of txtai (<a href="https://github.com/neuml/txtai">https://github.com/neuml/txtai</a>) here. I've been in the embeddings space since 2020, before the world of LLMs/GenAI.<p>In principle, I agree with much of the sentiment here. Embeddings can get you pretty far. If the goal is to find information and citations/links, you can accomplish most of that with a simple embeddings/vector search.<p>GenAI does have an upside in that it can distill and process those results into something more refined. One of the main production use cases is retrieval augmented generation (RAG). The "R" is usually a vector search but doesn't have to be.<p>As we see with things like ChatGPT search and Perplexity, there is a push towards using LLMs to summarize the results but also linking to the results to increase user confidence. Even Google Search now has that GenAI section at the top. In general, users just aren't going to accept LLM responses without source citations at this point. The question is whether the summary provides value or whether the citations really provide the most value. If it's the latter, then embeddings will get the job done.
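A minimal sketch of that "R" step: rank stored chunks by similarity to a query embedding and return the top-k with their source links. Toy vectors and hypothetical URLs here; a real system would embed the query and documents with a model first:

```python
import math

# Hypothetical index: source URL -> precomputed embedding (toy 2-d vectors).
docs = {
    "https://example.com/install": [0.9, 0.1],
    "https://example.com/config":  [0.2, 0.9],
    "https://example.com/faq":     [0.7, 0.3],
}

def top_k(query: list[float], k: int = 2) -> list[str]:
    """Return the k source URLs whose embeddings are closest to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    ranked = sorted(docs, key=lambda url: cos(query, docs[url]), reverse=True)
    return ranked[:k]

print(top_k([1.0, 0.0]))   # install and faq pages rank highest
```

In a citations-first workflow, this ranked list is the answer; in RAG, it becomes the context passed to the LLM.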
Doesn’t OpenAI's embedding model support 8191/8192 tokens? That aside, declaring a winner by token size is misleading. There are more important factors, like cross-language support and precision, for example.
If anything, I would consider embeddings a bit overrated, or at least it is safer to underrate them.<p>They're not the silver bullet many initially hoped for, and they're not a complete replacement for simpler methods like BM25. They only have very limited "semantic understanding" (and as people throw increasingly large chunks into embedding models, the meanings can get even fuzzier).<p>Overly high expectations let people believe that embeddings will retrieve exactly what they mean, and with larger top-k values and LLMs that are exceptionally good at rationalizing responses, it can be difficult to notice mismatches unless you examine the results closely.
Nice introduction, but I think that ranking the models purely by their input token limits is not a useful exercise. Looking at the MTEB leaderboard is better (although a lot of the models are probably overfitting to their test set).<p>This is a good time to shill my visualization of 5 million embeddings of HN posts, users and comments: <a href="https://tomthe.github.io/hackmap/" rel="nofollow">https://tomthe.github.io/hackmap/</a>
I was using embeddings to group articles by topic, and hit a specific issue. Say I had 10 articles about 3 topics, and the articles were either dry or very casual in tone.<p>I found clustering by topic was hard, because tone dimensions (whatever they were) seemed to dominate.<p>How can you pull apart the embeddings? Maybe use an LLM to extract a topic, and then cluster by extracted topic?<p>In the end I found it easier to just ask an LLM to group the articles by topic.
Great post!<p>One quick minor note is that the resulting embeddings for the same text string could be different, depending on what you specify the input type as for retrieval tasks (i.e. query or document) -- check out the `input_type` parameter here: <a href="https://docs.voyageai.com/reference/embeddings-api" rel="nofollow">https://docs.voyageai.com/reference/embeddings-api</a>.
Embeddings built from techniques like one-hot encoding, count vectorization, and tf-idf, fed into dimensionality-reduction methods like SVD and PCA, have been around for a long time and also provided the ability to compare any two pieces of text. Yes, neural networks and LLMs let the context of each word affect the whole document's embedding and capture more meaning, potentially even that pesky "semantic" sort; but fundamentally they are still a dimensionality-reduction technique.
This article really resonates with me - I've heard people (and vector database companies) describe transformer embeddings + vector databases as primarily a solution for "memory/context for your chatbot, to mitigate hallucinations", which seems like a really specific (and kinda dubious, in my experience) use case for a really general tool.<p>I've found all of the RAG applications I've tried to be pretty underwhelming, but semantic search itself (especially combined with full-text search) is very cool.
My hot take: embeddings are overrated. They are overfitted on word overlap, leading to both many false positives and false negatives. If you identify a specific problem with them ("I really want to match items like these, but it does not work"), it is almost impossible to fix them. I often see them being used inappropriately, by people who read about their magical properties, but didn't really care about evaluating their results.
I'm not sure why the voyage-3 models aren't on the MTEB leaderboard. The code for the leaderboard suggests they should be there: <a href="https://huggingface.co/spaces/mteb/leaderboard/commit/b7faae9e2db6d721cacc15cb923f29b8bb9115a4" rel="nofollow">https://huggingface.co/spaces/mteb/leaderboard/commit/b7faae...</a><p>But I don't see them when I filter the list for 'voyage'.
I wonder if this can be used to detect code similarity, e.g. between functions or files? Or are the existing models overly trained on written prose?
Embeddings are indeed great.
I have been using them a lot.<p>Even wrote about it at:
<a href="https://blog.dobror.com/2024/08/30/how-embeddings-make-your-email-client-better/" rel="nofollow">https://blog.dobror.com/2024/08/30/how-embeddings-make-your-...</a>
Embeddings are a new jump to universality, like the alphabet or numbers. <a href="https://thebeginningofinfinity.xyz/Jump%20to%20Universality" rel="nofollow">https://thebeginningofinfinity.xyz/Jump%20to%20Universality</a>
I have made several successful products in the past few years using primarily embeddings and cosine similarity. Can recommend. It’s amazingly effective (compared to what most people are using today anyway).
The title of the post says they are underrated, but it doesn't provide any real justification beyond saying they are good for X.<p>I am not denying their usefulness, but the title is misleading.
It's fun to try and guess what semantic concepts might be captured within individual dimensions / pairs of dimensions of the embeddings space.
This article shows the incorrect value for the OpenAI text-embedding-3-large Input Limit as 3072, which is actually its output dimension [1]. The correct value is 8191 [2].<p>Edit: This value has now been fixed in the article.<p>[1] <a href="https://platform.openai.com/docs/models/embeddings#embeddings" rel="nofollow">https://platform.openai.com/docs/models/embeddings#embedding...</a><p>[2] <a href="https://platform.openai.com/docs/guides/embeddings/#embedding-models" rel="nofollow">https://platform.openai.com/docs/guides/embeddings/#embeddin...</a><p>Also, what each model means by a token can be very different due to the use of different model-specific encodings, so ultimately one must compare the number of characters, not tokens.