
Ask HN: What do you know about vector embeddings?

6 points by yujian almost 2 years ago

1 comment

PaulHoule almost 2 years ago
I worked on a search engine for patents about 10 years ago that used a vector embedding based on bag-of-words and an autoencoder. It was strong for "more like this" queries like "write a paragraph describing your invention and find relevant patents, applications, and other literature".

Today I use the embeddings from

https://sbert.net/

to do search, clustering, and classification. The big advantage of sbert.net is that it is super easy to use: when I was getting started, it took me 20 minutes to build a classifier based on it, and it performed better than the classifier I was already using that had taken me a weekend to code up.

I developed another classifier using "fine tuning" of BERT models, and after about two weeks of futzing around with it, sometimes running experiments overnight, I arrived at a fairly reliable procedure for creating models that perform about as well as my embedding-based model. But the fine-tuned approach takes about 30 minutes to run, while the embedding-based model takes more like 30 seconds. (My problem is pretty noisy and probably doesn't benefit from fine tuning as much as some tasks would.)

Many people think sbert.net is getting long in the tooth and would benefit from a better model. You could certainly do something similar with a state-of-the-art model such as GPT-4 or T5, but those might not be tuned for similarity search and similar applications. People are reporting results that they like, but I haven't seen a real evaluation run on them.

The main complaint people have is that these models have a limited attention window, and thus a limited document size over which the system can fully take context into account. This is not just an issue of a "better model"; it is a whole-system problem, ranging from how you tokenize the text to what you do with the vectors after you embed them. This talk is the best explanation of the landscape I've seen yet:

https://www.youtube.com/watch?v=BczDZ59seII
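For readers who want to try the workflow described above, here is a minimal sketch using the sentence-transformers package from sbert.net. The model name, toy corpus, and labels are illustrative assumptions, not anything from the comment; it shows both a "more like this" similarity query and a quick classifier trained on frozen embeddings.

```python
# Sketch of embedding-based search and classification with
# sentence-transformers (sbert.net). Model choice and toy data are
# illustrative placeholders, not from the original comment.
from sentence_transformers import SentenceTransformer, util
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small general-purpose model

# --- "More like this" similarity search ---
corpus = [
    "A method for cooling server racks with liquid immersion.",
    "Battery chemistry improvements for electric vehicles.",
    "A technique for compressing neural network weights.",
]
query = "Reducing the memory footprint of deep learning models."

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]  # cosine similarity per document
best = int(scores.argmax())
print(f"Closest document: {corpus[best]!r} (score={scores[best].item():.3f})")

# --- Quick classifier on top of frozen embeddings ---
train_texts = [
    "great product, works well",
    "terrible, broke immediately",
    "love it",
    "waste of money",
]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

X = model.encode(train_texts)  # numpy array, one embedding row per text
clf = LogisticRegression().fit(X, train_labels)
print(clf.predict(model.encode(["pretty good overall"])))
```

The "20 minutes to build a classifier" claim is plausible precisely because the embedding model is frozen: all the training happens in the cheap linear layer on top.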
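The attention-window limitation raised at the end is commonly worked around by chunking a long document, embedding each chunk, and pooling the vectors. The sketch below shows only the most naive version of that idea; the chunk size and mean-pooling choice are assumptions for illustration, and the linked talk covers more sophisticated whole-system options.

```python
# Naive long-document workaround: split into word chunks, embed each,
# and mean-pool into a single vector.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_long_document(text: str, chunk_words: int = 100) -> np.ndarray:
    """Chunk, embed, and mean-pool a document that exceeds the model's window.

    Chunk size and mean pooling are arbitrary choices; real systems often
    keep per-chunk vectors in an index and score them separately instead.
    """
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ] or [""]
    vecs = model.encode(chunks)  # shape: (num_chunks, embedding_dim)
    return vecs.mean(axis=0)     # crude pooling into a single vector
```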