I worked on a search engine for patents about 10 years ago that used a vector embedding based on bag-of-words and an autoencoder; it was strong for "more like this" queries like "write a paragraph describing your invention and find relevant patents, applications and other literature".

Today I use the embeddings from

https://sbert.net/

to do search, clustering, and classification. The big advantage of sbert.net is that it is super easy to use: when I was getting started, it took me 20 minutes to build a classifier on top of it, and that classifier performed better than the one I was already using, which had taken me a weekend to code up. (A sketch of this approach appears at the end of this comment.)

I developed another classifier by fine-tuning BERT models. After about two weeks of futzing around with it, sometimes running experiments overnight, I arrived at a fairly reliable procedure for creating models that perform about as well as my embedding-based model; training, however, takes about 30 minutes to run, while the embedding-based model takes more like 30 seconds. (My problem is pretty noisy and probably doesn't benefit from fine-tuning as much as some tasks would.)

Many people think sbert.net is getting long in the tooth and would benefit from a better model. You could certainly do something similar with a state-of-the-art model such as GPT-4 or T5, but those might not be tuned up for similarity search and similar applications. People are reporting results they like, but I haven't seen a real evaluation run on them.

The main complaint people have is that these models have a limited attention window, and thus a limited document size over which the system can fully take context into account. This is not just a matter of a "better model": it is a whole-system problem, ranging from how you tokenize the text to what you do with the vectors after you embed them. (A naive chunk-and-pool workaround is sketched at the end of this comment.) This talk is the best explanation of the landscape I've seen yet:

https://www.youtube.com/watch?v=BczDZ59seII
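
Here is the kind of embedding-based classifier I mean, as a minimal sketch using sentence-transformers (the library behind sbert.net) plus scikit-learn. The model name, toy data, and choice of logistic regression are my illustrative assumptions, not a prescription:

    # Minimal embedding-based classifier sketch. The model name and the
    # logistic-regression head are illustrative choices, not requirements.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    train_texts = ["cheap pills online", "meeting moved to 3pm"]  # toy data
    train_labels = [1, 0]                                         # 1 = spam

    model = SentenceTransformer("all-MiniLM-L6-v2")
    X = model.encode(train_texts)       # one fixed-length vector per text

    clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

    # Classify new text by embedding it the same way.
    print(clf.predict(model.encode(["free pills, act now"])))

The same model.encode vectors feed search and clustering directly, e.g. via cosine similarity (sentence_transformers.util.cos_sim) or any off-the-shelf clustering algorithm.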
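For comparison, the fine-tuning route looks roughly like the hedged sketch below, using the Hugging Face transformers Trainer. The base checkpoint, epoch count, and toy dataset are assumptions; in practice the two weeks of futzing go into hyperparameters and data handling, not this skeleton:

    # Fine-tuning sketch with Hugging Face transformers; hyperparameters
    # and the bert-base-uncased checkpoint are illustrative assumptions.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    ds = Dataset.from_dict({"text": ["cheap pills online", "meeting at 3pm"],
                            "label": [1, 0]})                     # toy data
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    ds = ds.map(lambda b: tok(b["text"], truncation=True,
                              padding="max_length"), batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    Trainer(model=model,
            args=TrainingArguments(output_dir="out", num_train_epochs=3),
            train_dataset=ds).train()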
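On the attention-window problem, the crudest whole-system workaround is to chunk the document, embed each chunk, and pool the vectors. The sketch below (with chunk and overlap sizes I picked arbitrarily for illustration) shows the idea; mean-pooling throws away cross-chunk context, which is exactly the tradeoff landscape the talk above maps out:

    # Naive chunk-and-pool embedding for documents longer than the model's
    # attention window. Chunk/overlap sizes are arbitrary assumptions.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed_long(text, chunk_words=200, overlap=50):
        words = text.split()
        step = chunk_words - overlap
        starts = range(0, max(len(words) - overlap, 1), step)
        chunks = [" ".join(words[i:i + chunk_words]) for i in starts]
        vecs = model.encode(chunks)       # one vector per chunk
        return np.mean(vecs, axis=0)      # crude: loses cross-chunk context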