Find anything fast with Google's vector search technology

368 points by sshroot over 3 years ago

26 comments

freediver over 3 years ago

I have built multiple systems using vector search, one of them demoed in a search engine for non-commercial content at <a href="http://teclis.com" rel="nofollow">http://teclis.com</a>

Running vector search (sometimes also called semantic search, or one part of a semantic search stack) is a trivial matter with open-source libraries like Faiss <a href="https://github.com/facebookresearch/faiss" rel="nofollow">https://github.com/facebookresearch/faiss</a>

It takes 5 minutes to set up. You can search a billion vectors on common hardware. For low-latency use cases (up to a couple of hundred milliseconds), it is highly unlikely that a cloud solution like this would be a better choice than something deployed on premises, because of the network overhead.

(Worth noting: there are about two dozen vector search libraries, most of them open source, all benchmarked at <a href="http://ann-benchmarks.com/" rel="nofollow">http://ann-benchmarks.com/</a>)

A much more interesting (and harder) problem is creating good vectors to begin with. This refers to the process of converting a text or an image to a multidimensional vector, usually done by a machine learning model such as BERT (for text) or a network trained on ImageNet (for images).

Try entering a query like 'gpt3' or '2019' into the news search demo linked in Google's PR: <a href="https://matchit.magellanic-clouds.com/" rel="nofollow">https://matchit.magellanic-clouds.com/</a>

The results are nonsensical. Not because the vector search didn't do its job well, but because the generated vectors were suboptimal to begin with. Having good vectors is 99% of the semantic search problem.

A nice demo of what semantic search can do is Google's Talk to Books <a href="https://books.google.com/talktobooks/" rel="nofollow">https://books.google.com/talktobooks/</a>

This area of research is fascinating. For those who want to play with this more, an interesting end-to-end (including both vector generation and search) open-source solution is Haystack <a href="https://github.com/deepset-ai/haystack" rel="nofollow">https://github.com/deepset-ai/haystack</a>
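
To make the search half of this concrete: at its core, vector search ranks stored vectors by their similarity to a query vector. The following is a minimal sketch in plain Python, assuming toy three-dimensional vectors and made-up document IDs (real embeddings have hundreds of dimensions and come from a model). It does an exact brute-force scan, which is not how Faiss and similar libraries reach a billion vectors; they use approximate indexes to avoid comparing against every stored vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best matches."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, invented for illustration only.
toy_index = {
    "faiss-tutorial": [0.9, 0.1, 0.0],
    "bert-paper": [0.7, 0.6, 0.1],
    "cooking-blog": [0.0, 0.2, 0.9],
}

results = search([1.0, 0.2, 0.0], toy_index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The scan costs one similarity computation per stored vector, which is fine for thousands of items and is the baseline that approximate nearest-neighbour libraries are measured against.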

323 over 3 years ago

People say Google search is terrible these days, but I find the opposite.

I can vaguely describe in a sentence the gist of an article I've read, or an image, and the proper result will usually be on the first page.

Of course, it doesn't always work; sometimes there are "hash collisions", so to speak. But I don't think the old algorithm would have been more successful either, since if I knew the exact keywords to use, I wouldn't need to start with a vague description in the first place.

thirdtrigger over 3 years ago

Interesting. We are working on an open-source vector search engine called Weaviate and did the same for the complete Wikipedia and Wikidata.

[1] Docs: <a href="https://www.semi.technology/developers/weaviate/current/" rel="nofollow">https://www.semi.technology/developers/weaviate/current/</a>
[2] GitHub: <a href="https://github.com/semi-technologies/weaviate" rel="nofollow">https://github.com/semi-technologies/weaviate</a>
[3] Wikipedia demo dataset: <a href="https://github.com/semi-technologies/semantic-search-through-Wikipedia-with-Weaviate" rel="nofollow">https://github.com/semi-technologies/semantic-search-through...</a>
[4] Wikidata dataset: <a href="https://github.com/semi-technologies/biggraph-wikidata-search-with-weaviate" rel="nofollow">https://github.com/semi-technologies/biggraph-wikidata-searc...</a>

Last week there was also a feature on TechCrunch about vector search and Weaviate: <a href="https://techcrunch.com/2021/12/11/2246180/" rel="nofollow">https://techcrunch.com/2021/12/11/2246180/</a>

gk1 over 3 years ago

It's great to see more and more talk of vector search and vector databases. We've been promoting this technology for over a year now and have several intro articles for anyone looking to learn more[1], and a generous free tier on our vector search service[2] for anyone looking to give vector search a shot.

[1] <a href="https://www.pinecone.io/learn/" rel="nofollow">https://www.pinecone.io/learn/</a>
[2] <a href="https://app.pinecone.io/" rel="nofollow">https://app.pinecone.io/</a>

We are also actively researching the space, and just recently published a paper on improving Google's ScaNN: <a href="https://arxiv.org/abs/2112.02179" rel="nofollow">https://arxiv.org/abs/2112.02179</a>

eob over 3 years ago

My 2022 wish list is a Postgres plugin that adds vector + AKNN support and plays well with relational queries. There are so many use cases for that.

I believe Ant Financial has published an open-source one, but iirc the English-language documentation is sparse.

slig over 3 years ago

Let's say I have a content website with about 20k content pages. I want to automatically cluster the pages so that each page has its related content linked. Right now I'm using a hacked-together tf–idf using sklearn and Python 2, and it just works. The downsides are that I have to compute everything offline whenever I add new content, and that it's one more thing to maintain/upgrade.

I'm wondering if anyone has a suggestion for a SaaS or another alternative for my use case? Thanks!
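
For readers unfamiliar with the setup described here, this is roughly what a tf–idf "related pages" pipeline looks like. It is a minimal sketch in plain Python rather than the poster's sklearn code, with made-up page IDs and texts and a smoothed idf formula chosen for illustration. It also shows where the offline recomputation comes from: the idf weights depend on the whole collection, so adding a page changes every vector.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(pages: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Build one sparse tf-idf vector (term -> weight) per page."""
    tokens = {page_id: tokenize(text) for page_id, text in pages.items()}
    doc_freq: Counter = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))  # count each term once per page
    n_pages = len(pages)
    vectors = {}
    for page_id, words in tokens.items():
        counts = Counter(words)
        vectors[page_id] = {
            # Smoothed idf: a term present on every page still gets weight 1.
            term: (count / len(words))
            * (math.log((1 + n_pages) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related(page_id: str, vectors: Dict[str, Dict[str, float]], k: int = 2) -> List[str]:
    """Return the IDs of the k pages most similar to page_id."""
    target = vectors[page_id]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != page_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


# Hypothetical pages, invented for illustration only.
pages = {
    "faiss-intro": "vector search with a faiss index",
    "qdrant-filters": "vector search engine with payload filters",
    "pasta": "how to cook pasta with tomato sauce",
}
vectors = tfidf_vectors(pages)
print(related("faiss-intro", vectors, k=1))
```

Swapping tf–idf for model-generated embeddings would remove the collection-wide idf dependency, since each page's vector is then computed independently, though the neighbour lists of existing pages would still need refreshing.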

visarga over 3 years ago

What if we had local vector search on our web browser history (the content as it was displayed)? That would be radical. I'm wondering why browser vendors don't scramble to create a personal vector database. It could be integrated through a browser extension to insert local results when doing regular web searches, or provide context for a speech-based personal assistant. Having a neural net at hand could also prove useful for semantic filtering of webpages (hiding or highlighting content) and curating your news feeds.

ShamelessC over 3 years ago

This GitHub repo makes it pretty easy to create similar tech by first embedding any images you have using the released "CLIP" model from OpenAI and then creating a Faiss index over these embeddings for quick retrieval/decode. You can then do text->image and image->image semantic search.

<a href="https://github.com/rom1504/clip-retrieval" rel="nofollow">https://github.com/rom1504/clip-retrieval</a>

monkeybutton over 3 years ago

If you are interested in how ScaNN compares to other approximation algorithms, there are some benchmarks here: <a href="http://ann-benchmarks.com/" rel="nofollow">http://ann-benchmarks.com/</a>

pfd1986 over 3 years ago

More like "Find _something_ fast with vector search". I was not successful in finding anything relevant. PageRank works because it _ranks_ pages by, among other features, number and quality of visitors.

E.g. searching for a Huxley quote gives me silly blog posts about saving money.

Query: "The function of the brain and nervous system is to protect us from being overwhelmed and confused by this mass of largely useless and irrelevant knowledge, by shutting out most of what we should otherwise perceive or remember at any moment, and leaving only that very small and special selection which is likely to be practically useful."

Answer: "How to trick your brain into saving money"

andre-z over 3 years ago

We are developing open-source vector search technology: <a href="https://github.com/qdrant/qdrant" rel="nofollow">https://github.com/qdrant/qdrant</a> It is a neural search engine with extended filtering support that implements a custom modification of the HNSW algorithm for approximate nearest neighbour search. It allows applying search filters, including geolocation, without compromising on results. It is developed entirely in Rust. You can find some demos and documentation here: <a href="https://qdrant.tech" rel="nofollow">https://qdrant.tech</a>
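
A small illustration of why filtering is a design problem for vector search at all. This is a brute-force sketch in plain Python with made-up points and payloads; it is not Qdrant's implementation, which applies filters inside a modified HNSW graph traversal. The sketch only contrasts the two naive strategies: filtering after taking the top k can return too few results, while filtering before ranking keeps all k.

```python
import math
from typing import Callable, Dict, List, NamedTuple, Sequence


class Point(NamedTuple):
    point_id: str
    vector: Sequence[float]
    payload: Dict[str, str]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def post_filter_search(
    query: Sequence[float], points: List[Point], keep: Callable[[Point], bool], k: int
) -> List[str]:
    """Take the k nearest points first, then drop those that fail the filter."""
    nearest = sorted(points, key=lambda p: distance(query, p.vector))[:k]
    return [p.point_id for p in nearest if keep(p)]


def pre_filter_search(
    query: Sequence[float], points: List[Point], keep: Callable[[Point], bool], k: int
) -> List[str]:
    """Drop points that fail the filter first, then take the k nearest survivors."""
    candidates = [p for p in points if keep(p)]
    candidates.sort(key=lambda p: distance(query, p.vector))
    return [p.point_id for p in candidates[:k]]


def in_paris(point: Point) -> bool:
    return point.payload["city"] == "Paris"


# Hypothetical points, invented for illustration only.
points = [
    Point("a", [0.0, 0.1], {"city": "Berlin"}),
    Point("b", [0.1, 0.0], {"city": "Berlin"}),
    Point("c", [0.3, 0.3], {"city": "Paris"}),
    Point("d", [0.9, 0.9], {"city": "Paris"}),
]
query = [0.0, 0.0]

post = post_filter_search(query, points, in_paris, k=2)
pre = pre_filter_search(query, points, in_paris, k=2)
print("post-filter:", post)
print("pre-filter:", pre)
```

Here the two nearest points are both in Berlin, so post-filtering for Paris returns nothing even though two matching points exist. Pre-filtering finds them, but in a brute-force setting it gives up the speed of an approximate index, which is the trade-off that filter-aware indexes aim to avoid.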

currentsapi over 3 years ago

If anyone is interested, I maintain a list of open-source vector search engines and services[1].

Feel free to submit a new issue or merge request if you would like a library added.

[1] <a href="https://github.com/currentsapi/awesome-vector-search" rel="nofollow">https://github.com/currentsapi/awesome-vector-search</a>

dorianmariefr over 3 years ago

This will probably be available as a Postgres extension at some point. It seems like only special indexing of vectors is needed.

amelius over 3 years ago

Does anyone know of a good benchmark suite for search technology? (And how well does the technique in the article perform on it?)

shanghaikid over 3 years ago

If you are not using GCP, or you want an open-source alternative, please check out my project, the Milvus vector database (<a href="https://milvus.io" rel="nofollow">https://milvus.io</a>).

We've published a bunch of demo cases powered by the vector database on GitHub: <a href="https://github.com/milvus-io/bootcamp" rel="nofollow">https://github.com/milvus-io/bootcamp</a>

We have built the Milvus vector database on top of ANN libraries like Faiss, Annoy, NMSLIB, etc.

We are aiming to create a cloud-scalable vector database, so Milvus sits at the crossroads of vector search and cloud databases. There are many interesting system design topics in the development of Milvus 2.0. We will continue to share our experiences and thoughts on this topic.

Hokusai over 3 years ago

It's not very good. I tried different pictures and the results are almost random.

A picture from a cartoon returns anything from logos to any type of drawing. A picture of a battery returns cars and shops. A picture of food worked as expected and I got more food pictures.

ahurmazda over 3 years ago

For similar ANN/vector search capabilities, <a href="https://vespa.ai/" rel="nofollow">https://vespa.ai/</a> is a great open-source solution. Elasticsearch may offer some form of ANN too, but I need to double-check.

Kydlaw over 3 years ago

There is a lot being done in vector search technology right now. I was less fortunate when looking at vector storage. I already looked at Pinecone and Weaviate, but they are both paid products.

Does anyone have feedback on this?

yboris over 3 years ago

I'm curious about the Gensim Doc2Vec model. I used it 3 years ago and got decent results vectorizing articles and then finding articles that were similar to some input text (a half-written article, for example).

What is new here?

<a href="https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py" rel="nofollow">https://radimrehurek.com/gensim/auto_examples/tutorials/run_...</a>

CoolGuySteve over 3 years ago

Is this more or less a k-d tree as a service, where any distance function can be used to index the data?

Or is it something different?

___q over 3 years ago

So how do you game this? "Googlebomb" this? I assume it's harder than with keyword-based search? As a search engine operator, what steps do I take to stop someone from gaming a vector-based search engine?

heisenbit over 3 years ago

Considering the number of possible keywords and comparing it with feasible vector lengths, I wonder whether vector search is weaker when it comes to the long tail.

est over 3 years ago

What's the SQLite equivalent of a vector search engine?

tomcooks over 3 years ago

I would be happy with "find anything with Google search"

Lamad123 over 3 years ago

Now I cannot even find a song on Google or YouTube, even when I search for several lines of the song's lyrics!!

tomc1985 over 3 years ago

So... fuzzy logic

Everything old is new again! Again!