I have built multiple systems using vector search; one of them is demoed in a search engine for non-commercial content at <a href="http://teclis.com" rel="nofollow">http://teclis.com</a><p>Running vector search (also sometimes referred to as semantic search, or as one part of a semantic search stack) is a trivial matter with open-source libraries like Faiss <a href="https://github.com/facebookresearch/faiss" rel="nofollow">https://github.com/facebookresearch/faiss</a><p>It takes 5 minutes to set up. You can search a billion vectors on common hardware. For low-latency use cases (up to a couple of hundred milliseconds), it is highly unlikely that any cloud solution like this would be a better choice than something deployed on premise, because of the network overhead.<p>(It is worth noting that there are about two dozen vector search libraries, most of them open-source, all benchmarked at <a href="http://ann-benchmarks.com/" rel="nofollow">http://ann-benchmarks.com/</a>)<p>A much more interesting (and harder) problem is creating good vectors to begin with. This refers to the process of converting a text or an image to a multidimensional vector, usually done by a machine learning model such as BERT (for text) or a model trained on ImageNet (for images).<p>Try entering a query like 'gpt3' or '2019' into the news search demo linked in Google's PR:<p><a href="https://matchit.magellanic-clouds.com/" rel="nofollow">https://matchit.magellanic-clouds.com/</a><p>The results are nonsensical. Not because the vector search didn't do its job well, but because the generated vectors were suboptimal to begin with. Having good vectors is 99% of the semantic search problem.<p>A nice demo of what semantic search can do is Google's Talk to Books <a href="https://books.google.com/talktobooks/" rel="nofollow">https://books.google.com/talktobooks/</a><p>This area of research is fascinating. 
For those who want to play with this more, an interesting end-to-end (including both vector generation and search) open-source solution is Haystack <a href="https://github.com/deepset-ai/haystack" rel="nofollow">https://github.com/deepset-ai/haystack</a>
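To make the "search is the easy part" point concrete, here is a minimal sketch of what a vector search does, written as an exact brute-force scan in plain Python. It is not how Faiss works internally (Faiss and the other libraries replace the linear scan with approximate indexes), and the document IDs and 3-dimensional vectors are made-up toy values; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def search(index, query, k=2):
    """Exact top-k search: score every stored vector against the query."""
    scored = [(cosine_similarity(vec, query), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy "embeddings", for illustration only.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
print(search(index, query))
```

The ranking is only as good as the vectors that go in: if the model places unrelated documents close together, no index, exact or approximate, will fix the results.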
People say Google search is terrible these days, but I find the opposite.<p>I can vaguely describe in a sentence the gist of an article I've read, or an image, and the proper result will usually be on the first page.<p>Of course, it doesn't always work; sometimes there are "hash collisions", so to speak. But I don't think the old algorithm would have been more successful either, since if I knew the exact keywords to use, I wouldn't need to start with a vague description in the first place.
Interesting – we are working on an open source vector search engine called Weaviate and did the same for the complete Wikipedia and Wikidata.<p>[1] Docs: <a href="https://www.semi.technology/developers/weaviate/current/" rel="nofollow">https://www.semi.technology/developers/weaviate/current/</a><p>[2] Github: <a href="https://github.com/semi-technologies/weaviate" rel="nofollow">https://github.com/semi-technologies/weaviate</a><p>[3] Wikipedia demo dataset: <a href="https://github.com/semi-technologies/semantic-search-through-Wikipedia-with-Weaviate" rel="nofollow">https://github.com/semi-technologies/semantic-search-through...</a><p>[4] Wikidata dataset: <a href="https://github.com/semi-technologies/biggraph-wikidata-search-with-weaviate" rel="nofollow">https://github.com/semi-technologies/biggraph-wikidata-searc...</a><p>Last week there was also a feature on Techcrunch about vector search and Weaviate: <a href="https://techcrunch.com/2021/12/11/2246180/" rel="nofollow">https://techcrunch.com/2021/12/11/2246180/</a>
It's great to see more and more talk of vector search and vector databases. We've been promoting this technology for over a year now and have several intro articles for anyone looking to learn more[1], and a generous free tier on our vector search service[2] for anyone looking to give vector search a shot.<p>[1] <a href="https://www.pinecone.io/learn/" rel="nofollow">https://www.pinecone.io/learn/</a><p>[2] <a href="https://app.pinecone.io/" rel="nofollow">https://app.pinecone.io/</a><p>We are also actively researching the space, and just recently published a paper on improving Google's ScaNN: <a href="https://arxiv.org/abs/2112.02179" rel="nofollow">https://arxiv.org/abs/2112.02179</a>
My 2022 wish list is a Postgres plugin that adds vector + AKNN support and plays well with relational queries. There are so many use cases for that.<p>I believe Ant Financial has published an open-source one, but iirc the English-language documentation is sparse.
Let's say I have a content website with about 20k content pages. I want to automatically cluster the pages so that each page has its related content linked. Right now I'm using a hacked-together tf–idf using sklearn and Python2, and it just works. The downsides are that I have to recompute everything offline whenever I add new content, and that it's one more thing to maintain/upgrade.<p>I'm wondering if anyone has a suggestion of a SaaS or another alternative for my use case? Thanks!
What if we had local vector search on our web browser history (the content as it was displayed)? That would be radical. I'm wondering why browser vendors don't scramble to create the personal vector database. It could be integrated through a browser extension to insert local results when doing regular web searches, or provide context for a speech based personal assistant. Having a neural net at hand could also prove useful in semantic filtering of webpages (hide or highlight content) and curating your news feeds.
This GitHub repo makes it pretty easy to create similar tech by first embedding any images you have using the released "CLIP" model from OpenAI and then creating a Faiss index over these embeddings for quick retrieval/decode. You can then do text->image and image->image semantic search.<p><a href="https://github.com/rom1504/clip-retrieval" rel="nofollow">https://github.com/rom1504/clip-retrieval</a>
If you are interested in how ScaNN compares to other approximation algorithms, there are some benchmarks here:
<a href="http://ann-benchmarks.com/" rel="nofollow">http://ann-benchmarks.com/</a>
More "Find _something_ fast with vector search". I was not successful in finding anything relevant. PageRank works because it _ranks_ pages by, among other features, the number and quality of inbound links.<p>E.g. searching for a Huxley quote gives me silly blog posts about saving money.<p>Query: "The function of the brain and nervous system is to protect us from being overwhelmed and confused by this mass of largely useless and irrelevant knowledge, by shutting out most of what we should otherwise perceive or remember at any moment, and leaving only that very small and special selection which is likely to be practically useful."<p>Answer: "How to trick your brain into saving money"
We are developing open-source vector search technology. <a href="https://github.com/qdrant/qdrant" rel="nofollow">https://github.com/qdrant/qdrant</a> It is a neural search engine with extended filtering support that implements a custom modification of the HNSW algorithm for Approximate Nearest Neighbour search.
It allows applying search filters, including geolocation, without compromising on results. It is developed entirely in Rust. You can find some demos and documentation here: <a href="https://qdrant.tech" rel="nofollow">https://qdrant.tech</a>
If anyone is interested, I maintain a list of open-source vector search engines and services[1].<p>Feel free to submit a new issue or merge request if you would like a new library added.<p>[1] <a href="https://github.com/currentsapi/awesome-vector-search" rel="nofollow">https://github.com/currentsapi/awesome-vector-search</a>
If you are not using GCP or you want an open-source alternative, please check out my project, the Milvus vector database (<a href="https://milvus.io" rel="nofollow">https://milvus.io</a>).<p>We've published a bunch of demo cases powered by the vector database on GitHub. <a href="https://github.com/milvus-io/bootcamp" rel="nofollow">https://github.com/milvus-io/bootcamp</a><p>We have built Milvus on top of ANN libraries like faiss, annoy, nmslib, etc.<p>We are aiming to create a cloud-scalable vector database, so Milvus sits at the crossroads of vector search and cloud databases. There are many interesting system design topics in the development of Milvus 2.0. We will continue to share our experiences and thoughts on this topic.
It's not very good. I tried different pictures and the results are almost random.<p>A picture from a cartoon returns anything from logos to any type of drawing.
A picture of a battery returns cars and shops.
A picture of food worked as expected and I got more food pictures.
For similar ANN/vector search capabilities, <a href="https://vespa.ai/" rel="nofollow">https://vespa.ai/</a> is a great open-source solution. Elasticsearch may offer some form of ANN too, but I would need to double-check.
There is a lot being done in vector search technology right now.
I was less fortunate when looking at vector storage.
I already looked at Pinecone and Weaviate, but they are both paid products.<p>Does anyone have feedback on this?
I'm curious about Gensim's <i>Doc2Vec</i> model. I used it 3 years ago and got decent results vectorizing articles and then finding articles that were similar to some input text (a half-written article, for example).<p>What is new here?<p><a href="https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py" rel="nofollow">https://radimrehurek.com/gensim/auto_examples/tutorials/run_...</a>
So how do you game this? "Googlebomb" this? I assume it's harder than with keyword-based search? As a search engine operator, what steps can I take to stop someone from gaming a vector-based search engine?
Considering the number of possible keywords and comparing this with feasible vector lengths, I wonder whether vector search is weaker when it comes to the long tail.