Embeddings-based search is a nice improvement on search, but it's still search. Relative to ChatGPT answering from its training data, I find embeddings-based search severely lacking; the right comparison is to traditional search, where it comes out favorably.<p>It has the same advantages search has over ChatGPT (being able to cite sources, being quite unlikely to hallucinate) and some of the advantages ChatGPT has over search (not needing an exact query) - but in my experience it's not in the new category of information discovery that ChatGPT introduced us to.<p>Maybe with more context I'll change my tune, but it's very much at the whim of the context retrieval finding everything you need to answer the query. That's easy for the stuff search is already good at, so it provides a better interface for search. But it's hard for the stuff search isn't good at, because, well: it's search.
> "Once these models achieve a high level of comprehension, training larger models with more data may not offer significant improvements (not to be mistaken with reinforcement learning through human feedback). Instead, providing LLMs with real-time, relevant data for interpretation and understanding can make them more valuable."<p>To me this viewpoint looks totally alien. Imagine you have been training this model to predict the next token. At first it can barely interleave vowels and consonants. Then it can start making words, then whole sentences. Then it starts unlocking every cognitive ability one by one. It begins to pass nearly every human test and certification exam and psychological test of theory of mind.<p>Now imagine thinking at this point "training larger models with more data may not offer significant improvements" and deciding that's why you stop scaling it. That makes absolutely no sense to me unless 1) you have no imagination or 2) you want to stop because you are scared to make superhuman intelligence or 3) you are lying to throw off competitors or regulators or other people.
I get annoyed by articles like this. Yes, it's cool to educate readers who aren't aware of embeddings/embedding stores/vector DB technologies that this is possible.<p>What these articles don't touch on is what to do once you've got the most relevant documents. Do you use the whole document as context directly? Do you summarize the documents first using the LLM (adding the risk of hallucination in that step)? What about that trick where you shrink a whole document of context down to the embedding space of a single token (which is how ChatGPT remembers previous conversations)? Doing that would be useful but still lossy.<p>What about simply asking the LLM to craft its own search query to the DB given the user input, rather than returning articles that semantically match the query the closest? This would also make hybrid search (keyword or BM25 + embeddings) more viable in the context of combining it with an LLM (see the sketch just below).<p>Figuring out which of these choices to make, along with an awful lot more choices I'm likely not even thinking about right now, is what will separate the useful from the useless LLM + extractive knowledge systems.
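To make that last idea concrete, here's a minimal sketch of letting the LLM craft the search query and then blending BM25 keyword scores with embedding similarity. This is one way it could look, not anything from the article: the model names, `docs`, and the `alpha` weight are all placeholders to adapt.

```python
# Sketch: let the LLM craft the search query, then blend BM25 keyword
# scores with embedding similarity. Assumes the `openai` and `rank_bm25`
# packages; `docs`, the model names, and `alpha` are placeholders.
import numpy as np
from openai import OpenAI
from rank_bm25 import BM25Okapi

client = OpenAI()
docs = ["...your documents..."]  # the corpus you indexed

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def hybrid_search(user_input: str, alpha: float = 0.5, k: int = 5) -> list[str]:
    # 1. Ask the LLM for a search query instead of embedding the raw input.
    query = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Rewrite this as a concise search query:\n{user_input}"}],
    ).choices[0].message.content

    # 2. Keyword side: BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([d.split() for d in docs])
    kw = np.array(bm25.get_scores(query.split()))

    # 3. Semantic side: cosine similarity against document embeddings
    #    (precompute and cache these in anything real).
    q = embed(query)
    dv = np.array([embed(d) for d in docs])
    sem = dv @ q / (np.linalg.norm(dv, axis=1) * np.linalg.norm(q))

    # 4. Min-max normalize each score list, then blend.
    def norm(s):
        rng = s.max() - s.min()
        return (s - s.min()) / (rng if rng else 1.0)

    combined = alpha * norm(kw) + (1 - alpha) * norm(sem)
    return [docs[i] for i in combined.argsort()[::-1][:k]]
```

The interesting knob is `alpha`: weighting toward BM25 favors exact terminology, weighting toward embeddings favors paraphrases.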
One caveat about embedding-based retrieval is that there is no guarantee that the embedded documents will look like the query.<p>One trick is to have an LLM hallucinate a document based on the query, and then embed that hallucinated document. Unfortunately this increases latency, since it incurs another round trip to the LLM.
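This trick is known in the literature as HyDE (Hypothetical Document Embeddings). A minimal sketch, assuming an OpenAI-style client; `vector_store.search` is a stand-in for whatever nearest-neighbor index you use:

```python
# Sketch of the trick above (HyDE): embed a hallucinated answer instead of
# the query, so the vector looks like the documents in the index.
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(query: str, vector_store, k: int = 5):
    # Extra round trip: write a plausible (possibly wrong) answer passage.
    fake_doc = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    ).choices[0].message.content

    # Embed the hallucinated passage, not the raw query.
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=fake_doc,
    ).data[0].embedding

    # Nearest neighbors: the hallucination is thrown away; only real,
    # citable documents come back.
    return vector_store.search(vec, k=k)
```

The hallucination risk stays contained because the fake document is only ever used as a query vector, never shown to the user.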
I'm working on something where I need to add on the order of 150,000 tokens to the knowledge base of an LLM. Slowly finding out I need to delve into training a whole-ass LLM to do it. Sigh.
Search query expansion: <a href="https://en.wikipedia.org/wiki/Query_expansion" rel="nofollow">https://en.wikipedia.org/wiki/Query_expansion</a><p>We've done this in NLP and search forever. I guess even SQL query planners and other things that automatically rewrite queries might count.<p>It's just that now the parameters seem squishier with a prompt interface. It's almost like we need some kind of symbolic structure again.
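For anyone who hasn't seen it, the classic version is a few lines with a thesaurus; the LLM-era version just replaces the lookup with a prompt. A sketch using NLTK's WordNet (the calls are real; the cutoff of two senses per word is an arbitrary choice):

```python
# Classic query expansion: widen the query with thesaurus synonyms before
# searching. Requires nltk.download("wordnet") once.
from nltk.corpus import wordnet

def expand_query(query: str) -> str:
    terms = set(query.lower().split())
    for word in query.lower().split():
        for syn in wordnet.synsets(word)[:2]:  # top couple of senses only
            terms.update(lemma.name().replace("_", " ") for lemma in syn.lemmas())
    return " ".join(sorted(terms))

# e.g. expand_query("car repair") also pulls in "auto", "automobile", "fix", ...
```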
If you are wondering what the latest is on giving LLMs access to large amounts of data, I think this article is a good start. This seems like a space where there will be a ton of innovation, so I'm interested to learn what else is coming.
A similar idea is being developed in: <a href="https://github.com/pieroit/cheshire-cat">https://github.com/pieroit/cheshire-cat</a>
>There is an important part of this prompt that is partially cut off from the image:<p>>> “If you don't know the answer, just say that you don't know, don't try to make up an answer”<p>//<p>It seems silly to make this part of the prompt rather than a separate parameter; surely we could design the response to be close to factual, then run a checker to ascertain a factuality score for the output (sketched below)?
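One way that separate checker could look, as a hedged sketch rather than anything the article prescribes: a second LLM call grades whether the answer is supported by the retrieved context, and low scores fall back to "I don't know". The model name, grading prompt, and threshold are all illustrative.

```python
# Sketch of a post-hoc factuality check instead of a prompt instruction:
# a second LLM call scores whether the answer is supported by the context.
from openai import OpenAI

client = OpenAI()

def checked_answer(question: str, context: str, answer: str,
                   threshold: float = 0.7) -> str:
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            "On a scale of 0 to 1, how well is this answer supported by the "
            f"context? Reply with only a number.\n\nContext:\n{context}\n\n"
            f"Question: {question}\nAnswer: {answer}"
        )}],
    ).choices[0].message.content
    try:
        score = float(verdict)
    except ValueError:
        score = 0.0  # unparseable verdict: treat as unsupported
    return answer if score >= threshold else "I don't know."
```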
Can we build a model based purely on search?<p>The model searches until it finds an answer, including distance and resolution<p>Search is performed by a DB, the query then sub-queries LLMs on a tree of embeddings<p>Each coordinate of an embedding vector is a pair of coordinate and LLM<p>Like a dynamic dictionary, in which the definition for the word is an LLM trained on the word<p>Indexes become shortcuts to meanings that we can choose based on case and context<p>Does this exist already?
This is like asking GPT to summarize what it found on Google; it's basically what Bing does when you try to find stuff like hotels and other recent subjects. Not the revolution we are all expecting.
"Infinite" is a technical term with a highly specific meaning.<p>In this case, it can't possibly be approached. It certainly can't be attained.<p>Borges' Library of Babel, which represents all possible combinations of letters that can fit into a 400-page book, only contains some 25^1312000 books. And the overwhelming majority of its books are full of gibberish. The amount of "knowledge" that a LLM can learn or describe is VERY strictly bounded and strictly finite. (This is perhaps its defining characteristic.)<p>I know this is pedantic, but I am a philosopher of mathematics and this is a matter that's rather important to me.
I think someone did this <a href="https://github.com/pashpashpash/vault-ai">https://github.com/pashpashpash/vault-ai</a>