This is exactly what <a href="https://www.perplexity.ai/" rel="nofollow">https://www.perplexity.ai/</a> is trying to do. Maybe not "RAGing" the entire internet, but certainly mapping natural-language queries onto their own (probably) vector database, which contains a "source of truth" from the internet.<p>How they build that database, and which models they use for text tokenization, embedding generation and ranking at "internet" scale, is the secret sauce that enabled them to raise more than $165M to date.<p>For sure this is where internet search will be in a couple of years, and that's why Google got really concerned when the original ChatGPT was released. That said, don't assume Google is not already working on something similar. In fact, the main theme of their Google Next conference was LLMs and RAG.
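For anyone who wants the query-to-database mapping spelled out, here is a toy sketch of the retrieve-then-prompt loop. It is not how Perplexity or anyone else actually does it: the documents are made up, the function names are hypothetical, and plain word-count vectors with cosine similarity stand in for learned embeddings and a real vector database. The final prompt would be handed to a language model, which is left out here.

```python
import math
import re
from collections import Counter

# Made-up stand-ins for chunks of crawled pages (assumption, not real data).
DOCUMENTS = [
    "SQLite is an embedded database stored in a single file.",
    "Retrieval-augmented generation retrieves documents before generating an answer.",
    "WordPress is a content management system used for blogs.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Put the retrieved context in front of the question for a language model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("What is an embedded database?", DOCUMENTS))
```

Swapping the word counts for real embeddings changes `retrieve`, not the overall shape: rank stored chunks against the query, then put the winners into the prompt.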
Cool idea. This is a decentralized RAG approach and is useful for individual sites, e.g. those built on WordPress. How do you find the site that you want to "RAG" on, though? Individual domains can be vast, e.g. Google itself.
Well, there's nothing new under the sun. Whatever cooperation model you may have come up with, it has been invented again, and again, and again.<p>Before you invent a new protocol, look at Semantic Web (RDF et al), and Google Microformats, and...
FIYDRI^: The core idea discussed in this post is less about RAG and more about sharing web content in packages that are easier for crawlers to access - including an experiment that uses downloadable SQLite databases for that.<p>^ For If You Didn't Read It
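To give a rough feel for the "downloadable SQLite database" idea, here is a minimal sketch of a site packaging its pages into one SQLite file and a consumer querying it. The `pages` table, its columns, the function names and the example URLs are all assumptions made for illustration; the post's actual schema may look nothing like this. An in-memory database is used so the snippet runs as-is.

```python
import sqlite3


def build_package(conn, pages):
    """Site side: store (url, title, body) rows in a hypothetical 'pages' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)",
        pages,
    )
    conn.commit()


def search_package(conn, term):
    """Consumer side: find pages whose title or body contains the term."""
    pattern = f"%{term}%"
    cursor = conn.execute(
        "SELECT url, title FROM pages "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY url",
        (pattern, pattern),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    # A real package would be a file on disk that crawlers download.
    conn = sqlite3.connect(":memory:")
    build_package(
        conn,
        [
            ("https://example.com/a", "About SQLite", "A single-file database."),
            ("https://example.com/b", "Blog", "Notes on crawling the web."),
        ],
    )
    print(search_package(conn, "crawl"))
```

The appeal is that a crawler fetches one file instead of walking every URL, and can then run whatever queries it likes locally.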
I've been using Kagi's "Quick answer" more and more these days, which I guess is a form of "index the whole web" RAG.<p>Here's their blog article for it: <a href="https://help.kagi.com/kagi/ai/quick-answer.html" rel="nofollow">https://help.kagi.com/kagi/ai/quick-answer.html</a>
You have to fire up your bullshit detector when looking at the results, but I find it saves a good 3/4 clicks on average.
"RAG, or Retrieval-Augmented Generation, is a method where a language model such as ChatGPT first searches for useful information in a large database and then uses this information to improve its responses."