I built this tool because I wanted a way to just take a bunch of URLs or domains, and query their content in RAG applications.<p>It takes away the pain of crawling, extracting content, chunking, vectorizing, and updating periodically.<p>I'm curious to see if it can be useful to others. I meant to launch this six months ago but life got in the way...
I built a similar thing as a Python library that does just that: <a href="https://github.com/philippe2803/contentmap">https://github.com/philippe2803/contentmap</a><p>Blog post that explains the rationale behind the library:
<a href="https://philippeoger.com/pages/can-we-rag-the-whole-web" rel="nofollow">https://philippeoger.com/pages/can-we-rag-the-whole-web</a><p>Just point it at your XML sitemap and a Python class will do the crawling, chunking, vectorizing, and storage in an SQLite file for you. It currently uses the SQLiteVSS integration with LangChain, but I'm thinking of moving away from that and integrating with the new sqlite-vec instead.
I tried it out. This would be extremely useful to me, to the point that I'd happily pay for it, as it's something I would otherwise have had to spend a long time hacking together.<p>1) The returned output from a query seems pretty limited in length and breadth.<p>2) There's no apparent way to adjust my prompts to improve/adjust the output, e.g. it's not really 'conversational' (not sure if that is your intent).<p>Otherwise keep developing, and be sure to push update notifications to your new mailing list! ;-)
In my opinion this is a transitional niche.<p>Soon websites/apps (whatever you want to call them) will have their own built-in handling for AI.<p>It's inefficient and rude to be scraping pages for content. Especially for profit.
I spent a lot of time thinking about how to manage embeddings for docs sites. This is basically the same solution that I landed on but never got around to shipping as a general-purpose product.<p>A key question that the docs should answer (and perhaps the "How it works" page too) is chunking. Do you generate an embedding for the entire page, or do you generate embeddings per section? And what's the size limit per page? Some of our docs pages have thousands of words per page. I'm doubtful you can ingest all of that, let alone that the resulting embedding would be useful in practice.
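To illustrate the per-section option I'm describing, here's a rough sketch of heading-based chunking (the tag list and character cap are just assumptions I'd start from, not anything this product documents):<p><pre><code>from bs4 import BeautifulSoup

def split_by_headings(html: str, max_chars: int = 2000) -> list[dict]:
    # Group a page into (heading, text) sections, then cap section length so
    # no chunk exceeds the embedding model's practical context size.
    soup = BeautifulSoup(html, "html.parser")
    sections, current = [], {"heading": "", "text": ""}
    for el in soup.find_all(["h1", "h2", "h3", "p", "li", "pre"]):
        if el.name in ("h1", "h2", "h3"):
            if current["text"]:
                sections.append(current)
            current = {"heading": el.get_text(strip=True), "text": ""}
        else:
            current["text"] += el.get_text(" ", strip=True) + "\n"
    if current["text"]:
        sections.append(current)

    chunks = []
    for s in sections:
        text = s["text"]
        for i in range(0, len(text), max_chars):
            chunks.append({"heading": s["heading"], "text": text[i:i + max_chars]})
    return chunks
</code></pre>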
I like this. Abstracting away the management of embeddings and vector database is something I desperately want, and adding in website crawling is useful as well.
I like this a lot!<p>But: I feel the more of these services come into being, the more likely it is that every website starts putting up gates to keep the bots away.<p>Sort of like a weird GenAI take on Cixin Liu's Dark Forest hypothesis (<a href="https://en.wikipedia.org/wiki/Dark_forest_hypothesis" rel="nofollow">https://en.wikipedia.org/wiki/Dark_forest_hypothesis</a>).<p>(Edited to add a reference.)
Does anyone know of a way to do this locally with Ollama? The 'chat with documentation' thing is something I was thinking about a week ago when dealing with a hallucinating cloud AI. I think it'd be worth the energy to embed a set of documentation locally to help with development.
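In case it helps, Ollama exposes a local embeddings endpoint you can call directly; a minimal sketch of embedding docs and doing brute-force retrieval (the model choice and the naive cosine search are my assumptions, not a tested recipe):<p><pre><code>import requests
import numpy as np

OLLAMA = "http://localhost:11434"

def embed(text: str, model: str = "nomic-embed-text") -> np.ndarray:
    # Get an embedding vector from a locally running Ollama instance.
    r = requests.post(f"{OLLAMA}/api/embeddings", json={"model": model, "prompt": text})
    r.raise_for_status()
    return np.array(r.json()["embedding"])

def top_k(query: str, docs: list[str], k: int = 3) -> list[str]:
    # Brute-force cosine similarity over a small documentation set.
    q = embed(query)
    scores = []
    for doc in docs:
        v = embed(doc)
        scores.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))))
    ranked = sorted(zip(scores, docs), reverse=True)
    return [d for _, d in ranked[:k]]
</code></pre>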
Looks cool! Anything about how it compares to similar RAG-as-a-service products? That's something I've been researching a little.<p>FWIW, the pricing model of jumping from free to "contact us" is slightly ominous.
<p><pre><code> > Turn any website into a knowledge base for LLMs
</code></pre>
I would pay for the opposite product: make your website completely unusable/unreadable by LLMs while keeping it readable by real humans, with a low false-positive rate.
Could you support ingesting WARC files?<p><a href="https://github.com/harvard-lil/warc-gpt">https://github.com/harvard-lil/warc-gpt</a><p><a href="https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/" rel="nofollow">https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open...</a>
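Reading WARC records is straightforward with warcio, so ingestion could look roughly like this sketch (the ingest() hook is hypothetical):<p><pre><code>from warcio.archiveiterator import ArchiveIterator

def iter_warc_pages(path: str):
    # Yield (url, html) pairs for every HTML response record in a WARC file.
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="replace")
            yield url, html

# for url, html in iter_warc_pages("crawl.warc.gz"):
#     ingest(url, html)  # hypothetical ingestion hook
</code></pre>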
How are you deciding on the best RAG configuration for your app? How do you decide on the chunking strategy, embeddings, and retrievers for your app?
Check out our open-source tool, RAGBuilder, which helps developers get to the top-performing RAG setup for their data:
<a href="https://news.ycombinator.com/item?id=41145093">https://news.ycombinator.com/item?id=41145093</a>
How does this handle changes to the website? Does it re-crawl the whole site periodically and regenerate the embeddings? Or is there some sort of diff checker that only picks up pages that have been changed, added, or deleted?
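For reference, the diff approach I have in mind is roughly this: keep a content hash per URL and only re-embed what changed (purely illustrative; I have no idea what the service actually does):<p><pre><code>import hashlib

def changed_pages(pages: dict[str, str], previous_hashes: dict[str, str]) -> dict[str, list[str]]:
    # Compare freshly crawled pages against stored hashes, returning which URLs
    # need re-embedding, which are new, and which disappeared since the last crawl.
    current = {url: hashlib.sha256(html.encode()).hexdigest() for url, html in pages.items()}
    return {
        "changed": [u for u in current if u in previous_hashes and current[u] != previous_hashes[u]],
        "added":   [u for u in current if u not in previous_hashes],
        "deleted": [u for u in previous_hashes if u not in current],
    }
</code></pre>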
Interesting. I wanted to do this for a personal use case (mostly learning), but with PDFs. What's the tech stack? I have explored using the AWS AI tools, but they seem a bit overkill for what I want to do.
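For the PDF version of this, my rough plan was something like the sketch below, using pypdf plus a local embedding model (the library choices are just my assumptions):<p><pre><code>from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_pdf(path: str, chunk_size: int = 1500):
    # Extract text per page, split into fixed-size chunks, and embed each chunk.
    reader = PdfReader(path)
    chunks = []
    for page_no, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        for i in range(0, len(text), chunk_size):
            chunks.append((page_no, text[i:i + chunk_size]))
    vectors = model.encode([text for _, text in chunks])
    return list(zip(chunks, vectors))
</code></pre>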
Nice! What's the underlying model / RAG approach being used? It'd be good to understand that part, as presumably it has a big impact on the performance and usability of the results.
I feel like this is unethical. You built yet another bot scraper. It would only be an ethical tool if it validated that I own the website being scraped before it starts.
I like the concept, the documentation is very good and I even enjoy the domain name. This is an excellent launch and congratulations on getting it out.
Can I query multiple vectorized websites at once? Can I export vectorized websites and host them myself? Any chance to export them to a no-code format, like PDF?
I find it interesting that as an (edit: UK) academic researcher, I would likely be forbidden to use tools like this, which fail basic ethics standards, regulations such as GDPR, and practical standards such as respecting robots.txt [given there's no information on embedding.io, it's unlikely I can block the crawler when designing a website].<p>There's still room for the ethical development of such crawlers and technologies, but it needs to be consent-first, with strong ethical and legal standards. The unchecked development of such tools has been a massive issue for a number of small online organisations that struggle with poorly implemented or maintained bots (as discussed for OpenStreetMap or Read the Docs).
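For what it's worth, honouring robots.txt on the crawler side is only a few lines with the standard library; a sketch (the user-agent string is a placeholder, since embedding.io doesn't document one):<p><pre><code>from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(url: str, user_agent: str = "embedding-io-bot") -> bool:
    # Check the site's robots.txt before fetching a page.
    # The user-agent string here is a placeholder, not a documented one.
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)
</code></pre>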
#1. Gratuitous self promotion (but also my honest best advice): The future of knowledge bases is ScrollSets: <a href="https://sets.scroll.pub/" rel="nofollow">https://sets.scroll.pub/</a><p>#2. If you are interested in knowledge bases, see #1
> Enterprise: Contact Us<p>If there are no certifications or compliance information, then I don't think there is anything to discuss about any enterprise plan.
I made a similar open-source app a year or so ago: <a href="https://github.com/mkwatson/chat_any_site">https://github.com/mkwatson/chat_any_site</a>