Show HN: Turn any website into a knowledge base for LLMs

305 points by tompec 10 months ago

I built this tool because I wanted a way to just take a bunch of URLs or domains, and query their content in RAG applications.

It takes away the pain of crawling, extracting content, chunking, vectorizing, and updating periodically.

I'm curious to see if it can be useful to others. I meant to launch this six months ago but life got in the way...

42 comments

vladde 10 months ago

The example API key on the page decodes to "WOW YOU'RE A HACKER" :)

jeanloolz 10 months ago

I built a similar thing as a Python library that does just that: https://github.com/philippe2803/contentmap

Blog post that explains the rationale behind the library: https://philippeoger.com/pages/can-we-rag-the-whole-web

Just submit your XML sitemap into a Python class, and it will do the crawling, chunking, vectorizing and storage in an SQLite file for you. It's using the SQLiteVSS integration with Langchain, but I'm thinking of moving away from it and doing an integration with the new sqlite-vec instead.

23B1 10 months ago

I tried it out. This would be extremely useful to me, to the point that I'd happily pay for it, as it's something I would otherwise have had to spend a long time hacking together.

1) The returned output from a query seems pretty limited in length and breadth.

2) No apparent way to adjust my prompts to improve/adjust the output, e.g. it's not really 'conversational' (not sure if that is your intent).

Otherwise keep developing and be sure to push update notifications to your new mailing list! ;-)

MattDaEskimo 10 months ago

In my opinion this is a transitional niche.

Soon websites/apps, whatever you want to call them, will have their own built-in handling for AI.

It's inefficient and rude to be scraping pages for content. Especially for profit.

kaycebasques 10 months ago

I spent a lot of time thinking about how to manage embeddings for docs sites. This is basically the same solution that I landed on but never got around to shipping as a general-purpose product.

A key question that the docs should answer (and perhaps the "How it works" page too): chunking. Do you generate an embedding for the entire page? Or do you generate embeddings for sections? And what's the size limit per page? Some of our docs pages have thousands of words per page. I'm doubtful you can ingest all that, let alone whether the embedding would be that useful in practice.
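
For readers weighing the page-versus-section question, here is a minimal sketch of section-level chunking using fixed-size word windows with overlap. The helper name `chunk_words` and the default sizes are assumptions made for illustration; the thread does not say how the tool actually chunks pages.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows (hypothetical helper)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the page
    return chunks
```

Each chunk would then get its own embedding, so a page with thousands of words becomes a few dozen retrievable sections rather than one diluted vector. The overlap keeps a sentence that straddles a boundary retrievable from either side.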

crowcroft 10 months ago

I like this. Abstracting away the management of embeddings and vector database is something I desperately want, and adding in website crawling is useful as well.

muggermuch 10 months ago

I like this a lot!

But: I feel the more of these services come into being, the more likely it is that every website starts putting up gates to keep the bots away.

Sort of like a weird GenAI take on Cixin Liu's Dark Forest hypothesis (https://en.wikipedia.org/wiki/Dark_forest_hypothesis).

(Edited to add a reference.)

replete 10 months ago

Does anyone know of a way to do this locally with Ollama? The 'chat with documentation' thing is something I was thinking of a week ago when dealing with hallucinating cloud AI. I think it'd be worth the energy to embed a set of documentation locally to help with development.

have_faith 10 months ago

Looks cool! Anything about how it compares to similar RAG-as-a-service products? It's something I've been researching a little.

FWIW, the pricing model of jumping from free to "contact us" is slightly ominous.

is_true 10 months ago

Do you plan on doing revenue sharing with the site owners?

dmitrygr 10 months ago

> Turn any website into a knowledge base for LLMs

I would pay for the opposite product: make your website completely unusable/unreadable by LLMs while readable by real humans, with low false positive rates.

toomuchtodo 10 months ago

Could you support ingesting WARC files?

https://github.com/harvard-lil/warc-gpt

https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/

ashwinnair99 10 months ago

How are you deciding on the best RAG configuration for your app? How do you decide on the chunking strategy, embeddings and retrievers? Check out our open source tool, RAgBuilder, which helps developers get to the top-performing RAG for their data: https://news.ycombinator.com/item?id=41145093

lua-steve 10 months ago

How does this handle changes to the website? Does it re-crawl the whole site periodically and regenerate the embeddings? Or is there some sort of diff-checker that only picks up pages that have been changed, added, or deleted?
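
The thread does not say which approach the tool takes. As an illustration of the diff-checker idea, here is a minimal sketch that stores one content hash per URL and compares it against a fresh crawl. The names `content_hash` and `diff_crawl` are hypothetical, and SHA-256 over the extracted text is an assumption.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_crawl(previous: dict[str, str], current_pages: dict[str, str]):
    """Compare stored hashes (URL -> hash) with a fresh crawl (URL -> text).

    Returns (added, changed, deleted, new_hashes); only the added and
    changed URLs would need their embeddings regenerated.
    """
    current = {url: content_hash(body) for url, body in current_pages.items()}
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    changed = sorted(
        url for url in current.keys() & previous.keys()
        if current[url] != previous[url]
    )
    return added, changed, deleted, current
```

Hashing the extracted text rather than the raw HTML avoids re-embedding a page when only its markup or a rotating banner changes, though that depends on how stable the extraction step is.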

samuria 10 months ago

Interesting, I wanted to do this for a personal use case (mostly learning), but with PDFs. What's the tech stack? I have explored using the AWS AI tools, but they seem a bit overkill for what I want to do.

dazbradbury 10 months ago

Nice! What's the underlying model / RAG approach being used? Be good to understand that part as presumably it will have a big impact on performance / usability of the results.

jakubsuchy 10 months ago

I feel like this is unethical. You built yet another bot scraper. It would only be an ethical tool if it validated I own the website I am scraping before it starts.

hluska 10 months ago

I like the concept, the documentation is very good and I even enjoy the domain name. This is an excellent launch and congratulations on getting it out.

danirogerc 10 months ago

Can I query multiple vectorized websites at once? Can I export vectorized websites and host them myself? Any chance to export them to a no-code format, like PDF?

Cynddl 10 months ago

I find it interesting that as an (edit: UK) academic researcher, I would likely be forbidden to use tools like this, which fail basic ethics standards, regulations such as GDPR, and practical standards such as respecting robots.txt [given there's no information on embedding.io, it's unlikely I can block the crawler when designing a website].

There's still room for ethical development of such crawlers and technologies, but it needs to be consent-first, with strong ethical and legal standards. The crazy development of such tools has been a massive issue for a number of small online organisations that struggle with poorly implemented or maintained bots (as discussed for OpenStreetMap or Read The Docs).

ancras 10 months ago

This is interesting. Can it work with any website, even say document repositories hosted on standard servers like gitbook?

nashashmi 10 months ago

Would you share the source? I want to use this for a private internal network of pages. How would that work?

oars 10 months ago

Interested in seeing whether this will be widespread in 5 years or whether sites will have fought back.

alok-g 10 months ago

Would be great to use for developer documentation for various languages, frameworks and libraries.

blackeyeblitzar 10 months ago

Is there a way to deal with websites where you need to log in? Like subscription-based sites?

michaelmior 10 months ago

This looks interesting, but I get a 404 on the iframe when I try to go into the chat.

breck 10 months ago

#1. Gratuitous self promotion (but also my honest best advice): The future of knowledge bases is ScrollSets: https://sets.scroll.pub/

#2. If you are interested in knowledge bases, see #1

mattfrommars 10 months ago

So I provide a URL, and your service does the crawling of the site?

suyash 10 months ago

Can it get content that is gated/behind a login?

vulture916 10 months ago

Experiencing many Internal Server Errors.

rvz 10 months ago

> Enterprise: Contact Us

If there are no certifications or compliance information, then I don't think there is anything to discuss about any enterprise plan.

olalonde 10 months ago

Which LLM model is it using?

barrenko 10 months ago

Will this work for forums?

rcarmo 10 months ago

How do I feed it a sitemap?

ckluis 10 months ago

How much does it cost?

boredemployee 10 months ago

Does it embed images as well? If not, do you plan to do so?

cranberryturkey 10 months ago

How does this work?

suyash 10 months ago

Any open source tools for doing just this?

r0b05 10 months ago

Does it hallucinate much?

mkw5053 10 months ago

I made a similar open source app a year ago or so: https://github.com/mkwatson/chat_any_site

pryelluw 10 months ago

Does this respect robots.txt?
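
Whether this particular tool does is not answered in the text here. For context, this is a minimal sketch of how a crawler could filter URLs against robots.txt rules with Python's standard `urllib.robotparser`. The helper name `allowed_urls`, the user agent string and the example rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that the given robots.txt permits for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a live crawler would instead call
    # set_url() with the site's robots.txt location and then read().
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules and crawler name, for illustration only.
rules = "User-agent: *\nDisallow: /private/\n"
candidates = ["https://example.com/docs", "https://example.com/private/page"]
print(allowed_urls(rules, "example-bot", candidates))
```

A site owner can only write a rule for a specific crawler if it publishes the user agent string it sends.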

khanan 10 months ago

Can this be deployed on-prem or is it a cloud toy?
