Show HN: Open-source ETL framework to sync data from SaaS tools to vector stores

63 points by jasonwcfan about 2 years ago
Hey Hacker News, we launched a few weeks ago as a GPT-powered chatbot for developer docs, and quickly realized that the value of what we’re doing isn’t the chatbot itself. Rather, it’s the time we save developers by automating the extraction of data from their SaaS tools (GitHub, Zendesk, Salesforce, etc.) and helping transform it into contextually relevant chunks that fit into GPT’s context window.

A lot of companies are building prototypes with GPT right now, and they’re all using some combination of Langchain/Llama Index + Weaviate/Pinecone + GPT3.5/GPT4 as their stack for retrieval augmented generation (RAG). This works great for prototypes, but what we learned was that as you scale your RAG app to more users and ingest more sources of content, it becomes a real pain to manage your data pipelines.

For example, if you want to ingest your developer docs, process them into chunks of <500 tokens, and add those chunks to a vector store, you can build a prototype with Langchain fairly quickly. However, if you want to deploy it to customers like we did for BentoML (https://www.bentoml.com/), you’ll quickly realize that a naive chunking method that splits by character/token leads to poor results, and that “delete and re-vectorize everything” when the source docs change doesn’t scale as a data synchronization strategy.

We took the code we used to build chatbots for our early customers and turned it into an open source framework to rapidly build new data Connectors and Chunkers. This way developers can use community-built Connectors and Chunkers to start running vector searches on data from any source in a matter of minutes, or write their own in a matter of hours.

Here’s a video demo: https://youtu.be/I2V3Cu8L6wk

The repo has instructions on how to get started and set up API endpoints to load, chunk, and vectorize data quickly. Right now it only works with websites and GitHub repos, but we’ll be adding Zendesk, Google Drive, and Confluence integrations soon too.
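To make the two pain points above concrete, here is a minimal sketch of heading-aware chunking plus hash-based incremental sync. It is not the framework's actual API: `chunk_markdown`, `chunk_id`, `sync`, the dict-based `store`, and the `embed` callback are all hypothetical names chosen for illustration, and "tokens" are approximated as whitespace-separated words rather than counted with a real tokenizer.

```python
import hashlib
import re

MAX_TOKENS = 500  # budget mentioned in the post; approximated here as word count


def chunk_markdown(text, max_tokens=MAX_TOKENS):
    """Split at markdown headings first, then cap each section at max_tokens words."""
    # Zero-width split: each heading line starts a new section.
    sections = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        words = section.split()
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks


def chunk_id(source, chunk):
    """Content-derived ID: an unchanged chunk keeps the same ID across syncs."""
    return hashlib.sha256(f"{source}\n{chunk}".encode("utf-8")).hexdigest()


def sync(store, source, text, embed):
    """Embed only new chunks for `source` and delete stale ones.

    `store` is a plain dict standing in for a vector store (assumption);
    `embed` is any callable mapping text to a vector (assumption).
    Returns (number_added, number_removed).
    """
    fresh = {chunk_id(source, c): c for c in chunk_markdown(text)}
    fresh_ids = set(fresh)
    existing_ids = {cid for cid, rec in store.items() if rec["source"] == source}

    stale_ids = existing_ids - fresh_ids
    new_ids = fresh_ids - existing_ids

    for cid in stale_ids:
        del store[cid]
    for cid in new_ids:
        store[cid] = {"source": source, "text": fresh[cid], "vector": embed(fresh[cid])}
    return len(new_ids), len(stale_ids)
```

Calling `sync` again after editing one section of a document re-embeds only the chunks whose content changed, which is the alternative to "delete and re-vectorize everything". A real pipeline would swap the dict for a vector store client and the word count for the embedding model's tokenizer.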

2 comments

binarymax about 2 years ago
Cool project! It's not clear to me from the code where you are getting embeddings. Are all your embeddings coming from OpenAI? If so, that sounds expensive for personal use.
jn2clark about 2 years ago
Looks really interesting! Are you looking for more vector search integrations? We have one here: https://github.com/marqo-ai/marqo, which includes a lot of the transformation logic (including inference). If so, we can do a PR.