Show HN: RAG Large Data Pipeline with 1000 Models

2 points by zh2408 9 months ago
Hi, I'm a PhD in data + LLMs. I'm building an LLM chatbot for dbt.

The challenge isn't the LLM but the dbt pipelines, which are too large (e.g., >1K models) to fit in the context window.

Traditional vector RAG works well for text but poorly for SQL.

To solve this, we built a novel RAG approach that uses lineage.

I tested it on dbt projects with 1000+ models, and it works very well.

Some cool use cases such a chatbot handles well:

- Model discovery (I have a high-level question, which tables should I use?)
- Safe model edits (I want to edit a model, which downstream models are affected?)
- Model debugging (This column looks wrong, how is it computed upstream?)
- New pipeline prototyping (I want to add a new metric, how are similar metrics computed?)
- Natural language querying (I want to understand customers better, recommend some queries)
- Pipeline optimization (This model is slow, is there any inefficiency in the pipeline?)
- etc.

Live demo of RAG on the Shopify dbt project (built by Fivetran): https://cocoon-data-transformation.github.io/page/pipeline

Enter your question, and it will generate a response live (refresh the page for the latest messages).

Video demo: https://www.youtube.com/watch?v=kv5mwTkpfY0

Notebook to RAG your dbt project: https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_RAG_pipeline.ipynb

You'll need to provide LLM API access (Claude 3.5 strongly recommended) and a dbt project (only target/manifest.json is needed).

The project is open source: https://github.com/Cocoon-Data-Transformation/cocoon
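To make the lineage idea concrete, here is a minimal sketch of how the dependency graph in target/manifest.json could be walked to answer the "what is downstream?" and "how is it computed upstream?" questions. It is not Cocoon's actual implementation. It assumes only the `parent_map` and `child_map` sections that dbt writes into the manifest; the function name `walk_lineage` and the model names in the toy manifest are hypothetical.

```python
from collections import deque


def walk_lineage(edges, start):
    """Breadth-first walk over a dbt-style adjacency map.

    `edges` maps a node id to the ids directly connected to it, in the shape
    of the `child_map` (downstream) or `parent_map` (upstream) sections of
    manifest.json. Returns the reachable ids in visit order, excluding
    `start` itself.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Toy stand-in for a real manifest (hypothetical model names). With a real
# project you would instead load the file:
#   import json
#   manifest = json.load(open("target/manifest.json"))
manifest = {
    "parent_map": {
        "model.shop.stg_orders": [],
        "model.shop.stg_customers": [],
        "model.shop.orders": ["model.shop.stg_orders"],
        "model.shop.customer_ltv": [
            "model.shop.orders",
            "model.shop.stg_customers",
        ],
    },
    "child_map": {
        "model.shop.stg_orders": ["model.shop.orders"],
        "model.shop.stg_customers": ["model.shop.customer_ltv"],
        "model.shop.orders": ["model.shop.customer_ltv"],
        "model.shop.customer_ltv": [],
    },
}

# Safe model edits: which models are affected if stg_orders changes?
downstream = walk_lineage(manifest["child_map"], "model.shop.stg_orders")

# Model debugging: which models feed into customer_ltv?
upstream = walk_lineage(manifest["parent_map"], "model.shop.customer_ltv")

print(downstream)
print(upstream)
```

Only the SQL of the models returned by such a walk would then need to go into the prompt, which is one plausible way lineage keeps a 1000+ model project within the context window.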

no comments
