Hi, I'm a PhD in data + LLMs. I'm building an LLM chatbot for dbt.

The challenge isn't the LLM but the dbt pipelines: they are often too large (>1K models) to fit in the context window, and traditional vector RAG works well for prose but poorly for SQL.

To solve this, we built a novel RAG that retrieves by lineage instead of embedding similarity (there's a rough sketch of the idea at the end of this post). I've tested it on dbt projects with 1000+ models, and it works very well.

Some use cases such a chatbot handles well:
- Model discovery (I have a high-level question; which tables should I use?)

- Safe model edits (I want to edit a model; which downstream models are affected?)

- Model debugging (This column looks wrong; how is it computed upstream?)

- New pipeline prototyping (I want to add a new metric; how are similar metrics computed?)

- Natural language querying (I want to understand customers better; recommend some queries)

- Pipeline optimization (This model is slow; is there any inefficiency in the pipeline?)
- etc.

Live demo of the RAG on the Shopify dbt project (built by Fivetran): https://cocoon-data-transformation.github.io/page/pipeline

Enter your question, and it will generate a response live (refresh the page for the latest messages).

Video demo: https://www.youtube.com/watch?v=kv5mwTkpfY0

Notebook to RAG your own dbt project: https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_RAG_pipeline.ipynb

You'll need to provide an LLM API key (Claude 3.5 strongly recommended) and a dbt project (only target/manifest.json is needed).

The project is open source: https://github.com/Cocoon-Data-Transformation/cocoon
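To make the lineage idea concrete, here is a minimal sketch, not Cocoon's actual code: instead of ranking SQL files by vector similarity, walk the DAG stored in target/manifest.json (its parent_map/child_map and nodes entries, assuming the standard dbt manifest layout) and pull in the models connected to the one you're asking about. The model ID "model.my_project.orders" and the hops/max_models limits are made up for illustration.

  import json
  from collections import deque

  def load_manifest(path="target/manifest.json"):
      with open(path) as f:
          return json.load(f)

  def lineage_context(manifest, model_id, hops=2, max_models=20):
      """Collect models within `hops` lineage edges of `model_id`, both upstream and downstream."""
      parents = manifest.get("parent_map", {})
      children = manifest.get("child_map", {})
      seen, frontier = {model_id}, deque([(model_id, 0)])
      while frontier and len(seen) < max_models:
          node, depth = frontier.popleft()
          if depth == hops:
              continue
          # BFS over both directions of the lineage graph
          for nbr in parents.get(node, []) + children.get(node, []):
              if nbr not in seen:
                  seen.add(nbr)
                  frontier.append((nbr, depth + 1))
      # Return name + SQL for each retrieved model, to be packed into the LLM prompt
      nodes = manifest["nodes"]
      return [
          {"name": nodes[m]["name"],
           "sql": nodes[m].get("raw_code") or nodes[m].get("raw_sql", "")}
          for m in seen if m in nodes
      ]

  manifest = load_manifest()
  context = lineage_context(manifest, "model.my_project.orders", hops=2)

The point of retrieving by lineage is that the models most relevant to a question about a table are usually its direct ancestors and descendants, which vector similarity over SQL text tends to miss.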