Hi, I'm a PhD in data + LLMs. I'm building an LLM chatbot for dbt.

The challenge isn't the LLM but the dbt pipelines: they are often too large (>1K models) to fit in the context window, and traditional vector RAG works well for prose but poorly for SQL.

To solve this, we built a novel RAG that retrieves by lineage instead of embedding similarity (there's a rough sketch of the idea at the end of this post). I've tested it on dbt projects with 1000+ models, and it works very well.

Some use cases such a chatbot handles well:
- Model discovery (I have a high-level question; which tables should I use?)

- Safe model edits (I want to edit a model; which downstream models are affected?)

- Model debugging (This column looks wrong; how is it computed upstream?)

- New pipeline prototyping (I want to add a new metric; how are similar metrics computed?)

- Natural language querying (I want to understand customers better; recommend some queries)

- Pipeline optimization (This model is slow; is there any inefficiency in the pipeline?)
- etc.

Live demo of the RAG on the Shopify dbt project (built by Fivetran): https://cocoon-data-transformation.github.io/page/pipeline

Enter your question, and it will generate a response live (refresh the page for the latest messages).

Video demo: https://www.youtube.com/watch?v=kv5mwTkpfY0

Notebook to RAG your own dbt project: https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_RAG_pipeline.ipynb

You'll need to provide an LLM API key (Claude 3.5 strongly recommended) and a dbt project (only target/manifest.json is needed).

The project is open source: https://github.com/Cocoon-Data-Transformation/cocoon
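To make the lineage idea concrete, here is a minimal sketch, not Cocoon's actual code: instead of ranking SQL files by vector similarity, walk the DAG stored in target/manifest.json (its parent_map/child_map and nodes entries, assuming the standard dbt manifest layout) and pull in the models connected to the one you're asking about. The model ID "model.my_project.orders" and the hops/max_models limits are made up for illustration.

  import json
  from collections import deque

  def load_manifest(path="target/manifest.json"):
      with open(path) as f:
          return json.load(f)

  def lineage_context(manifest, model_id, hops=2, max_models=20):
      """Collect models within `hops` lineage edges of `model_id`, both upstream and downstream."""
      parents = manifest.get("parent_map", {})
      children = manifest.get("child_map", {})
      seen, frontier = {model_id}, deque([(model_id, 0)])
      while frontier and len(seen) < max_models:
          node, depth = frontier.popleft()
          if depth == hops:
              continue
          # BFS over both directions of the lineage graph
          for nbr in parents.get(node, []) + children.get(node, []):
              if nbr not in seen:
                  seen.add(nbr)
                  frontier.append((nbr, depth + 1))
      # Return name + SQL for each retrieved model, to be packed into the LLM prompt
      nodes = manifest["nodes"]
      return [
          {"name": nodes[m]["name"],
           "sql": nodes[m].get("raw_code") or nodes[m].get("raw_sql", "")}
          for m in seen if m in nodes
      ]

  manifest = load_manifest()
  context = lineage_context(manifest, "model.my_project.orders", hops=2)

The point of retrieving by lineage is that the models most relevant to a question about a table are usually its direct ancestors and descendants, which vector similarity over SQL text tends to miss.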