科技回声

5 comments

Oras, 5 months ago

One of the challenges I have with RAG is excluding tables of contents, headers/footers, and appendices from PDFs. Is there a tool or technique to achieve this? I’m aware that I can use LLMs to do so, or read all pages and find identical text (header/footer), but I want to keep the page number as part of the metadata to ensure better citation on retrieval.
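
The "find identical text" idea mentioned above can be sketched without an LLM while still keeping page numbers. This is a minimal sketch, not a definitive tool: it assumes the page texts have already been extracted by some PDF library into a list of strings, and the function name `strip_repeated_edges` and its parameters `edge_lines` and `min_fraction` are hypothetical names chosen for illustration. It does not attempt to detect tables of contents or appendices.

```python
import re
from collections import Counter


def _normalize(line):
    # Collapse digits so "Page 3 of 10" and "Page 4 of 10" compare equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_edges(pages, edge_lines=2, min_fraction=0.6):
    """Drop lines that repeat at the top/bottom of most pages.

    pages: list of page texts, where index 0 is page 1 (an assumption).
    Returns a list of {"page": n, "text": ...} dicts, so the page number
    travels with each cleaned page as citation metadata.
    """
    counts = Counter()
    split_pages = []
    for text in pages:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        split_pages.append(lines)
        edges = lines[:edge_lines] + lines[-edge_lines:]
        # Use a set so each normalized line counts once per page.
        counts.update({_normalize(ln) for ln in edges})

    # A line is treated as header/footer if it appears on enough pages.
    threshold = max(2, int(min_fraction * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    result = []
    for page_number, lines in enumerate(split_pages, start=1):
        kept = []
        for i, ln in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and _normalize(ln) in repeated:
                continue  # repeated header/footer line, skip it
            kept.append(ln)
        result.append({"page": page_number, "text": "\n".join(kept)})
    return result
```

Only lines near the top or bottom of a page are candidates for removal, so a sentence that happens to recur in the body text is left alone. Chunking can then run per page and copy the `page` value into each chunk's metadata.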

jonathan-adly, 5 months ago

I would strongly advise against learning RAG through LangChain. It is abstraction hell, and it will set you back thousands of engineering hours the moment you want to do something differently.

RAG is actually a very simple thing to do; there is just too much VC money in the space, and too many complexity merchants.

The best way to learn is outside of notebooks (the hard parts of RAG are all around the actual product), using as few frameworks as possible.

My preferred stack is FastAPI/numpy/redis. Simple as pie. You can swap redis for pgVector/Postgres when ready for the next complexity step.
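
To illustrate how little code the retrieval step needs in plain numpy, here is a minimal sketch of cosine-similarity search over chunk embeddings. It is not the commenter's actual code: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, `TinyIndex` is an illustrative name, and the FastAPI and redis layers are left out entirely.

```python
import zlib

import numpy as np


def toy_embed(text, dim=256):
    # Hypothetical stand-in for a real embedding model:
    # a hashed bag-of-words vector, deterministic across runs.
    vec = np.zeros(dim, dtype=np.float32)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


class TinyIndex:
    """In-memory vector index: one matrix, one dot product per query."""

    def __init__(self, chunks, embed=toy_embed):
        self.chunks = list(chunks)
        self.embed = embed
        matrix = np.stack([embed(c) for c in self.chunks])
        norms = np.linalg.norm(matrix, axis=1, keepdims=True)
        # Unit-normalize rows so a dot product equals cosine similarity.
        self.matrix = matrix / np.clip(norms, 1e-12, None)

    def search(self, query, k=3):
        q = self.embed(query)
        q = q / max(float(np.linalg.norm(q)), 1e-12)
        scores = self.matrix @ q
        top = np.argsort(-scores)[:k]  # indices of the k best scores
        return [(self.chunks[i], float(scores[i])) for i in top]
```

Swapping `toy_embed` for calls to a real embedding model, and the in-memory matrix for a store such as redis or pgVector, changes where the vectors live but not the shape of this logic.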

Jet_Xu, 5 months ago

Interesting discussion! While RAG is powerful for document retrieval, applying it to code repositories presents unique challenges that go beyond traditional RAG implementations. I've been working on a universal repository knowledge graph system, and found that the real complexity lies in handling cross-language semantic understanding and maintaining relationship context across different repo structures (mono/poly).

Has anyone successfully implemented a language-agnostic approach that can:

1. Capture implicit code relationships without heavy LLM dependency?
2. Scale efficiently for large monorepos while preserving fine-grained semantic links?
3. Handle cross-module dependencies and version evolution?

Current solutions like AST-based analysis + traditional embeddings seem to miss crucial semantic contexts. Curious about others' experiences with hybrid approaches combining static analysis and lightweight ML models.
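
For reference, the AST-based baseline mentioned above can be sketched as follows. This is a deliberately limited illustration: it handles Python source only (so it is not language-agnostic), it records explicit calls by name without resolving which definition they refer to, and `call_edges` is a hypothetical helper name. It shows the kind of relationship static analysis captures cheaply, and by omission the implicit ones it misses.

```python
import ast


def call_edges(source):
    """Return sorted (caller, callee) pairs found in Python source.

    Calls are attributed to every enclosing function definition, so a
    call inside a nested function is recorded for both the inner and
    the outer function. Module-level calls are ignored.
    """
    tree = ast.parse(source)
    edges = set()
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if not isinstance(node, ast.Call):
                continue
            target = node.func
            if isinstance(target, ast.Name):
                # Plain call such as load(path)
                edges.add((func.name, target.id))
            elif isinstance(target, ast.Attribute):
                # Method or attribute call such as data.strip();
                # only the final attribute name is kept.
                edges.add((func.name, target.attr))
    return sorted(edges)
```

Edges like these could serve as the static half of a hybrid approach, with embeddings or a lightweight model layered on top for relationships that never appear as an explicit call.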

krawczstef, 5 months ago

+1 for vanilla code without LangChain.

dmezzetti, 5 months ago

Thanks for sharing. If you want notebooks that do some of this with local open models: https://github.com/neuml/txtai/tree/master/examples and here: https://gist.github.com/davidmezzetti

Show HN: Open-Source Colab Notebooks to Implement Advanced RAG Techniques
