Hi HN, we're excited to share repo2vec: a simple-to-use, modular library that lets you chat with any public or private codebase. It's like GitHub Copilot, but with the most up-to-date information about your repo.

We made this because sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.

We tried to make it dead simple to use: with two scripts, you can index a repo and get a functional chat interface for it. Every generated response shows where in the code the context for the answer was pulled from.

We also made it plug-and-play: every component, from the embeddings to the vector store to the LLM, is completely customizable.

If you want to see the hosted version of the chat interface and its features, here's a demo: https://www.youtube.com/watch?v=CNVzmqRXUCA

We would love your feedback!

- Mihail and Julia
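To give a flavor of what we mean by plug-and-play, here is a rough sketch of the pipeline shape (the class and method names are illustrative only, not the library's exact API):

    # Illustrative sketch of a retrieval-augmented chat pipeline over a repo.
    # The names below are hypothetical, not repo2vec's actual interface.
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        path: str   # file the snippet came from (used for citations)
        text: str   # the code snippet itself

    class RepoChat:
        def __init__(self, embedder, vector_store, llm):
            # Each component is swappable: any embedder, store, or LLM
            # that satisfies these small interfaces will do.
            self.embedder = embedder
            self.store = vector_store
            self.llm = llm

        def index(self, chunks):
            # Script 1: embed every chunk and persist it.
            vectors = self.embedder.embed([c.text for c in chunks])
            self.store.add(vectors, chunks)

        def ask(self, question, k=5):
            # Script 2: retrieve the k nearest chunks, answer, cite sources.
            hits = self.store.search(self.embedder.embed([question])[0], k=k)
            context = "\n\n".join(f"# {c.path}\n{c.text}" for c in hits)
            answer = self.llm.complete(f"Context:\n{context}\n\nQ: {question}")
            return answer, [c.path for c in hits]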
Very useful! I was just thinking this kind of thing should exist!

I'd also like the LLM to know all of the documentation for any dependencies in the same way.
Very cool project, I'm definitely going to try this out. One question: why use the OpenAI embeddings API instead of BGE (BERT) or another embedding model that can run efficiently client-side? Was there a quality difference, or did you just default to OpenAI embeddings?
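For reference, here's roughly what I had in mind: a BGE model run locally via sentence-transformers (the specific model name below is just an example):

    # Local embeddings with a BGE model via sentence-transformers,
    # as an alternative to calling the OpenAI embeddings API.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # runs on CPU or GPU
    vectors = model.encode(
        ["def add(a, b):\n    return a + b"],
        normalize_embeddings=True,  # cosine similarity becomes a dot product
    )
    print(vectors.shape)  # (1, 384) for bge-small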
We have LLMs with context windows of hundreds of thousands of tokens, and prompt caching makes using them affordable. Why not just stuff the whole codebase into the context window?
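A quick back-of-envelope check makes the tradeoff concrete. Here's a sketch using tiktoken; the 200k-token threshold is just an example of a current context window, not a claim about any specific model:

    # Rough check: does a whole repo fit in one context window?
    # Counts only .py files here for illustration; 200_000 is an example limit.
    from pathlib import Path
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    total = 0
    for path in Path(".").rglob("*.py"):
        total += len(enc.encode(path.read_text(encoding="utf-8", errors="ignore")))
    print(f"{total} tokens; fits in a 200k window: {total <= 200_000}")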
I wonder if it will work on https://github.com/organicmaps/organicmaps

So far, two similar solutions I tested crapped out on non-ASCII characters, because Python's UTF-8 decoder is strict by default.
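For what it's worth, the fix on the indexing side is usually one line: read files with a lenient error handler instead of the strict default (the filename here is just an example):

    # Python's default is errors="strict", which raises UnicodeDecodeError
    # on any byte sequence that isn't valid UTF-8. A lenient handler
    # substitutes U+FFFD instead of crashing the indexer.
    with open("some_file.cpp", encoding="utf-8", errors="replace") as f:
        text = f.read()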