Hey! I wanted to share a tool I've been working on. It's still very early and a work in progress, but I've found it incredibly helpful when working with Claude and OpenAI's models.<p>What it does:
I created a Python script that dumps your entire Git repository into a single file. This makes it much easier to use with Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems.<p>Key Features:
- Respects .gitignore patterns
- Generates a tree-like directory structure
- Includes file contents for all non-excluded files
- Customizable file type filtering<p>Why I find it useful for LLM/RAG:
- Full Context: It gives LLMs a complete picture of my project structure and implementation details.
- RAG-Ready: The dumped content serves as a great knowledge base for retrieval-augmented generation.
- Better Code Suggestions: LLMs seem to understand my project better and provide more accurate suggestions.
- Debugging Aid: When I ask for help with bugs, I can provide the full context easily.<p>How to use it:
Example: python dump.py /path/to/your/repo output.txt .gitignore py js tsx<p>Again, it's still a work in progress, but I've found it really helpful in my workflow with AI coding assistants (Claude/OpenAI). I'd love to hear your thoughts and suggestions, or whether anyone else finds this useful!<p><a href="https://github.com/artkulak/repo2file">https://github.com/artkulak/repo2file</a><p>P.S. If anyone wants to contribute or has ideas for improvement, I'm all ears!
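For readers who want to see the general shape of such a tool, here is a minimal sketch of the idea described above. It is not the code from repo2file: `dump_repo` and its parameters are hypothetical names, and the ignore handling uses plain `fnmatch` name matching, which covers only a small part of real .gitignore semantics.

```python
import fnmatch
import os


def dump_repo(root, extensions, ignore_patterns=()):
    """Return a tree listing plus the contents of every kept file as one string.

    Assumption: ignore_patterns are simple fnmatch patterns applied to file and
    directory names, not full .gitignore rules.
    """
    tree_lines, bodies = [], []
    suffixes = tuple("." + ext for ext in extensions)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them.
        dirnames[:] = sorted(
            d for d in dirnames
            if d != ".git" and not any(fnmatch.fnmatch(d, p) for p in ignore_patterns)
        )
        rel_dir = os.path.relpath(dirpath, root)
        depth = 0 if rel_dir == "." else rel_dir.count(os.sep) + 1
        if rel_dir != ".":
            tree_lines.append("    " * (depth - 1) + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            if any(fnmatch.fnmatch(name, p) for p in ignore_patterns):
                continue
            # An empty extension list means "keep every file type".
            if suffixes and not name.endswith(suffixes):
                continue
            tree_lines.append("    " * depth + name)
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                bodies.append(f"File: {os.path.relpath(path, root)}\n{fh.read()}")
    return "\n".join(tree_lines) + "\n\n" + "\n\n".join(bodies)
```

Calling `dump_repo("/path/to/your/repo", ["py", "js", "tsx"], ["node_modules"])` returns the tree followed by each file's contents, which can then be written to a single output file.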
These are extremely common these days. Here are a few I've collected over the past few months:<p>- [files-to-prompt](<a href="https://github.com/simonw/files-to-prompt">https://github.com/simonw/files-to-prompt</a>) (from the GOAT simonw)<p>- [code2prompt](<a href="https://github.com/mufeedvh/code2prompt">https://github.com/mufeedvh/code2prompt</a>)<p>- <a href="https://gh-repo-dl.cottonash.com/" rel="nofollow">https://gh-repo-dl.cottonash.com/</a><p>- [1filellm](<a href="https://github.com/jimmc414/1filellm">https://github.com/jimmc414/1filellm</a>)<p>- [repopack](<a href="https://github.com/yamadashy/repopack">https://github.com/yamadashy/repopack</a>)<p>- [ingest](<a href="https://github.com/sammcj/ingest">https://github.com/sammcj/ingest</a>)<p>What makes yours better?
Take a look at what aider does to create a repo map using tree-sitter: <a href="https://aider.chat/docs/repomap.html" rel="nofollow">https://aider.chat/docs/repomap.html</a>
<a href="https://aider.chat/2023/10/22/repomap.html" rel="nofollow">https://aider.chat/2023/10/22/repomap.html</a><p>I guess the difference is that your script produces a complete copy, whereas aider uses a concise summary, which is necessary when the full repo won't fit in the context window.
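To illustrate the "concise summary" idea, here is a rough sketch that lists only the classes and functions in a Python file. This is not how aider works: aider's repo map is built with tree-sitter across many languages, while this stand-in uses Python's built-in `ast` module and handles Python source only. The `outline` function is a hypothetical name.

```python
import ast


def outline(source):
    """List top-level classes and functions, plus methods, with line numbers."""
    entries = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entries.append(f"def {node.name}(...)  # line {node.lineno}")
        elif isinstance(node, ast.ClassDef):
            entries.append(f"class {node.name}  # line {node.lineno}")
            # Only direct methods are listed; nested classes are skipped.
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    entries.append(f"    def {item.name}(...)  # line {item.lineno}")
    return entries
```

An outline like this is far smaller than the full file contents, at the cost of dropping the implementation details.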
This is a similar tool I wrote for myself called "ingest". It ingests files/directories into LLM-friendly Markdown, estimates token usage, and can estimate vRAM usage for different models and quantisations, showing a table of which quantisation, context size and k/v cache quantisation will fit in a given (v)RAM size. - <a href="https://github.com/sammcj/ingest">https://github.com/sammcj/ingest</a>
That's cool. I've used it. I'd add:<p>- treat '-' as stdout<p>- named arguments<p>- don't identify ignore files by checking that they start with '.', because that causes a local .gitignore to not be found and to be treated as an extension :)
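A possible shape for these suggestions, sketched with `argparse`. The flag names (`--output`, `--ignore-file`, `--ext`) are made up for illustration and are not part of the current script. With a named ignore-file argument there is no need to guess from a leading '.' whether a positional value is an ignore file or an extension.

```python
import argparse
import contextlib
import sys


def build_parser():
    # All flag names below are hypothetical; the real script takes positional arguments.
    parser = argparse.ArgumentParser(description="Dump a repository into one file.")
    parser.add_argument("repo", help="path to the repository")
    parser.add_argument("-o", "--output", default="-",
                        help="output file, or '-' for stdout")
    parser.add_argument("--ignore-file", default=".gitignore",
                        help="ignore file to read")
    parser.add_argument("-e", "--ext", action="append", default=[],
                        help="file extension to keep (repeatable)")
    return parser


@contextlib.contextmanager
def open_output(target):
    """Yield sys.stdout for '-', otherwise a file that is closed on exit."""
    if target == "-":
        yield sys.stdout
    else:
        with open(target, "w", encoding="utf-8") as fh:
            yield fh
```

Usage would look like `python dump.py /path/to/your/repo -o - --ignore-file .gitignore -e py -e js`, with the dump written through `with open_output(args.output) as out:`.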
I skimmed the README but did not see support for prefixing each line with line numbers. This is an absolute must-have for people like me who have a workflow centered around generating git patches. In my experience, without line numbers the generated patches are much more likely to be incorrect.
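Line-number prefixing is small enough to sketch here. The `number_lines` name and the `N| ` prefix format are assumptions for illustration, not something the tool currently offers.

```python
def number_lines(text, start=1):
    """Prefix every line with a right-aligned line number and a '| ' separator."""
    lines = text.splitlines()
    # Pad to the width of the largest line number so the columns line up.
    width = len(str(start + len(lines) - 1))
    return "\n".join(f"{i:>{width}}| {line}" for i, line in enumerate(lines, start))
```

For example, `number_lines("a\nb")` gives `1| a` and `2| b` on separate lines.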
Nice. I have a few suggestions:<p>Put each file's code block between 3 backticks at the beginning and 3 backticks at the end, since that's the default.<p>Remove the dashes to save tokens.<p>In the title for the code blocks, put the full relative path to the file, since some projects have many files with the same name.
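A sketch of what that per-file format could look like, assuming a hypothetical `format_file` helper; the exact layout is a guess at what the suggestions describe, not the tool's current output.

```python
import os

# Built at runtime so this example does not contain a literal fence of its own.
FENCE = "`" * 3


def format_file(root, path, content):
    """Return the file's repo-relative path as a title, then its fenced contents."""
    rel = os.path.relpath(path, root).replace(os.sep, "/")
    return f"{rel}\n{FENCE}\n{content.rstrip()}\n{FENCE}"
```

Using the path relative to the repo root as the title keeps files such as `src/utils.py` and `tests/utils.py` distinguishable, and there are no dashed separator lines spending tokens.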
Made a similar one that's not super polished - <a href="https://github.com/VVoruganti/repo-to-prompt">https://github.com/VVoruganti/repo-to-prompt</a>
Interesting! There was another Show HN that did this same thing earlier in the day!<p><a href="https://news.ycombinator.com/item?id=41480373">https://news.ycombinator.com/item?id=41480373</a>
Something like this that could automatically scrape a set of URLs into a file would also be useful for trying to learn how to use various terrible enterprise software applications (SAP).
Made one as well, with interactive selection and token counting:
<a href="https://github.com/3rd/promptpack">https://github.com/3rd/promptpack</a>
There is an API for this at <a href="https://txtrepo.com" rel="nofollow">https://txtrepo.com</a>
I used it with n8n to create PRs on issues
Seems like a common itch to scratch and a good tool to scratch it with. I created 'linusfiles' and 'grabout' as tools for this. Grabout copies the last input and error message or other output to the clipboard, and linusfiles copies the tracked files to the clipboard.<p>But I like the idea of tarballing it, as ndr_ suggested. I'm thinking that could be the move here.<p>In case anyone wants to see my workflows: <a href="https://github.com/atxtechbro/shell-tooling">https://github.com/atxtechbro/shell-tooling</a>