Am I missing something? From what I understood from Wolfram's description of GPT and "GPT in 60 lines of Python", a GPT model's only memory is the input buffer: 4k tokens for GPT3, more but still limited for GPT4.

To summarize the GPT inference process as I understood it, with GPT3 as the example:

1) The input buffer holds up to 4k tokens, drawn from a vocabulary of about 50k tokens. So the input is a vector of token ids; we can see it as a point in a high-dimensional space.

2) The core neural network is a pure function: for such an input point, it returns an output vector with one entry per vocabulary token. So here, a 50k-element vector, where each entry is the probability that the corresponding token is the next one.

The very important thing here is that the whole neural network is a pure function: same input, same output. With an immensely large, super-fast memory, this function could be implemented as a look-up table from an input point (the buffer) to an output probability vector. No memory, no side effects.

3) The probability vector is fed into a "next token" function. It doesn't just take the highest-probability token (boring result), but uses a "temperature" to randomize a bit while still respecting the output probabilities.

4) The chosen token is appended to the input buffer, keeping the total number of tokens the same (so the oldest tokens fall out). Go back to (1) until a "stop" token is selected at (3). (A rough sketch of this loop in Python is at the end of this comment.)

So in effect, the whole process is a function from a point to a point. A "point" here is the buffer seen as a high-dimensional vector, i.e. a point in a high-dimensional space. Generation is in effect a walk in this "buffer space". Prompting places the model somewhere in this space, with some semantic relation to the prompt's content (that's the magic part). Then generation is a walk through the space, with a purely deterministic part (2) and a bit of randomization (3) to make the trajectory (and its meaning, which is what we care about) more interesting to us.

So if this is correct, there is no point in injecting a lot of data into a GPT model: the output is determined entirely by what fits in the input buffer. Just input the last 4k tokens (for GPT3, more for GPT4) and you're done: everything earlier has effectively disappeared. So here, just input the last 4k tokens of a repo and save some money ;)

To avoid this limitation, one would have to summarize the previous input and make that summary part of the current input buffer. This is what chaining is all about, if I understood correctly (second sketch below). But I don't see chaining here.

Sooo... Am I missing something? Or is the author of this script the one missing something? I don't mind either way, but I'd appreciate some clarification from knowledgeable people ;)

Thanks
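
To make steps (1)-(4) concrete, here is a minimal sketch of the loop as I understand it. The model() function is just a stand-in (a real GPT would produce logits from the transformer, and a tokenizer would map text to ids); the names and constants are mine, the point is only to show the structure: a pure function from buffer to logits, temperature sampling, and a sliding window.

    import numpy as np

    VOCAB_SIZE = 50_000    # ~50k entries in the vocabulary
    CONTEXT_SIZE = 4_096   # the 4k-token input buffer of GPT-3

    def model(buffer):
        # Stand-in for the pure next-token function (step 2): one logit
        # per vocabulary entry, deterministic in the buffer contents.
        rng = np.random.default_rng(abs(hash(tuple(buffer))) % (2**32))
        return rng.normal(size=VOCAB_SIZE)

    def sample_next(logits, temperature=0.8):
        # Step 3: softmax with temperature, then sample instead of argmax.
        z = logits / temperature
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        return int(np.random.default_rng().choice(VOCAB_SIZE, p=probs))

    def generate(prompt_ids, max_new_tokens=20, stop_token=0):
        buffer = list(prompt_ids)[-CONTEXT_SIZE:]   # anything older is simply gone
        for _ in range(max_new_tokens):
            logits = model(buffer)                  # step 2: pure function of the buffer
            nxt = sample_next(logits)               # step 3: temperature sampling
            if nxt == stop_token:
                break
            buffer = (buffer + [nxt])[-CONTEXT_SIZE:]  # step 4: slide the window
        return buffer

    print(generate([11, 42, 7]))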
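
And here is the kind of "chaining" workaround I mean, again only a toy sketch: summarize() is a hypothetical helper (in practice it would itself be another call to the model), and I'm counting characters instead of tokens just to keep the example self-contained.

    def summarize(text):
        # Hypothetical helper: in a real chain this would be a model call
        # ("Summarize the following: ..."); here it just truncates so the
        # sketch runs on its own.
        return text[:500]

    def build_prompt(older_context, latest_chunk, budget_chars=16_000):
        # Rough chaining idea: whatever no longer fits in the window gets
        # replaced by a summary, and only summary + latest chunk are sent.
        if len(older_context) + len(latest_chunk) <= budget_chars:
            return older_context + latest_chunk
        return ("Summary of earlier context: " + summarize(older_context)
                + "\n" + latest_chunk)

    print(build_prompt("a" * 20_000, "the newest part of the repo"))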