Cool project! Just trying it out now. Does it support CUDA acceleration? I'm running it on a rather large project; it reports over 140k "tasks left in the queue", but nvidia-smi shows no GPU activity.
Looks very neat! Currently processing the repo I'm working on.

Can the generated database be shared within the team, so that not everyone has to run the initial processing of the repo? It looks like that will take a couple of hours on my laptop.
Neat AI app!

1. What feature extractor is used to derive the code embeddings?

2. Would support for more complex queries be useful inside the app? For example:

    -- Retrieve a subset of code snippets
    SELECT name
    FROM snippets
    WHERE file_name LIKE "%py" AND author_name LIKE "John%"
    ORDER BY
        Similarity(
            CodeFeatureExtractor(Open(query)),
            CodeFeatureExtractor(data)
        )
    LIMIT 5;
I've been test driving a similar one: https://github.com/sturdy-dev/semantic-code-search

But yours has a more permissive license! I also had to modify that one a bit to handle the line endings I needed; it frustratingly doesn't allow specifying a path, and it often returns tests instead of code.
My work has ten-ish repos we use, and it looks like this needs to be run inside a specific git repo. Is there a way to run the tool from a parent directory that contains all of our repos, with the same functionality?
Why not embed the names of functions and variables to form a vector, so the approach is language-agnostic? Are you limited by the language parser that extracts the names?
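In case it helps to sketch what that could look like: a minimal, hypothetical example that pulls identifiers with a regex and embeds the bag of names with sentence-transformers. The regex and function here are illustrative, not taken from the tool.

    # Hypothetical sketch: embed only identifier names, so the pipeline needs no
    # per-language parser. Names and regex are illustrative, not from the project.
    import re
    from sentence_transformers import SentenceTransformer

    IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}")  # crude identifier matcher

    def snippet_vector(source: str, model: SentenceTransformer):
        # Pull identifiers, split snake_case/camelCase into words, embed the bag of names.
        names = IDENT_RE.findall(source)
        words = [w.lower()
                 for n in names
                 for w in re.split(r"_|(?<=[a-z])(?=[A-Z])", n)
                 if w]
        return model.encode(" ".join(words))

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vec = snippet_vector("def load_user_profile(user_id): return db.get(user_id)", model)

The obvious trade-off is that anything carried only by syntax or comments is lost, but the same extractor works on any language the regex can tokenize.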
If the code doesn't contain comments, can it still work?

Will it generate code comments for indexing using a language model? Will that be expensive (assuming it uses GPT-3)?
Hey OP, this looks awesome!

I've done the same but was very disappointed with the stock sentence-embedding results. You can get any arbitrary embedding, but the cosine similarity used for nearest-neighbor lookup then gives a lot of false positives/negatives.

*There are two reasons:*

1. All embeddings from these models occupy a narrow cone of the total embedding space. Check out the cosine similarity of any two arbitrary strings: it'll be incredibly high, even between gibberish and sensible sentences.

2. The datasets these SentenceTransformers are trained on don't include much code, and certainly not intentionally. At least I haven't found a code-focused one yet.

*There are solutions I've tried, with mixed results:*

1. Embedding "whitening" forces all the embeddings to be nearly orthogonal, i.e. decorrelated. If you truncate the whitened embeddings and keep just the top-n eigenvalues, you get a sort of semantic compression that improves results (see the sketch after this comment).

2. Train a super-light neural net on your codebase's embeddings (takes seconds to train with a few layers) to improve nearest-neighbor results. I suspect this helps because it rebiases learning to distinguish just among your codebase's embeddings.

*There are solutions from the literature I am working on next that I find conceptually more promising:*

1. Chunk the codebase and, for each chunk, ask an LLM to "generate a question to which this code is the answer". Then do natural-language lookup on the question and return the code for it.

2. You have your code-lookup query. Ask an LLM to "generate a fabricated answer to this question". Then embed its answer and use that to do your lookup.

3. Use the AST of the code to further inform the embeddings.

I have this in my project UniteAI [1] and would love it if you cared to collaborate on improving it (either directly, or via your repo and then building a dependency on it into UniteAI). I'm actually trying to collaborate more, so this offer goes to anyone! I think the way the future of AI gets owned by *us* is through these local-first projects and building strong communities.

[1] https://github.com/freckletonj/uniteai
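For anyone curious, here's a rough numpy sketch of the whitening-plus-truncation idea described above. Function names and the keep_dims parameter are illustrative, based on the usual whitening recipe, not on either project.

    # Rough sketch of embedding whitening + truncation, not from the project itself.
    import numpy as np

    def fit_whitening(embeddings: np.ndarray, keep_dims: int):
        """Fit a whitening transform on (n, d) embeddings; keep the top `keep_dims` components."""
        mu = embeddings.mean(axis=0, keepdims=True)
        cov = np.cov((embeddings - mu).T)        # (d, d) covariance of centered embeddings
        u, s, _ = np.linalg.svd(cov)             # eigenvectors / eigenvalues, descending
        w = u @ np.diag(1.0 / np.sqrt(s))        # whitening matrix: decorrelate + rescale
        return mu, w[:, :keep_dims]              # truncation = the "semantic compression"

    def apply_whitening(embeddings: np.ndarray, mu: np.ndarray, w: np.ndarray):
        x = (embeddings - mu) @ w
        return x / np.linalg.norm(x, axis=1, keepdims=True)  # renormalize for cosine search

    # Usage: fit on the corpus embeddings, then transform corpus and query the same way.
    # mu, w = fit_whitening(corpus_vecs, keep_dims=256)
    # corpus_white = apply_whitening(corpus_vecs, mu, w)
    # query_white  = apply_whitening(query_vec[None, :], mu, w)

The key point is that the transform is fit once on the corpus and then applied to every query, so the decorrelation reflects your codebase rather than the model's generic training distribution.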
I would love to hook this up to a speech-recognition engine, via both commands and free dictation. I can see it being useful for navigating code semantically.
I'm looking forward to trying a little experiment with this: I'm going to run it on the Linux kernel tree, sight unseen and knowing nothing about the structure of the Linux kernel. Will it help me navigate it for the first time?

Edit: processing chunks; see you tomorrow...