Cool project! Just trying it out now. Does it support CUDA acceleration? I'm running it on a rather large project; it reports over 140k "tasks left in the queue", but nvidia-smi shows no GPU activity.
Looks very neat! Currently processing the repo I'm working on.

Can the generated database be shared within the team, so that not everyone has to run the initial processing of the repo? It looks like that will take a couple of hours on my laptop.
Neat AI app!

1. What feature extractor is used to derive the code embeddings?

2. Would support for more complex queries be useful inside the app? For example:

    -- Retrieve a subset of code snippets
    SELECT name
    FROM snippets
    WHERE file_name LIKE "%py" AND author_name LIKE "John%"
    ORDER BY
        Similarity(
            CodeFeatureExtractor(Open(query)),
            CodeFeatureExtractor(data)
        )
    LIMIT 5;
I've been test driving a similar one: https://github.com/sturdy-dev/semantic-code-search

But yours has a more permissive license! I also had to modify that one a bit to handle the line endings I needed; it frustratingly doesn't allow specifying a path, and it often returns tests instead of code.
My work has ten-ish repos we use, and it looks like this needs to be run inside a specific git repo. Is there a way to run the tool from a parent directory that contains all of our repos, with the same functionality?
Why not embed the names of functions and variables to form a vector, so the approach is language-agnostic? Are you limited by the language parser that extracts the names?
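In case it helps to sketch what that could look like: a minimal, hypothetical example that pulls identifiers with a regex and embeds the bag of names with sentence-transformers. The regex and function here are illustrative, not taken from the tool.

    # Hypothetical sketch: embed only identifier names, so the pipeline needs no
    # per-language parser. Names and regex are illustrative, not from the project.
    import re
    from sentence_transformers import SentenceTransformer

    IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}")  # crude identifier matcher

    def snippet_vector(source: str, model: SentenceTransformer):
        # Pull identifiers, split snake_case/camelCase into words, embed the bag of names.
        names = IDENT_RE.findall(source)
        words = [w.lower()
                 for n in names
                 for w in re.split(r"_|(?<=[a-z])(?=[A-Z])", n)
                 if w]
        return model.encode(" ".join(words))

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vec = snippet_vector("def load_user_profile(user_id): return db.get(user_id)", model)

The obvious trade-off is that anything carried only by syntax or comments is lost, but the same extractor works on any language the regex can tokenize.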
If the code doesn't contain comments, can it still work?

Will it generate code comments for indexing using a language model? Will that be expensive (assuming it uses GPT-3)?
Hey OP, this looks awesome!

I've done the same but was very disappointed with the stock sentence-embedding results. You can get any arbitrary embedding, but the cosine similarity used for nearest-neighbor lookup then gives a lot of false positives/negatives.

*There are two reasons:*

1. All embeddings from these models occupy a narrow cone of the total embedding space. Check out the cosine similarity of any two arbitrary strings: it'll be incredibly high, even between gibberish and sensible sentences.

2. The datasets these SentenceTransformers are trained on don't include much code, and certainly not intentionally. At least I haven't found a code-focused one yet.

*There are solutions I've tried, with mixed results:*

1. Embedding "whitening" forces all the embeddings to be nearly orthogonal, i.e. decorrelated. If you truncate the whitened embeddings and keep just the top-n eigenvalues, you get a sort of semantic compression that improves results (see the sketch after this comment).

2. Train a super-light neural net on your codebase's embeddings (takes seconds to train with a few layers) to improve nearest-neighbor results. I suspect this helps because it rebiases learning to distinguish just among your codebase's embeddings.

*There are solutions from the literature I am working on next that I find conceptually more promising:*

1. Chunk the codebase and, for each chunk, ask an LLM to "generate a question to which this code is the answer". Then do natural-language lookup on the question and return the code for it.

2. You have your code-lookup query. Ask an LLM to "generate a fabricated answer to this question". Then embed its answer and use that to do your lookup.

3. Use the AST of the code to further inform the embeddings.

I have this in my project UniteAI [1] and would love it if you cared to collaborate on improving it (either directly, or via your repo and then building a dependency on it into UniteAI). I'm actually trying to collaborate more, so this offer goes to anyone! I think the way the future of AI gets owned by *us* is through these local-first projects and building strong communities.

[1] https://github.com/freckletonj/uniteai
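For anyone curious, here's a rough numpy sketch of the whitening-plus-truncation idea described above. Function names and the keep_dims parameter are illustrative, based on the usual whitening recipe, not on either project.

    # Rough sketch of embedding whitening + truncation, not from the project itself.
    import numpy as np

    def fit_whitening(embeddings: np.ndarray, keep_dims: int):
        """Fit a whitening transform on (n, d) embeddings; keep the top `keep_dims` components."""
        mu = embeddings.mean(axis=0, keepdims=True)
        cov = np.cov((embeddings - mu).T)        # (d, d) covariance of centered embeddings
        u, s, _ = np.linalg.svd(cov)             # eigenvectors / eigenvalues, descending
        w = u @ np.diag(1.0 / np.sqrt(s))        # whitening matrix: decorrelate + rescale
        return mu, w[:, :keep_dims]              # truncation = the "semantic compression"

    def apply_whitening(embeddings: np.ndarray, mu: np.ndarray, w: np.ndarray):
        x = (embeddings - mu) @ w
        return x / np.linalg.norm(x, axis=1, keepdims=True)  # renormalize for cosine search

    # Usage: fit on the corpus embeddings, then transform corpus and query the same way.
    # mu, w = fit_whitening(corpus_vecs, keep_dims=256)
    # corpus_white = apply_whitening(corpus_vecs, mu, w)
    # query_white  = apply_whitening(query_vec[None, :], mu, w)

The key point is that the transform is fit once on the corpus and then applied to every query, so the decorrelation reflects your codebase rather than the model's generic training distribution.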
I would love to hook this up to a speech-recognition engine, via both commands and free dictation. I can see it being useful for navigating code semantically.
I'm looking forward to trying a little experiment with this: I'm going to run it on the Linux kernel tree, sight unseen and knowing nothing about the structure of the Linux kernel. Will it help me navigate it for the first time?

Edit: processing chunks; see you tomorrow...