Show HN: SeaGOAT – local, “AI-based” grep for semantic code search

240 points by kantord over 1 year ago

18 comments

GranPC over 1 year ago
Cool project! Just trying it out now - does it support CUDA acceleration? I'm running it on a rather large project and it claims it's got over 140k "tasks left in the queue", and I see no indicator of activity on nvidia-smi.
smoe over 1 year ago
Looks very neat! Currently processing the repo I'm working on.

Can the generated database be easily shared within the team, so that not everyone has to run the initial processing of the repo? It looks like it will take a couple of hours on my laptop.
jarulraj over 1 year ago
Neat AI app!

1. What feature extractor is used to derive code embeddings?

2. Would support for more complex queries be useful inside the app?

    -- Retrieve a subset of code snippets
    SELECT name
    FROM snippets
    WHERE file_name LIKE "%py"
      AND author_name LIKE "John%"
    ORDER BY Similarity(
        CodeFeatureExtractor(Open(query)),
        CodeFeatureExtractor(data)
    )
    LIMIT 5;
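The query above mixes ordinary metadata filters with a similarity ordering. As a rough illustration only (this is not how SeaGOAT works, and the table layout, the `similarity` function, and the toy two-dimensional embeddings are all assumptions made up for the example), the same shape of query can be run with Python's built-in `sqlite3` module by registering a cosine-similarity function:

```python
import json
import math
import sqlite3


def similarity(a_json, b_json):
    """Cosine similarity between two JSON-encoded vectors (hypothetical helper)."""
    a = json.loads(a_json)
    b = json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name "similarity".
conn.create_function("similarity", 2, similarity)

# Hypothetical schema: embeddings are stored as JSON text for simplicity.
conn.execute(
    "CREATE TABLE snippets "
    "(name TEXT, file_name TEXT, author_name TEXT, embedding TEXT)"
)
rows = [
    ("parse_args", "cli.py", "John Doe", json.dumps([1.0, 0.0])),
    ("load_config", "config.py", "John Smith", json.dumps([0.6, 0.8])),
    ("render", "view.js", "John Doe", json.dumps([1.0, 0.0])),
    ("connect", "db.py", "Jane Roe", json.dumps([1.0, 0.0])),
]
conn.executemany("INSERT INTO snippets VALUES (?, ?, ?, ?)", rows)

# Stand-in for the embedding of the user's search query.
query_embedding = json.dumps([1.0, 0.0])

cursor = conn.execute(
    "SELECT name FROM snippets "
    "WHERE file_name LIKE '%.py' AND author_name LIKE 'John%' "
    "ORDER BY similarity(embedding, ?) DESC "
    "LIMIT 5",
    (query_embedding,),
)
results = [row[0] for row in cursor]
print(results)
```

In this toy data only two rows pass both filters, and they come back ordered by similarity to the query vector. A real setup would compute the embeddings with an actual model and would likely use a vector index rather than scanning every row.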
artisanspam over 1 year ago
What are the limitations on what languages this supports?
jasonjmcghee over 1 year ago
I've been test driving a similar one: https://github.com/sturdy-dev/semantic-code-search

But yours has a more permissive license!

I also had to modify it a bit to allow for the line endings I needed. Frustratingly, it doesn't allow specifying a path, and it often returns tests instead of code.
hackncheese over 1 year ago
My work uses 10ish repos, and it looks like this needs to be run inside a specific git repo. Is there a way to run this tool in a parent directory that contains all of our repos and get the same functionality?
m3kw9 over 1 year ago
Why not embed names of functions and variables to form a vector so you are language agnostic? Are you limited by the language parser that embeds the names?
hollowpython over 1 year ago
Does anyone know a tool like this but for arbitrary PDFs?
eddywebs over 1 year ago
Cool beans! Does it work only with Python-based codebases, or can others use it too, like Java or C#?

Thank you for sharing.
la64710 over 1 year ago
Just curious, did you use any LLM to generate code for this? BTW, really awesome work!
retrofuturism over 1 year ago
This would make a useful (nvim) Telescope plugin. Looks super interesting.
billconan over 1 year ago
If the code doesn't contain comments, can it still work?

Will it generate code comments for indexing using a language model? Will that be expensive (assuming it uses GPT-3)?
ithkuil over 1 year ago
Interesting.

What would it take to support other programming languages?
nat0704 over 1 year ago
Nice! Will try this out
freckletonj over 1 year ago
Hey OP, this looks awesome!

I've done the same but was very disappointed with the stock sentence embedding results. You can get any arbitrary embedding, but then the cosine similarity used for nearest neighbor lookup gives a lot of false positives/negatives.

*There are 2 reasons:*

1. All embeddings from these models occupy a narrow cone of the total embedding space. Check out the cosine similarity of any 2 arbitrary strings. It'll be incredibly high, even between gibberish and sensible sentences!

2. The datasets these SentenceTransformers are trained on don't include much code, and certainly not intentionally. At least I haven't found a code-focused one yet.

*There are solutions I've tried with mixed results:*

1. Embedding "whitening" forces all the embeddings to be nearly orthogonal, meaning decorrelated. If you truncate the whitened embeddings and keep just the top n eigenvalues, you get a sort of semantic compression that improves results.

2. Train a super light neural net on your codebase's embeddings (takes seconds to train with a few layers) to improve nearest neighbor results. I suspect this helps because it rebiases learning to distinguish just among your codebase's embeddings.

*There are solutions from the literature I am working on next that I find conceptually more promising:*

1. Chunk the codebase, and ask an LLM on each chunk to "generate a question to which this code is the answer". Then do natural language lookup on the question, and return the code for it.

2. You have your code lookup query. Ask an LLM to "generate a fabricated answer to this question". Then embed its answer, and use that to do your lookup.

3. We use the AST of the code to further inform embeddings.

I have this in my project UniteAI [1] and would love if you cared to collab on improving it (either directly, or via your repo and then building a dependency to it into UniteAI).

I'm actually trying to collab more, so this offer goes to anyone! I think for the future of AI to be owned by us, we do that through these local-first projects and building strong communities.

[1] https://github.com/freckletonj/uniteai
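The "narrow cone" point in the comment above can be seen with toy numbers. The following is a minimal sketch in pure Python, assuming made-up four-dimensional vectors rather than real model embeddings. It only applies mean-centering, which is the first step of whitening, not the full whitening and truncation the commenter describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def center(vectors):
    """Subtract the per-dimension mean from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in vectors]


# Toy "embeddings" (invented for illustration): a large shared offset in the
# first two dimensions plus a small distinguishing part in the last two.
embeddings = [
    [10.0, 10.0, 1.0, 0.0],
    [10.0, 10.0, 0.0, 1.0],
    [10.0, 10.0, -1.0, 0.0],
    [10.0, 10.0, 0.0, -1.0],
]

# Before centering, the shared offset dominates and the vectors look alike.
raw = cosine(embeddings[0], embeddings[1])

# After centering, only the distinguishing components remain.
centered = center(embeddings)
after = cosine(centered[0], centered[1])

print(f"raw cosine similarity:      {raw:.3f}")
print(f"centered cosine similarity: {after:.3f}")
```

With these toy vectors the raw similarity is close to 1 even though the distinguishing components are orthogonal, and centering exposes that. Real embeddings will not separate this cleanly, which is presumably why the commenter reports mixed results.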
FloatArtifact over 1 year ago
I would love to plumb this up with a speech recognition engine via commands as well as free dictation. I can see this being useful for navigating code semantically.
nxobject over 1 year ago
I'm looking forward to playing a little experiment with this: I'm going to run this on the Linux kernel tree, sight unseen, and knowing nothing about the structure of the Linux kernel – will it help me navigate it for the first time?

Edit: processing chunks; see you tomorrow...
MisterTea over 1 year ago
Is the naming a coincidence or some sort of strange homage? Because I can't help thinking GOATsea.