Code search is hard

302 points by stevekrouse, about 1 year ago

41 comments

sqs, about 1 year ago

I'm at Sourcegraph (mentioned in the blog post). We obviously have to deal with massive scale, but for anyone starting out adding code search to their product, I'd recommend not starting with an index and just doing on-the-fly searching until that does not scale. It actually will scale well for longer than you think if you just need to find the first N matches (because that result buffer can be filled without needing to search everything exhaustively). Happy to chat with anyone who's building this kind of thing, including with folks at Val Town, which is awesome.
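
A minimal sketch of the "first N matches" idea above, assuming documents arrive as an iterable of (id, text) pairs. The names `Match`, `scan`, and `first_n` are illustrative, not anything from Sourcegraph. Because the scan is lazy, it stops reading documents as soon as the result buffer is full:

```python
from itertools import islice
from typing import Iterable, Iterator, NamedTuple


class Match(NamedTuple):
    doc_id: str
    line_no: int
    line: str


def scan(docs: Iterable[tuple[str, str]], needle: str) -> Iterator[Match]:
    """Lazily yield every line containing `needle`, one document at a time."""
    for doc_id, text in docs:
        for line_no, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield Match(doc_id, line_no, line.strip())


def first_n(docs: Iterable[tuple[str, str]], needle: str, n: int) -> list[Match]:
    """Collect at most n matches; documents after the nth match are never read."""
    return list(islice(scan(docs, needle), n))


docs = [
    ("a.ts", "const x = 1;\nconsole.log(x.toString());"),
    ("b.ts", "export function toString() {}\n// toString again"),
    ("c.ts", "never reached when n is small: toString"),
]
results = first_n(docs, "toString", 2)
```

With `n = 2` the third document is never touched, which is why this approach can hold up longer than expected when only the first page of results is needed.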

ayberk, about 1 year ago

It indeed is hard, and a good code search platform makes life so much easier. If I ever leave Google, the internal code search is for sure going to be the thing I miss the most. It's so well integrated into how everything else works (blaze target finding, guice bindings, etc.) that I can't imagine my life without it.

I remember to appreciate it even more every time I use GitHub's search. Not that it's bad; it's just inherently so much harder to build a generalized code search platform.

hiAndrewQuinn, about 1 year ago

Basic code searching seems like something new developers are never explicitly taught, but it is an absolutely crucial skill to build early on.

I guess the knowledge progression I would recommend would look something like this:

- Learning about Ctrl+F, which works basically everywhere.
- Transitioning to ripgrep (https://github.com/BurntSushi/ripgrep). I wouldn't even call this optional; it's truly an incredible and very discoverable tool. Requires keeping a terminal open, but that's a good thing for a newbie!
- Optional, but highly recommended: Learning one of the powerhouse command line editors. Teenage me recommended Emacs; current me recommends vanilla vim, purely because some flavor of it is installed almost everywhere. This is so that you can grep around and edit in the same window.
- In the same vein, moving back from ripgrep and learning about good old-fashioned grep, with a few flags that approximate how rg behaves: `grep -r` for recursive search (rg's default), `grep -ri` for case insensitive recursive search, and `grep -ril` for case insensitive recursive "just show me which files this string is found in" search. Some others too, season to taste.
- Finally hitting the wall with what ripgrep can do for you and switching to an actual indexed, dedicated code search tool.

bawolff, about 1 year ago

Surprised that hound (https://github.com/hound-search/hound) isn't mentioned. I thought it was the leader of open source solutions in this space.

I've been using Wikimedia's instance (https://codesearch.wmcloud.org/search/) and have generally been pretty happy with what it provides.

jillesvangurp, about 1 year ago

It's why IDE and developer tool builders have long had the insight that in order to do code search properly, you need to open up the compiler platform, as a lot of what you need to do boils down to reconstructing the exact same internal representations that a compiler would use. And of course good code search is the basis for refactoring support, auto completion, and other common IDE features.

Easier said than done of course, as tools are often an afterthought for compiler builders. Even JetBrains made this mistake with Kotlin initially, which is something they are partially rectifying with Kotlin 2.0 now to make it easier to support things like incremental compilation. The Rust community had this insight as well, with a big effort a few years ago to make Rust more IDE friendly.

IBM actually nailed this with Eclipse back in the day and that hasn't really been matched since then. IntelliJ never even got close to this, being 2-3 orders of magnitude slower. We're talking seconds vs. milliseconds here. Eclipse had a blazing fast incremental compiler for Java that could even partially compile code in the presence of syntax errors. The IDE's representation of that code was hooked into that compiler.

With Eclipse, you could introduce a typo and break part of your code and watch the IDE mark all the files that now had issues across your code base with red squiggles instantly. Fix the typo and the squiggles went away, also without any delay.

That's only possible if you have a mapping between those files and your syntax tree, which is exactly what Eclipse was doing because it was hooked into the incremental compiler.

IntelliJ was never able to do this. It will actively lie to you about things being fine/not fine until you rebuild your code, and it will show phantom errors a lot when its internal state gets out of sync with what's on disk. It often requires full rebuilds to fix this. If you run something, there's a several second lag while it compiles things.

Reason: the IDE internal state is calculated separately from the compiler and this gets out of sync easily. When you run something, it has to compile your code because it hasn't been compiled yet. That's often when you find out the IDE was lying to you about things being ready to run.

With Eclipse all this was instant and unambiguous because it shared the internal state with the compiler. If it compiled, your IDE would be error free; if it didn't, it wouldn't be. And it compiled incrementally and really quickly so you would know instantly. It had many flaws and annoying bugs but that's a feature I miss.

Macha, about 1 year ago

> This is a pretty bad index: it has words that should be stop words, like function, and won’t split a.toString() into two tokens because . is not a default word boundary.

So GitHub used to (maybe still does) "fix" this one and it's annoying. Although GitHub is ramping up its IDE-like find-usages, it's still not perfect, so sometimes you just want a plain text search for "foo.bar()" to catch all the uses it misses, and this token-splitting behaviour then finds every file where foo and bar are mentioned, which bloats results.

ricardobeat, about 1 year ago

I don't understand their hand-waving of Zoekt. It was built exactly for this purpose, and is not a "new infrastructure commitment" any more than the other options. The server is a single binary, the indexer is also a single binary; it can't get any simpler than that.

To me it doesn't make sense to be more scared of it than Elasticsearch...

ivanovm, about 1 year ago

One of the most interesting approaches to code search I've seen recently (no affiliation): https://github.com/pyjarrett/septum

The hardest part about getting code search right, imo, is grabbing the right amount of surrounding context, which septum is aimed at solving on a per-file basis.

Another one I'm surprised hasn't been mentioned is stack-graphs (https://github.com/github/stack-graphs), which tries to incrementally resolve symbolic relationships across the whole codebase. It powers GitHub's cross-file precise indexing and conceptually makes a lot of sense, though I've struggled to get the open source version to work.

chasil, about 1 year ago

Oracle has USER/ALL/DBA_SOURCE views, and all of the PL/SQL (SQL/PSM) code that has been loaded into the database is presented there. These are all cleartext visible unless they have been purposefully obfuscated.

The views have columns for the owner, object name, LINE [NUMBER], and TEXT [VARCHAR2(4000)], and you can use LIKE or regexp_like() on any of the retained source code.

I wonder if EnterpriseDB implements these inside of Postgres, and/or if they are otherwise available as an extension.

Since most of SQL/PSM came from Oracle anyway, these would be an obvious desired feature.

https://en.wikipedia.org/wiki/SQL/PSM

amarshall, about 1 year ago

> GitHub’s search is excellent

Is it? I find it near-useless most of the time, and cloning + ripgrep to be way more efficient. Perhaps the problem is more in the UX being awful than the actual search.

worldsayshi, about 1 year ago

I suppose using something like tree sitter to get a consistent abstract syntax tree to work with would be a good starting point. And then try building a custom analyzer (if using elasticsearch lingo) with that?

boyter, about 1 year ago

Code search is indeed hard. Stop words, stemming and such do rule out most off-the-shelf indexing solutions, but you can usually turn them off. You can even get around the splitting issues of things like `a.toString()` with some pre-processing of the content. However, where you really get into a world of pain is allowing someone to search for `ring` in the example (a substring of `toString`). You can use partial term search (prefix, infix, or suffix), but this massively bloats the index and is slow to run.

The next thing you try is trigrams, and suddenly you have to deal with false positive matches. So you add a positional portion to your index, and all of a sudden the underlying index is larger than the content you are indexing.

It's good fun though. For those curious about it I would also suggest reading posts by Michael Stapelberg (https://michael.stapelberg.ch/posts/), who writes about Debian Code Search (which I believe he started), in addition to the other posts mentioned here. Shameless plug: I also write about this at https://boyter.org/posts/how-i-built-my-own-index-for-searchcode/ where I go into some of the issues when building a custom index for searchcode.com

Oddly enough, I think you can go a long way brute forcing the search if you don't do anything obviously wrong. For situations where you are only allowed to search a small portion of the content, say just your own (which looks applicable in this situation), that's what I would do. Adding an index is really only useful when you start searching at scale or you are getting semantic search out of it. For keywords, which is what the article appears to be talking about, brute force is what I would be inclined to do.
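
The trigram false-positive problem described above can be sketched in a few lines. This is a toy in-memory index with illustrative names (`build_index`, `candidates`, `search`), not how searchcode.com is built. It has no positional data, so every candidate has to be verified against the actual text:

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each trigram to the ids of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def candidates(index: dict[str, set[str]], query: str) -> set[str]:
    """Documents containing every trigram of the query, in any position."""
    grams = trigrams(query)
    if not grams:
        raise ValueError("query must be at least 3 characters")
    return set.intersection(*(index.get(g, set()) for g in grams))


def search(docs: dict[str, str], index: dict[str, set[str]], query: str) -> list[str]:
    """Filter the candidates down to documents that really contain the query."""
    return sorted(d for d in candidates(index, query) if query in docs[d])


docs = {
    "a.js": "a.toString()",
    "b.py": "print(parsing)",  # has "rin" and "ing", but never "ring"
    "c.js": "let x = 1",
}
index = build_index(docs)
```

Here `candidates(index, "ring")` returns both `a.js` and `b.py`, while `search` keeps only `a.js`. The verification pass is the cost of skipping positional information.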

sdesol, about 1 year ago

> It’s hard to find any accounts of code-search using FTS

I'm actually going to be doing this soon. I've thought about code search for close to a decade, but I walked away from it, because there really isn't a business for it. However, now with AI, I'm more interested in using it to help find relevant context and I have no reason to believe FTS won't work. In the past I used Lucene, but I'm planning on going all in with Postgres.

The magic to fast code search (search in general) is keeping things small. As long as your search solution is context aware, you can easily leverage Postgres sharding to reduce index sizes. I'm a strong believer in "disk space is cheap, time isn't", which means I'm not afraid to create as many indexes as required to shave hundreds of milliseconds off searches.

campbel, about 1 year ago

> Sourcegraph’s maintained fork of Zoekt is pretty cool, but is pretty fearfully niche and would be a big, new infrastructure commitment.

I don't think Zoekt is as scary as this article makes it out to be. I set this up at my current company after getting experience with it at Shopify and it's really great.

fizx, about 1 year ago

There's a million paths, but here's one I like.

Use ElasticSearch. It will scale more than Postgres. Three hosted options are AWS, Elastic, and Bonsai. I founded Bonsai and retired (so am partial), but they will provide the best human support for you, and you won't have to worry about Java Xmx.

Your goal with ES is to use the Regex PatternAnalyzer to split the code into reasonable, exact, code-shaped tokens (not English words).

Here's a rough GPT4 explanation with sample config that I'd head towards: https://chat.openai.com/share/e4d08586-b7ef-48f2-9de1-7f82ea3c1f14
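
To make "code-shaped tokens" concrete, here is a rough illustration in plain Python. It is not Elasticsearch configuration, and the `TOKEN` pattern is an assumption about what a reasonable split might look like: identifiers, numbers, and each punctuation character kept as exact tokens, with no stemming or stop words.

```python
import re

# Identifiers, then numeric literals, then any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+(?:\.\d+)?|[^\sA-Za-z0-9_]")


def tokenize(code: str) -> list[str]:
    """Split source text into exact tokens, preserving case and punctuation."""
    return TOKEN.findall(code)


tokens = tokenize("x = a.toString(16)")
# tokens == ["x", "=", "a", ".", "toString", "(", "16", ")"]
```

A pattern-based analyzer would be configured to produce a similar token stream, so that `a` and `toString` are searchable separately while `function` is treated like any other token.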

jackbravo, about 1 year ago

Would LLM vector embeddings work in this context? I'm guessing they should since they are very good at understanding code.

kermatt, about 1 year ago

Are any of the tools mentioned in these comments better suited to searching SQL code, both DML and DDL?

We maintain a tree of files with each object in a separate "CREATE TABLE|VIEW|PROCEDURE|FUNCTION" script. This supports code search with grep, but something that could find references to an object when the name qualifications are not uniform would be very useful:

INSERT INTO table
INSERT INTO schema.table
INSERT INTO database.schema.table

This can all be done with regex, but search is not so easy for programmers new to regular expressions.
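
One way to hide the regex from newer programmers is a small helper that builds it from a table name. This is a sketch under simplifying assumptions: identifiers are unquoted (no brackets, backticks, or double quotes), and the helper name `table_reference_pattern` is made up for illustration.

```python
import re


def table_reference_pattern(table: str) -> re.Pattern:
    """Match INSERT INTO <table> with zero, one, or two qualifying prefixes."""
    ident = r"[A-Za-z_][A-Za-z0-9_]*"
    return re.compile(
        rf"\bINSERT\s+INTO\s+(?:{ident}\.){{0,2}}{re.escape(table)}\b",
        re.IGNORECASE,
    )


pattern = table_reference_pattern("orders")
lines = [
    "INSERT INTO orders VALUES (1)",
    "insert into sales.orders values (2)",
    "INSERT INTO warehouse.sales.orders VALUES (3)",
    "INSERT INTO orders_archive VALUES (4)",  # different table, should not match
]
hits = [line for line in lines if pattern.search(line)]
```

The first three lines match and the fourth does not, because the trailing `\b` requires the table name to end where the identifier ends.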

bch, about 1 year ago

Why am I not seeing anything here about ctags[0] or cscope[1]? Are they that out of fashion? cscope language comprehension appears limited to C/C++ and Java, but “ctags” (I think I use “uctags” atm) language support is quite broad and ubiquitous…

[0] https://en.wikipedia.org/wiki/Ctags

[1] https://en.wikipedia.org/wiki/Cscope

ethanwillis, about 1 year ago

There are tools from bioinformatics that would be more applicable here for code search than the ones linguistics has made for searching natural language.

herrington_d, about 1 year ago

Is it possible to combine n-grams and an AST to produce a better index?

Take `sourceCode.toString()` as an example: the AST can break it into `sourceCode` and `toString`. A further indexer can break `sourceCode` into `source` and `code`.

For AST dumping, a project like https://github.com/ast-grep/ast-grep can help.
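
A small sketch of the second step (breaking identifiers into sub-words). To stay self-contained it uses a regex as a stand-in for the AST pass, which is an assumption: a real implementation would take identifiers from a parser such as ast-grep or tree-sitter. `index_terms` is a hypothetical name.

```python
import re

# Stand-in for an AST walk: pull out anything that looks like an identifier.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# camelCase / PascalCase / acronym-aware sub-word splitter.
SUBWORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def index_terms(source: str) -> list[str]:
    """Return each identifier followed by its lowercased sub-words."""
    terms: list[str] = []
    for ident in IDENTIFIER.findall(source):
        terms.append(ident)
        parts = [p.lower() for p in SUBWORD.findall(ident)]
        if len(parts) > 1:
            terms.extend(parts)
    return terms


terms = index_terms("sourceCode.toString()")
# terms == ["sourceCode", "source", "code", "toString", "to", "string"]
```

Indexing both the full identifier and its parts lets a query for `code` find `sourceCode` without resorting to infix search over every token.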

peter_l_downs, about 1 year ago

Surprised not to see Livegrep [0] on the list of options. Very well-engineered technology; the codebase is clean (if a little underdocumented on the architecture side) and you should be able to index your code without much difficulty. Built with Bazel (~meh, but useful if you don't have an existing cpp toolchain all set up) and there are prebuilt containers you can run. Try that first.

By the way, there's a demo running here for the Linux kernel; you can try it out and see what you think: https://livegrep.com/search/linux

EDIT: by the way, "code search" is deeply underspecified. Before trying to compare all these different options, you really would benefit from writing down all the different types of queries you think your users will want to ask, including why they want to run that query and what results they'd expect. Building/tuning search is almost as difficult a product problem as it is an engineering problem.

[0] https://github.com/livegrep/livegrep

ectopasm83, about 1 year ago

> Lemmatization: some search indexes are even fancy enough to substitute synonyms for more common words, so that you can search for “excellent” and get results for documents including “great.”

This isn't what lemmatization is about.

Stemming the word ‘Caring’ would return ‘Car’. Lemmatizing the word ‘Caring’ would return ‘Care’.

nbenitezl, about 1 year ago

Also https://github.com/Debian/dcs

simonw, about 1 year ago

A feature I'd appreciate from Val Town is the ability to point it to a GitHub repo that I own and have it write the source code for all of my Vals to that repo, on an ongoing basis.

Then I could use GitHub code search, or even "git pull" and run ripgrep.

healeycodes, about 1 year ago

When a val is deployed on val town, my understanding is that it's parsed/compiled. At that point, can you save the parts of the program that people might search for? Names of imports, functions, variables, comments, etc.

thesuperbigfrog, about 1 year ago

OpenGrok (https://github.com/oracle/opengrok) is a wonderful tool to search a codebase.

It runs on-prem and handles lots of popular programming languages.

civilized, about 1 year ago

Is "hard" a bit of an overstatement for problems like "I'm using a library that mangles the query"? Couldn't you search for the literal text the user inputs? Maybe let them use regex?

hanwenn, about 1 year ago

Hi,

I wrote zoekt. From what I understand valtown does, I would try to use brute force first (i.e., something equivalent to ripgrep). Once that starts breaking down, you could use last-updated timestamps to reduce the brute force:

* make a trigram index using Zoekt or Hound for JS snippets older than X
* do brute force on snippets newer than X
* advance X as you're indexing newer data.

If the snippets are small, you can probably use a (trigram => snippets) index for space savings relative to a (trigram => offset) index.
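
The three steps above can be sketched as a toy in-memory structure. The names (`Snippet`, `HybridSearch`, `advance`) are illustrative and this is not Zoekt's or Hound's API. It uses a (trigram => snippets) index for anything at or before the cutoff X and a brute-force scan for anything newer, assuming positive integer timestamps:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Snippet:
    id: str
    text: str
    updated_at: int  # assumed to be a positive timestamp


def trigrams(text: str) -> set[str]:
    return {text[i:i + 3] for i in range(len(text) - 2)}


@dataclass
class HybridSearch:
    snippets: dict[str, Snippet] = field(default_factory=dict)
    index: dict[str, set[str]] = field(default_factory=lambda: defaultdict(set))
    cutoff: int = 0  # "X": snippets with updated_at <= cutoff are indexed

    def add(self, snippet: Snippet) -> None:
        self.snippets[snippet.id] = snippet

    def advance(self, new_cutoff: int) -> None:
        """Index everything updated in (cutoff, new_cutoff], then move X forward."""
        for s in self.snippets.values():
            if self.cutoff < s.updated_at <= new_cutoff:
                for gram in trigrams(s.text):
                    self.index[gram].add(s.id)
        self.cutoff = new_cutoff

    def search(self, query: str) -> list[str]:
        grams = trigrams(query)
        if not grams:
            raise ValueError("query must be at least 3 characters")
        indexed = set.intersection(*(self.index.get(g, set()) for g in grams))
        fresh = {s.id for s in self.snippets.values() if s.updated_at > self.cutoff}
        # Verify every candidate against the current text (handles false positives
        # and stale index entries for snippets edited after they were indexed).
        return sorted(i for i in indexed | fresh if query in self.snippets[i].text)


hs = HybridSearch()
hs.add(Snippet("old", "a.toString()", 10))
hs.add(Snippet("other", "print(parsing)", 10))
hs.add(Snippet("new", "x.toString(2)", 50))
hs.advance(20)  # "old" and "other" are indexed; "new" is still brute-forced
```

One thing this sketch glosses over: entries for edited snippets are never removed from the index, so a real implementation would need periodic rebuilds or deletions.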

nox101, about 1 year ago

Any nuggets here?

https://github.blog/2023-02-06-the-technology-behind-githubs-new-code-search/

metalrain, about 1 year ago

I think you need to parse the code and build an AST to make a good search. Even then, normalizing over different aliases may not be simple.

semiquaver, about 1 year ago

Be careful with trigram indexes. At least in the postgres 10 era they caused severe index bloat for frequently updated tables.

nojvek, about 1 year ago

You can do to_tsvector “plain” and keep the strings intact. No lemmatization, no stemming.

We use plain tsvectors on a GIN index and change the queries to allow prefix-based searching. So “wo he” matches “hello world”.

Perhaps I should write a blog post about it. It took me a few days of reading the PG documentation to get where we are.

The only thing it doesn’t handle is typo tolerance.
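
A guess at what the query rewriting might look like, assuming the “plain” vectors correspond to Postgres's built-in `simple` text search configuration and that prefix matching uses the `:*` suffix that `to_tsquery` supports. The helper name, the table `vals`, and the column `tsv` are hypothetical.

```python
import re


def prefix_tsquery(user_input: str) -> str:
    """Turn free text into an AND of prefix terms, e.g. 'wo he' -> 'wo:* & he:*'."""
    # Keeping only word characters also strips tsquery operators from user input.
    words = re.findall(r"\w+", user_input.lower())
    return " & ".join(f"{w}:*" for w in words)


query = prefix_tsquery("wo he")

# Hypothetical usage with a parameterized query:
#   SELECT id FROM vals WHERE tsv @@ to_tsquery('simple', %s)
# with `query` passed as the parameter.
```

The query `wo:* & he:*` matches a document whose lexemes include `world` and `hello`, in either order, which is the behaviour described in the comment.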

philippemnoel, about 1 year ago

ParadeDB founder here. We'd love to be supported on Render, if the Render folks are open to it...

johnthescott, about 1 year ago

the rum index has worked well for us on roughly 1TB of pdfs. written by postgrespro, same folks who wrote core text search and json indexing. not sure why rum is not in core. we have no problems.

https://github.com/postgrespro/rum

727564797069706, about 1 year ago

If you're serious about scaling up, definitely consider Vespa (https://vespa.ai).

At serious scale, Vespa will likely knock all the other options out of the park.

jessemhan, about 1 year ago

Good scalable codebase search is tough. We built a scalable, fast, and super simple solution for codebase semantic search: https://phorm.ai

louiskw, about 1 year ago

https://github.com/BloopAI/bloop is fully open source and has full text + regex search built on tantivy, fyi

pomdtr, about 1 year ago

Hey! I'm a val.town fanboy and I immediately thought about a workaround while reading the blog post:

What if I dumped every public val into GitHub, in order to be able to use their (awesome) search?

So here is my own "Val Town Search": https://val-town-search.pomdtr.me

And here is the repo containing all vals, updated hourly thanks to a GitHub action: https://github.com/pomdtr/val-town-mirror

skybrian, about 1 year ago

It seems like some of their gists have documentation attached and maybe that’s enough? I’m not sure I’m all that interested in seeing undocumented gists in search results.

reeyadalli, about 1 year ago

I have never actually given much thought to the difference between searching code and searching normal "literature". Interesting read!!

IshKebab, about 1 year ago

I would use Hound.