I use their new code search a lot to grok how people use certain features, or implement certain things. But I do wish there was a way to filter out forks. Sometimes I search a string and just get a bunch of forks all with the same result. For example, searching a common class in a Rails app often just shows a bunch of rails/rails forks, which is a lot of noise to sift through when you're trying to see how devs commonly use a certain feature.
> Just use grep?
First though, let’s explore the brute force approach to the problem. We get this question a lot: “Why don’t you just use grep?” To answer that, let’s do a little napkin math using ripgrep on that 115 TB of content. On a machine with an eight core Intel CPU, ripgrep can run an exhaustive regular expression query on a 13 GB file cached in memory in 2.769 seconds, or about 0.6 GB/sec/core.<p>But you don't NEED to do this, do you? I'm ALREADY in a repository; I just don't want to check out, say, all of WebKit, I just need to find where a specific reference is defined.<p>Maybe, on a really serious day, I need to search an entire organization. But hardly ever.<p>I have never, in over a decade, wanted sophisticated symbolic searching from GitHub code search. I just need remote grep.<p>Why is code search not split into this 99% use case, plus the occasional global repository search, which can behave entirely differently?
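The napkin math in the quoted excerpt can be pushed one step further (a naive extrapolation of the blog's own numbers: decimal units, perfect parallelism, everything cached in memory):

```python
# Extrapolating ripgrep's measured throughput (from the excerpt above)
# to the full 115 TB corpus. Naive: assumes perfect scaling and no I/O cost.
corpus_gb = 115 * 1000            # 115 TB in GB (decimal)
throughput = 0.6                  # GB/sec/core, as measured
cores = 8

seconds = corpus_gb / (throughput * cores)
print(f"one exhaustive query: ~{seconds:,.0f} s (~{seconds / 3600:.1f} hours)")
```

That comes out to hours per query on a single machine, which is the blog's argument for building an index; whether you ever actually need a corpus-wide query is, as the comment says, another matter.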
My beef with GitHub's code search is that it doesn't distinguish between the definition of a symbol and the uses of the symbol, so you need to wade through 5 pages of results to get the one result you're looking for. I would contrast that to my IDE which usually scores a direct hit if I enter a search in the right box.<p>The indexing they talk about in that article seems like rearranging the deck chairs on the Titanic so far as that is concerned.
This is exciting! I see a lot of familiar pieces here that propagated from Google's Code Search, and I know a few people from Code Search went to GitHub, probably specifically to work on this. I always wondered why GitHub didn't invest in decent code search features, but I'm happy it's finally getting to the state of the art one step at a time. Some of the folks who went to GitHub to work on this are just incredible, and I have no doubt GitHub's code search will be amazing.<p>I also worked on something similar to the search engine described here, for the purpose of making auto-complete fast for C++ in Clangd. That was my intern project back in 2018 and it was very successful in reducing the latency of the auto-complete pipeline. That project was a lot of fun and was also based on Russ Cox's original Google Code Search trigram index. My implementation of the index is still largely untouched and is a hot path in Clangd. I made a huge effort to document it as much as I could, and the code is, I believe, very readable (although I'm obviously biased because I spent a lot of time with it).<p>Here is the implementation:<p><a href="https://github.com/llvm/llvm-project/tree/main/clang-tools-extra/clangd/index/dex">https://github.com/llvm/llvm-project/tree/main/clang-tools-e...</a><p>I also wrote a... very long design document about how exactly this works, so if you're interested in understanding the internals of a code search engine, you can check it out:<p><a href="https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiGmJ2jL1UwV91Kyx11gOI/edit" rel="nofollow">https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiG...</a>
As a comparison to Sourcegraph: Sourcegraph shards and indexes a repository at a time, and uses trigrams and bloom filters (to skip shards).<p>Github shards and indexes individual files according to their hashes. It also uses variable length ngrams (neat!). This makes horizontal scaling simpler, but also means more of the index needs to be scanned for org/repo-scoped queries ("Due to our sharding strategy, a query request must be sent to each shard in the cluster.").
The sparse grams solution to deal with stupidly common ngrams such as "for" or "tes" is very interesting.<p>I’d love to see more discussion of how they deal with the false positives, though. It looks like a positional index is being used to achieve this, but that usually blows out your index size.<p>Additional information about deduplication would be especially interesting to me as well. It seems to solve this quite well: I usually search for jQuery to test this, and it does not return multiple copies of different versions, which is a good indicator that it’s slightly fuzzy.<p>What I find really interesting about all the code search engines I know of is that each one implemented its own index. Nobody is using off-the-shelf software for this. I suspect that might be down to no off-the-shelf software providing a decent enough solution, and none providing a solution that scales at a decent cost.<p>I did a small comparison of GitHub code search a while ago <a href="https://twitter.com/boyter/status/1480667185475244036?s=61&t=nfND46d9rReCju7-aw457Q" rel="nofollow">https://twitter.com/boyter/status/1480667185475244036?s=61&t...</a> But I should note a lot has improved since then, and it looks like Sourcegraph now also defaults to AND-ing terms rather than exact match, so my complaints there are resolved.<p>Impressive work by GitHub. I am sure some of the people behind it will read this comment, so let me say well done to you all. I am very impressed. Also, please post more information like this; there is so little out there.
I really hope they release this soon and that it’s actually good.<p>The current search sucks ass, you can’t find anything.<p>I was trying to search for something in the WebKit source the other day and I had to use Sourcegraph because the GitHub search gave me zero results.
Hey everyone, I'm Colin from GitHub's code search team: happy to answer any questions people have about it. Also, you can sign up to get access here: <a href="https://github.com/features/code-search">https://github.com/features/code-search</a>
I really appreciate that this includes details about how search permissions work - how they ensure that search results include data from my private repos.<p>I'd always wondered how they implemented that: it turns out they add extra internal filters to their searches along the lines of "RepoIDs(...) or PublicRepo".<p>Question for the team: Do you have an additional permission check in the view layer before the results are shown to the end-user? I worry that if I switch a repo from public to private it may take a while for the code search index to catch up to the new permissions.
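A minimal sketch of what that query rewriting might look like (all names here are illustrative, not GitHub's actual API; only the filter shape follows the "RepoIDs(...) or PublicRepo" description above):

```python
def add_permission_filter(query, readable_repo_ids):
    # Rewrite a query so every result must come from a repo the user can
    # read, OR from any public repo. Hypothetical filter-tree shape.
    return {
        "and": [
            query,
            {"or": [
                {"repo_id_in": readable_repo_ids},
                {"public": True},
            ]},
        ]
    }

def matches(doc, q):
    # Tiny evaluator for the filter tree above.
    if "and" in q:
        return all(matches(doc, sub) for sub in q["and"])
    if "or" in q:
        return any(matches(doc, sub) for sub in q["or"])
    if "repo_id_in" in q:
        return doc["repo_id"] in q["repo_id_in"]
    if "public" in q:
        return doc["public"] == q["public"]
    if "term" in q:
        return q["term"] in doc["text"]
    return False

docs = [
    {"repo_id": 1, "public": True,  "text": "fn main"},
    {"repo_id": 2, "public": False, "text": "fn main"},  # private, not mine
    {"repo_id": 3, "public": False, "text": "fn main"},  # private, mine
]
q = add_permission_filter({"term": "main"}, readable_repo_ids={3})
print([d["repo_id"] for d in docs if matches(d, q)])  # repo 2 filtered out
```

A second check at render time, as the question suggests, would guard against index staleness after a public-to-private flip; whether GitHub does that isn't stated in the post.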
I’ve been using the new code search for a couple of months and I like it, but the UI is kind of antagonistic to how I typically want to search for things. For one, the new experience doesn’t actually load code onto the page, it does some sort of lazy loading thing as you scroll around, so ⌘F doesn’t work. I understand that there’s a custom search box to try to get around this but it’s pretty slow and fiddly and I don’t really want to use it. I also find the layout to be pretty annoying, because invariably there’s a symbol panel on the side that doesn’t work for the code I want to look at, and then it’s just there taking space. If I hit “t” to enter a file name and start typing the text field loses focus after a second and I need to click on it again. I know there are a couple of people on the team in this thread: I search a lot of code on GitHub and I feel like there’s a couple of tweaks that would greatly improve my experience. Like, I think I could even show you a video of all the places where the UI has gotten less usable for me. What would be the best way to get this feedback to you? I’ve posted stuff on the forum or whatever but it’s unclear to me if this is the intended way to raise issues.
I really like the new search, though sometimes it is a bit deceptive. E.g. when jumping to a function name by clicking on a piece of code, suddenly you are in an entirely different code base with an unrelated function that happens to share the name.<p>It feels like GitHub code browsing is a step between a full editor with LSP and a static site. I hope they work out the kinks and make it more smooth
This is a great intro / overview of full-text search for those wondering how to build their own search engine.<p>It's a great 101-level exercise to write an inverted index implementation (you can do it in an afternoon), and then expand it to a leaf/aggregator architecture in follow-up exercises.
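The afternoon version of that exercise might look like this (whitespace tokenization, AND semantics over posting sets; a toy sketch, not what GitHub does):

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: map each token to the set of doc IDs containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.split():
            index[token].add(doc_id)
    return index

def search(index, query):
    # AND semantics: intersect the posting sets of every query token.
    tokens = query.split()
    if not tokens:
        return set()
    result = index.get(tokens[0], set()).copy()
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = ["the quick brown fox", "the lazy dog", "quick dog"]
idx = build_index(docs)
print(sorted(search(idx, "quick dog")))  # only doc 2 has both tokens
```

The follow-up exercises are then sharding `docs` across leaves and merging per-leaf results in an aggregator.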
With current search, I can search [0] the Django repo for a class that definitely exists [1] in Django, there are 0 code results. Zero. GitHub search is mystifyingly bad, I hope this is a LOT better.<p>[0] <a href="https://github.com/django/django/search?q=DeleteView&type=code">https://github.com/django/django/search?q=DeleteView&type=co...</a><p>[1] <a href="https://github.com/django/django/blob/main/django/views/generic/edit.py#L268">https://github.com/django/django/blob/main/django/views/gene...</a>
I just want to say thank-you to the folks who work on Code Search at GitHub.<p>It's the number one way I research and understand new libraries/API's and programming languages.<p>There's a lot more you can learn from usage in the wild than tutorial posts sometimes.
Was looking for more details on the data structure 'Geometric filter' mentioned in the footnotes.
Couldn't find anything (a few unrelated papers on object recognition aside). If anybody can share anything, that would be great!
Damn, it's about time, the current search sucks. What I have found to work very well is SourceGraph; they offer search for public repos. Maybe this'll be an alternative to it.
I wish they provided short names for their filters. For example: instead of "withContext language:python path:tests", I could write "withContext l:python p:tests".
Blackbird being written in Rust is a natural approach. Trying to build the whole thing with a single technology is unwise (looking at you, isomorphic JavaScript).
My biggest feature request would be sorting or filtering by code/commit/repo age, or even repo stars.<p>Most often I end up using code search for figuring out where a piece of code originated, just to find thousands of random projects that have also copied the same code verbatim. Sorting for "relevance" or "latest/oldest indexed" are equally useless.
In general, I really recommend code search as a tool for supplementing reading the documentation and source code of your dependencies (you <i>are</i> reading the source code, right?). I reach for it almost every day, and I find it's a reliable tool for identifying "the right way" to use a library, especially one that isn't fully documented.
Not to diminish this excellent work, but:<p>1) I never want to search all repos globally. At worst I want to search all of my org's repos.<p>2) the search UI is a little clunky, in a way I'd need to be using it again to remember.<p>Between those two I think there's loads of progress to be made outside of raw search power. Of course it's nice to have that, but that's what I'm really after.
If you ever want to search binary files (image, video, pdf, etc.) within github repos: <a href="https://learn.mixpeek.com/github-search/" rel="nofollow">https://learn.mixpeek.com/github-search/</a>
So in the sparse grams explanation, what are the bigram weights?<p>Is it inverse frequency, so common bigrams get split last? And the goal is to be able to search on a larger gram that covers the more common trigrams as often as possible?
This looks delightful!<p>One nit I have about current search: I’ll look something up and find I’m getting results for some obtuse commit in some old branch somewhere. I’d like to be able to optionally say “latest commit on branches only please” or “main branch only please.”<p>Another thing, which might betray that I don’t understand search all that well: language-aware searching that knows, for example, that a single or a double quote are syntactically interchangeable. Don’t omit half the results because I used one quote over the other when looking up `interpolation = 'nearest'`
Will this allow for a happy closure of this question about searching partial words? [0]<p>Like searching for "OPTION" and getting "-DOPTION=TRUE" among the results. Very commonly needed to find all usages of a flag, even instances where the flag is being passed to (at least, that I know of) CMake and Meson.<p>[0]: <a href="https://stackoverflow.com/questions/43891605/search-partial-words-in-github-organizations-code" rel="nofollow">https://stackoverflow.com/questions/43891605/search-partial-...</a>
> Shard by Git blob object ID which gives us a nice way of evenly distributing documents between the shards while avoiding any duplication. There won’t be any hot servers due to special repositories and we can easily scale the number of shards as necessary.<p>What exactly do they mean by "special repositories" here?
The biggest problems I have with their code search are basic usability features, not the search itself. I need a way to exclude private repositories in the result so I’m not clogged by internal instances of what I’m looking for. I need the UI to improve so I don’t have to go to advanced search for every filter I want to do.
Interesting stuff, but I was curious how they search repeated letters through the ngram index. I understand their example search with the string “limits” (find the intersection of “lim”, “imi”, “mit” and “its”). However, if the user wants to search the string “aaaaa”, how would they go about that?
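For what it's worth, in a classic trigram scheme (e.g. Russ Cox's design, which the post builds on) such a query just yields fewer distinct trigrams, and the index only ever produces candidates that a verification scan then confirms. A toy sketch, not GitHub's implementation:

```python
def trigrams(s):
    # All distinct trigrams of s; "aaaaa" collapses to just {"aaa"}.
    return {s[i:i + 3] for i in range(len(s) - 2)}

def search(docs, query):
    qgrams = trigrams(query)
    # Index lookup: a candidate must contain every query trigram...
    candidates = [d for d in docs if qgrams <= trigrams(d)]
    # ...then a scan of the candidates removes false positives.
    return [d for d in candidates if query in d]

docs = ["aaab", "xaaaaay", "aabaa"]
print(sorted(trigrams("aaaaa")))   # ['aaa']
print(search(docs, "aaaaa"))       # "aaab" matched "aaa" but fails the scan
```

So repeated letters cost nothing special at lookup time; they just shift more work onto the post-filter scan.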
Search is a fascinating topic because it's such a fundamental problem and every search engine is based around the same extremely simple data structure (Posting list/inverted index). Despite that, search isn't easy and every search engine seems to be quite unique. It also seems to get exponentially harder with scale.<p>You can write your own search engine that will perform very well on a surprisingly large amount of data, even doing naive full-text search. A search tool I came across a while back is a great example of something at that scale: <a href="https://pagefind.app/" rel="nofollow">https://pagefind.app/</a>.<p>For anyone who doesn't know anything about search I highly recommend reading this (It's mentioned in the blog post as well): <a href="https://swtch.com/~rsc/regexp/regexp4.html" rel="nofollow">https://swtch.com/~rsc/regexp/regexp4.html</a>.<p>Algolia also has a series of blog posts describing how their search engine works: <a href="https://www.algolia.com/blog/engineering/inside-the-algolia-engine-part-1-indexing-vs-search/" rel="nofollow">https://www.algolia.com/blog/engineering/inside-the-algolia-...</a>.<p>---<p>It's interesting that GitHub seems to have quite a few shards. Algolia basically has a monolithic architecture with 3 different hosts which replicate data and they embed their search engine in Nginx:<p><i>"Our search engine is a C++ module which is directly embedded inside Nginx. So when the query enters Nginx, we directly run it through the search engine and send it back to the client."</i><p>I'm guessing GitHub probably doesn't store repos in a custom binary format like Algolia does though:<p><i>"Each index is a binary file in our own format. 
We put the information in a specific order so that it is very fast to perform queries on it."</i><p><i>"Our Nginx C++ module will directly open the index file in memory-mapped mode in order to share memory between the different Nginx processes and will apply the query on the memory-mapped data structure."</i><p><a href="https://stackshare.io/posts/how-algolia-built-their-realtime-search-as-a-service-product" rel="nofollow">https://stackshare.io/posts/how-algolia-built-their-realtime...</a><p>100ms p99 seems pretty good, but I'm curious what the p50 is and how much time is spent searching vs ranking. I've seen Dan Luu say that the majority of time should be spent ranking rather than searching, and when I've snooped on <a href="https://hn.algolia.com" rel="nofollow">https://hn.algolia.com</a> I've seen single-digit millisecond search times in the responses, which seems to corroborate this.<p>I'm curious why they chose to optimize ingestion when it only took 36hrs to re-index the entire corpus without optimizations. A 50% speedup is nice, but 36hrs and 18hrs are the same order of magnitude, and it sounds like there was a fair amount of engineering effort put into this. An index 1/5 of the size is pretty sweet though; I have to assume that's a bigger win than 50% faster ingestion.<p>Since they're indexing by language, I wonder if they have custom indexing/searching for each language, or if their ngram strategy is generic over all languages. Perhaps their "sparse grams" naturally tokenize differently for every language. Hard to tell when they leave out the juiciest part of the strategy though: "Assume you have some function that given a bigram gives a weight".<p>Search is so cool. I could talk about it all day.
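The posting list/inverted index mentioned at the top is simple enough to show in full: with doc IDs kept sorted, intersecting two posting lists is a two-pointer merge, which is the core loop of essentially every engine discussed here (a generic sketch, not any particular engine's code):

```python
def intersect(a, b):
    # Two-pointer intersection of sorted posting lists (ascending doc IDs).
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# E.g. docs containing "lim" intersected with docs containing "imi".
print(intersect([1, 3, 5, 8, 13], [2, 3, 8, 9]))  # [3, 8]
```

Ranking, the part Dan Luu says should dominate the time budget, only starts once this candidate set exists.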
I've been using this since it was still an email signup beta. I don't do anything too complicated, but man it's been invaluable to do exact-string searches across all of my organization's repos. I use it most days at work
Blackbird? I wonder if the name is coincidence or irony:<p><a href="https://en.m.wikipedia.org/wiki/Blackbird_(online_platform)" rel="nofollow">https://en.m.wikipedia.org/wiki/Blackbird_(online_platform)</a>
I feel search is the most complex domain, tech-wise. I always feel overwhelmed by how people design such systems. Would love to learn more about search. Any books or courses? Right now I can only do binary search.
I was working on a research project a while ago, and every time I searched for something specific, it immediately decided I was a bot after like 2-3 exact queries.<p>Ever since then, I've exclusively used Sourcegraph.
Hmm, not sure if I should delete my (2nd) GitHub account again. Just thinking about how much data they are getting from users; it could become the Facebook of Git.