I use their new code search a lot to grok how people use certain features, or implement certain things. But I do wish there was a way to filter out forks. Sometimes I search a string and just get a bunch of forks all with the same result. For example, searching a common class in a Rails app often just shows a bunch of rails/rails forks, which is a lot of noise to sift through when you're trying to see how devs commonly use a certain feature.
> Just use grep?
First though, let’s explore the brute force approach to the problem. We get this question a lot: “Why don’t you just use grep?” To answer that, let’s do a little napkin math using ripgrep on that 115 TB of content. On a machine with an eight core Intel CPU, ripgrep can run an exhaustive regular expression query on a 13 GB file cached in memory in 2.769 seconds, or about 0.6 GB/sec/core.<p>But you don't NEED to do this, do you? I'm ALREADY in a repository; I just don't want to check out, say, all of WebKit, I just need to find where a specific reference is defined.<p>Maybe, on a really serious day, I need to search an entire organization. But hardly ever.<p>I have never, in over a decade, wanted sophisticated symbolic searching from GitHub code search. I just need remote grep.<p>Why is code search not split into this 99% use case, plus the occasional global repository search, which can behave entirely differently?
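The napkin math in the quoted excerpt can be pushed one step further (a naive extrapolation of the blog's own numbers: decimal units, perfect parallelism, everything cached in memory):

```python
# Extrapolating ripgrep's measured throughput (from the excerpt above)
# to the full 115 TB corpus. Naive: assumes perfect scaling and no I/O cost.
corpus_gb = 115 * 1000            # 115 TB in GB (decimal)
throughput = 0.6                  # GB/sec/core, as measured
cores = 8

seconds = corpus_gb / (throughput * cores)
print(f"one exhaustive query: ~{seconds:,.0f} s (~{seconds / 3600:.1f} hours)")
```

That comes out to hours per query on a single machine, which is the blog's argument for building an index; whether you ever actually need a corpus-wide query is, as the comment says, another matter.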
My beef with GitHub's code search is that it doesn't distinguish between the definition of a symbol and the uses of the symbol, so you need to wade through 5 pages of results to get the one result you're looking for. I would contrast that to my IDE which usually scores a direct hit if I enter a search in the right box.<p>The indexing they talk about in that article seems like rearranging the deck chairs on the Titanic so far as that is concerned.
This is exciting! I see a lot of familiar pieces here that propagated from Google's Code Search, and I know a few people from Code Search went to GitHub, probably specifically to work on this. I always wondered why GitHub didn't invest in decent code search features, but I'm happy it's finally getting to the state of the art one step at a time. Some of the folks who went to GitHub to work on this are just incredible, and I have no doubt GitHub's code search will be amazing.<p>I also worked on something similar to the search engine described here, for the purpose of making auto-complete fast for C++ in Clangd. That was my intern project back in 2018 and it was very successful in reducing the latency of the auto-complete pipeline. That project was a lot of fun and was also based on Russ Cox's original Google Code Search trigram index. My implementation of the index is still largely untouched and is a hot path in Clangd. I made a huge effort to document it as much as I could, and the code is, I believe, very readable (although I'm obviously biased because I spent a lot of time with it).<p>Here is the implementation:<p><a href="https://github.com/llvm/llvm-project/tree/main/clang-tools-extra/clangd/index/dex">https://github.com/llvm/llvm-project/tree/main/clang-tools-e...</a><p>I also wrote a... very long design document about how exactly this works, so if you're interested in understanding the internals of a code search engine, you can check it out:<p><a href="https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiGmJ2jL1UwV91Kyx11gOI/edit" rel="nofollow">https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiG...</a>
As a comparison to Sourcegraph: Sourcegraph shards and indexes a repository at a time, and uses trigrams and bloom filters (to skip shards).<p>Github shards and indexes individual files according to their hashes. It also uses variable length ngrams (neat!). This makes horizontal scaling simpler, but also means more of the index needs to be scanned for org/repo-scoped queries ("Due to our sharding strategy, a query request must be sent to each shard in the cluster.").
The sparse grams solution to deal with stupidly common ngrams such as "for" or "tes" is very interesting.<p>I’d love to see more discussion of how they deal with the false positives, though. It looks like a positional index is being used to achieve this, but that usually blows out your index size.<p>Additional information about deduplication would be especially interesting to me as well. It seems to solve this quite well: I usually search for jQuery to test this, and it does not return multiple copies of different versions, which is a good indicator that it’s slightly fuzzy.<p>What I find really interesting about all the code search engines I know of is that each one implemented its own index. Nobody is using off-the-shelf software for this. I suspect that might be down to no off-the-shelf software providing a decent enough solution, and none providing a solution that scales at a decent cost.<p>I did a small comparison of GitHub code search a while ago <a href="https://twitter.com/boyter/status/1480667185475244036?s=61&t=nfND46d9rReCju7-aw457Q" rel="nofollow">https://twitter.com/boyter/status/1480667185475244036?s=61&t...</a> But I should note a lot has improved since then, and it looks like Sourcegraph now also defaults to AND-ing terms rather than exact match, so my complaints there are resolved.<p>Impressive work by GitHub. I am sure some of the people behind it will read this comment, so let me say well done to you all. I am very impressed. Also, please post more information like this; there is so little out there.
I really hope they release this soon and that it’s actually good.<p>The current search sucks ass, you can’t find anything.<p>I was trying to search for something in the WebKit source the other day and I had to use Sourcegraph because the GitHub search gave me zero results.
Hey everyone, I'm Colin from GitHub's code search team: happy to answer any questions people have about it. Also, you can sign up to get access here: <a href="https://github.com/features/code-search">https://github.com/features/code-search</a>
I really appreciate that this includes details about how search permissions work - how they ensure that search results include data from my private repos.<p>I'd always wondered how they implemented that: it turns out they add extra internal filters to their searches along the lines of "RepoIDs(...) or PublicRepo".<p>Question for the team: Do you have an additional permission check in the view layer before the results are shown to the end-user? I worry that if I switch a repo from public to private it may take a while for the code search index to catch up to the new permissions.
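A minimal sketch of what that query rewriting might look like (all names here are illustrative, not GitHub's actual API; only the filter shape follows the "RepoIDs(...) or PublicRepo" description above):

```python
def add_permission_filter(query, readable_repo_ids):
    # Rewrite a query so every result must come from a repo the user can
    # read, OR from any public repo. Hypothetical filter-tree shape.
    return {
        "and": [
            query,
            {"or": [
                {"repo_id_in": readable_repo_ids},
                {"public": True},
            ]},
        ]
    }

def matches(doc, q):
    # Tiny evaluator for the filter tree above.
    if "and" in q:
        return all(matches(doc, sub) for sub in q["and"])
    if "or" in q:
        return any(matches(doc, sub) for sub in q["or"])
    if "repo_id_in" in q:
        return doc["repo_id"] in q["repo_id_in"]
    if "public" in q:
        return doc["public"] == q["public"]
    if "term" in q:
        return q["term"] in doc["text"]
    return False

docs = [
    {"repo_id": 1, "public": True,  "text": "fn main"},
    {"repo_id": 2, "public": False, "text": "fn main"},  # private, not mine
    {"repo_id": 3, "public": False, "text": "fn main"},  # private, mine
]
q = add_permission_filter({"term": "main"}, readable_repo_ids={3})
print([d["repo_id"] for d in docs if matches(d, q)])  # repo 2 filtered out
```

A second check at render time, as the question suggests, would guard against index staleness after a public-to-private flip; whether GitHub does that isn't stated in the post.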
I’ve been using the new code search for a couple of months and I like it, but the UI is kind of antagonistic to how I typically want to search for things. For one, the new experience doesn’t actually load code onto the page, it does some sort of lazy loading thing as you scroll around, so ⌘F doesn’t work. I understand that there’s a custom search box to try to get around this but it’s pretty slow and fiddly and I don’t really want to use it. I also find the layout to be pretty annoying, because invariably there’s a symbol panel on the side that doesn’t work for the code I want to look at, and then it’s just there taking space. If I hit “t” to enter a file name and start typing the text field loses focus after a second and I need to click on it again. I know there are a couple of people on the team in this thread: I search a lot of code on GitHub and I feel like there’s a couple of tweaks that would greatly improve my experience. Like, I think I could even show you a video of all the places where the UI has gotten less usable for me. What would be the best way to get this feedback to you? I’ve posted stuff on the forum or whatever but it’s unclear to me if this is the intended way to raise issues.
I really like the new search, though sometimes it is a bit deceptive. E.g. when jumping to a function name by clicking on a piece of code, suddenly you are in an entirely different code base with an unrelated function that happens to share the name.<p>It feels like GitHub code browsing is a step between a full editor with LSP and a static site. I hope they work out the kinks and make it more smooth
This is a great intro / overview of full-text search for those wondering how to build their own search engine.<p>It's a great 101-level exercise to write an inverted index implementation (you can do it in an afternoon), and then expand it to a leaf/aggregator architecture in follow-up exercises.
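The afternoon version of that exercise might look like this (whitespace tokenization, AND semantics over posting sets; a toy sketch, not what GitHub does):

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: map each token to the set of doc IDs containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.split():
            index[token].add(doc_id)
    return index

def search(index, query):
    # AND semantics: intersect the posting sets of every query token.
    tokens = query.split()
    if not tokens:
        return set()
    result = index.get(tokens[0], set()).copy()
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = ["the quick brown fox", "the lazy dog", "quick dog"]
idx = build_index(docs)
print(sorted(search(idx, "quick dog")))  # only doc 2 has both tokens
```

The follow-up exercises are then sharding `docs` across leaves and merging per-leaf results in an aggregator.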
With current search, I can search [0] the Django repo for a class that definitely exists [1] in Django, there are 0 code results. Zero. GitHub search is mystifyingly bad, I hope this is a LOT better.<p>[0] <a href="https://github.com/django/django/search?q=DeleteView&type=code">https://github.com/django/django/search?q=DeleteView&type=co...</a><p>[1] <a href="https://github.com/django/django/blob/main/django/views/generic/edit.py#L268">https://github.com/django/django/blob/main/django/views/gene...</a>
I just want to say thank-you to the folks who work on Code Search at GitHub.<p>It's the number one way I research and understand new libraries/API's and programming languages.<p>There's a lot more you can learn from usage in the wild than tutorial posts sometimes.
Was looking for more details on the data structure 'Geometric filter' mentioned in the footnotes.
Couldn't find anything (a few unrelated papers on object recognition aside). If anybody can share anything, that would be great!
Damn, it's about time, the current search sucks. What I have found to work very well is SourceGraph; they offer search for public repos. Maybe this'll be an alternative to it.
I wish they provided short names for their filters. For example: instead of "withContext language:python path:tests", I could write "withContext l:python p:tests".
Blackbird being written in Rust is a natural approach. Trying to build the whole thing with a single technology is unwise (looking at you, isomorphic JavaScript).
My biggest feature request would be sorting or filtering by code/commit/repo age, or even repo stars.<p>Most often I end up using code search for figuring out where a piece of code originated, just to find thousands of random projects that have also copied the same code verbatim. Sorting for "relevance" or "latest/oldest indexed" are equally useless.
In general, I really recommend code search as a tool for supplementing reading the documentation and source code of your dependencies (you <i>are</i> reading the source code, right?). I reach for it almost every day, and I find it's a reliable tool for identifying "the right way" to use a library, especially one that isn't fully documented.
Not to diminish this excellent work, but:<p>1) I never want to search all repos globally. At worst I want to search all of my org's repos.<p>2) the search UI is a little clunky, in a way I'd need to be using it again to remember.<p>Between those two I think there's loads of progress to be made outside of raw search power. Of course it's nice to have that, but that's what I'm really after.
If you ever want to search binary files (image, video, pdf, etc.) within github repos: <a href="https://learn.mixpeek.com/github-search/" rel="nofollow">https://learn.mixpeek.com/github-search/</a>
So in the sparse grams explanation, what are the bigram weights?<p>Is it inverse frequency, so common bigrams get split last? And the goal is to be able to search on a larger gram that covers the more common trigrams as often as possible?
This looks delightful!<p>One nit I have about current search: I’ll look something up and find I’m getting results for some obtuse commit in some old branch somewhere. I’d like to be able to optionally say “latest commit on branches only please” or “main branch only please.”<p>Another thing, which might betray that I don’t understand search all that well: language-aware searching that knows, for example, that a single or a double quote are syntactically interchangeable. Don’t omit half the results because I used one quote over the other when looking up `interpolation = 'nearest'`
Will this allow for a happy closure of this question about searching partial words? [0]<p>Like searching for "OPTION" and getting "-DOPTION=TRUE" among the results. Very commonly needed to find all usages of a flag, even instances where the flag is being passed to (at least, that I know of) CMake and Meson.<p>[0]: <a href="https://stackoverflow.com/questions/43891605/search-partial-words-in-github-organizations-code" rel="nofollow">https://stackoverflow.com/questions/43891605/search-partial-...</a>
> Shard by Git blob object ID which gives us a nice way of evenly distributing documents between the shards while avoiding any duplication. There won’t be any hot servers due to special repositories and we can easily scale the number of shards as necessary.<p>What exactly do they mean by "special repositories" here?
The biggest problems I have with their code search are basic usability features, not the search itself. I need a way to exclude private repositories in the result so I’m not clogged by internal instances of what I’m looking for. I need the UI to improve so I don’t have to go to advanced search for every filter I want to do.
Interesting stuff, but I was curious how they search repeated letters through the ngram index. I understand their example search with the string “limits” (find the intersection of “lim”, “imi”, “mit” and “its”). However, if the user wants to search the string “aaaaa”, how would they go about that?
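For what it's worth, in a classic trigram scheme (e.g. Russ Cox's design, which the post builds on) such a query just yields fewer distinct trigrams, and the index only ever produces candidates that a verification scan then confirms. A toy sketch, not GitHub's implementation:

```python
def trigrams(s):
    # All distinct trigrams of s; "aaaaa" collapses to just {"aaa"}.
    return {s[i:i + 3] for i in range(len(s) - 2)}

def search(docs, query):
    qgrams = trigrams(query)
    # Index lookup: a candidate must contain every query trigram...
    candidates = [d for d in docs if qgrams <= trigrams(d)]
    # ...then a scan of the candidates removes false positives.
    return [d for d in candidates if query in d]

docs = ["aaab", "xaaaaay", "aabaa"]
print(sorted(trigrams("aaaaa")))   # ['aaa']
print(search(docs, "aaaaa"))       # "aaab" matched "aaa" but fails the scan
```

So repeated letters cost nothing special at lookup time; they just shift more work onto the post-filter scan.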
Search is a fascinating topic because it's such a fundamental problem and every search engine is based around the same extremely simple data structure (Posting list/inverted index). Despite that, search isn't easy and every search engine seems to be quite unique. It also seems to get exponentially harder with scale.<p>You can write your own search engine that will perform very well on a surprisingly large amount of data, even doing naive full-text search. A search tool I came across a while back is a great example of something at that scale: <a href="https://pagefind.app/" rel="nofollow">https://pagefind.app/</a>.<p>For anyone who doesn't know anything about search I highly recommend reading this (It's mentioned in the blog post as well): <a href="https://swtch.com/~rsc/regexp/regexp4.html" rel="nofollow">https://swtch.com/~rsc/regexp/regexp4.html</a>.<p>Algolia also has a series of blog posts describing how their search engine works: <a href="https://www.algolia.com/blog/engineering/inside-the-algolia-engine-part-1-indexing-vs-search/" rel="nofollow">https://www.algolia.com/blog/engineering/inside-the-algolia-...</a>.<p>---<p>It's interesting that GitHub seems to have quite a few shards. Algolia basically has a monolithic architecture with 3 different hosts which replicate data and they embed their search engine in Nginx:<p><i>"Our search engine is a C++ module which is directly embedded inside Nginx. So when the query enters Nginx, we directly run it through the search engine and send it back to the client."</i><p>I'm guessing GitHub probably doesn't store repos in a custom binary format like Algolia does though:<p><i>"Each index is a binary file in our own format. 
We put the information in a specific order so that it is very fast to perform queries on it."</i><p><i>"Our Nginx C++ module will directly open the index file in memory-mapped mode in order to share memory between the different Nginx processes and will apply the query on the memory-mapped data structure."</i><p><a href="https://stackshare.io/posts/how-algolia-built-their-realtime-search-as-a-service-product" rel="nofollow">https://stackshare.io/posts/how-algolia-built-their-realtime...</a><p>100ms p99 seems pretty good, but I'm curious what the p50 is and how much time is spent searching vs ranking. I've seen Dan Luu say that the majority of time should be spent ranking rather than searching, and when I've snooped on <a href="https://hn.algolia.com" rel="nofollow">https://hn.algolia.com</a> I've seen single-digit millisecond search times in the responses, which seems to corroborate this.<p>I'm curious why they chose to optimize ingestion when it only took 36hrs to re-index the entire corpus without optimizations. A 50% speedup is nice, but 36hrs and 18hrs are the same order of magnitude, and it sounds like there was a fair amount of engineering effort put into this. An index 1/5 of the size is pretty sweet though; I have to assume that's a bigger win than 50% faster ingestion.<p>Since they're indexing by language, I wonder if they have custom indexing/searching for each language, or if their ngram strategy is generic over all languages. Perhaps their "sparse grams" naturally tokenize differently for every language. Hard to tell when they leave out the juiciest part of the strategy though: "Assume you have some function that given a bigram gives a weight".<p>Search is so cool. I could talk about it all day.
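The posting list/inverted index mentioned at the top is simple enough to show in full: with doc IDs kept sorted, intersecting two posting lists is a two-pointer merge, which is the core loop of essentially every engine discussed here (a generic sketch, not any particular engine's code):

```python
def intersect(a, b):
    # Two-pointer intersection of sorted posting lists (ascending doc IDs).
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# E.g. docs containing "lim" intersected with docs containing "imi".
print(intersect([1, 3, 5, 8, 13], [2, 3, 8, 9]))  # [3, 8]
```

Ranking, the part Dan Luu says should dominate the time budget, only starts once this candidate set exists.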
I've been using this since it was still an email signup beta. I don't do anything too complicated, but man it's been invaluable to do exact-string searches across all of my organization's repos. I use it most days at work
Blackbird? I wonder if the name is coincidence or irony:<p><a href="https://en.m.wikipedia.org/wiki/Blackbird_(online_platform)" rel="nofollow">https://en.m.wikipedia.org/wiki/Blackbird_(online_platform)</a>
I feel search is the most complex domain, tech-wise. I always feel overwhelmed by how people design such systems. Would love to learn more about search. Any books or courses? Right now I can only do binary search.
I was working on a research project a while ago, and every time I searched for something specific, it immediately decided I was a bot after like 2-3 exact queries.<p>Ever since then, I've exclusively used Sourcegraph.
Hmm, not sure if I should delete my (2nd) GitHub account again. Just thinking about how much data they are getting from users; it could become the Facebook of Git.