Google made a mistake in killing Code Search. Indexing the world's source code and making it searchable is so obviously part of their core mission that I wonder how this decision even got made.<p>Yeah, code search is a niche market numerically speaking, but intellectually and economically (considering the economic impact of software) it is vital. Google was doing so much better a job of it than anybody else that they completely owned the space. How could that not be worth what it cost them to run it? And now we are bereft of anything good.<p>I used to use Google Code Search like this: encounter an unfamiliar language or API construct, go to Code Search, get page after page of real-world usages. Or like this: wonder how to do X, imagine what code that does X might say, and search for snippets of that until hitting a working example of X. It was a powerful learning device that I am sad to lose. I sure hope a startup comes along to get search right for the world's source code. GitHub ought to, but their search sucks.<p>In any case, congratulations to Russ Cox et al. on a brilliant piece of work.
Interesting solution. I did something a little different for searchco.de when I was implementing regex search.<p>Essentially I take a regex such as [cb]at and expand it out to get all of the concrete terms, in this case cat and bat, then do a standard full-text search on those terms to get as close a match as possible. I then loop over the results to keep only those which are exact matches.<p>It's actually more complex than that, with some other trickery (infix and prefix indexing), but it works reasonably well, although I am still ironing out some kinks.
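<p>A stripped-down sketch of the expansion step in Go (not the actual searchco.de code, just the shape of the idea; it only handles a single leading character class):

    package main

    import (
        "fmt"
        "regexp"
    )

    // expandClass expands a pattern with one leading character class,
    // such as "[cb]at", into its concrete search terms ("cat", "bat").
    // A real implementation would walk the parsed regex instead.
    func expandClass(pattern string) []string {
        class := regexp.MustCompile(`^\[([^\]]+)\](.*)$`)
        m := class.FindStringSubmatch(pattern)
        if m == nil {
            return []string{pattern} // no class: already a literal term
        }
        var terms []string
        for _, c := range m[1] {
            terms = append(terms, string(c)+m[2])
        }
        return terms
    }

    func main() {
        fmt.Println(expandClass("[cb]at")) // [cat bat]
        // These terms feed the full-text search; the original regex is
        // then re-run over the candidates to keep only exact matches.
    }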
This is the most awesome kind of solution: built out of a few mostly off-the-shelf moving parts, simple and easy to understand, and a perfect fit for the problem at hand. This write-up would be awesome teaching material for anyone moving from "I know how to write a program" to "how do I build a clean, elegant system to solve a specific problem?"
To be fair, the reason Perl, Python, and PCRE (which all use pretty much the same regex syntax) don't use the linear-time algorithms is that they can't. Features like look-around and matching what you've already matched (/(\w+) \1/ to find repeated words, for example) give you more expressive power than regular languages have, but they also take away the linear-time algorithms.
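<p>A quick way to see the trade-off, assuming Go is handy (its regexp package is built on the linear-time RE2 algorithms, so a backreference is a compile error rather than a feature):

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // Backreferences require remembering what was matched earlier,
        // which the linear-time automaton approach cannot do.
        _, err := regexp.Compile(`(\w+) \1`)
        fmt.Println(err)
        // prints something like:
        // error parsing regexp: invalid escape sequence: `\1`
    }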
This post is very interesting and informative, especially the part about indexing trigrams instead of words:<p>> <i>Regular expression matches do not always line up nicely on word boundaries, so the inverted index cannot be based on words like in the previous example. Instead, we can use an old information retrieval trick and build an index of n-grams, substrings of length n. This sounds more general than it is. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is.</i><p>As he explains, in order to perform a search using regular expressions, the RE is first converted to an "ordinary" query string, which finds a set of candidate documents; the candidate documents are then loaded in memory, and the "real" regular expression run against them, in order to keep only the matching documents.<p>He used Google's retrieval engine to build the trigram index, but he doesn't say how he identified "code" amidst the ocean of ordinary web pages.<p>He does say this regarding an example implementation:<p>> <i>The indexer (...) rejects files that are unlikely to be interesting, such as those that (...) have very long lines, or that have a very large number of distinct trigrams.</i><p>so maybe that's what Code Search did too.<p>What I'm wondering is this: wouldn't it be interesting to have a full web index based on trigrams, one that would let us search not only with regular expressions but also with wildcards (at the end of words or at the beginning)?<p>Maybe it would be too complex to build such an index for the whole web, but for a limited corpus (such as one's mail) it would be very useful.
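<p>To make the indexing side concrete, here is a minimal sketch (mine, not the article's indexer) of the set of trigrams a document contributes; the query side then ANDs the posting lists of the trigrams any match must contain, e.g. the article's /Google.*Search/ example requires Goo, oog, ogl, gle, Sea, ear, arc and rch:

    package main

    import "fmt"

    // trigrams returns the distinct 3-byte substrings of text, the unit
    // the inverted index is built on.
    func trigrams(text string) map[string]bool {
        set := make(map[string]bool)
        for i := 0; i+3 <= len(text); i++ {
            set[text[i:i+3]] = true
        }
        return set
    }

    func main() {
        for t := range trigrams("Google Code Search") {
            fmt.Printf("%q ", t)
        }
        fmt.Println()
    }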
Russ's article is an excellent write-up and explanation.<p>However, finite-state-automaton regex implementations have existed for years (e.g., for Java: <a href="http://cs.au.dk/~amoeller/automaton" rel="nofollow">http://cs.au.dk/~amoeller/automaton</a>), without the backtracking features, of course. Also of interest is the benchmark data at <a href="http://tusker.org/regex/regex_benchmark.html" rel="nofollow">http://tusker.org/regex/regex_benchmark.html</a>
This is a really great tool. If it could take the first .csearchindex found going up the directory tree as the current index (somewhat like git does with .git directories), it could easily top rgrep/ack for searching within projects; just add line numbers and some match coloring.
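<p>Something like this (a hypothetical helper, not part of the codesearch tool) is all the lookup would take:

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // findIndex walks from dir up toward the filesystem root and returns
    // the first .csearchindex it finds, the way git discovers .git.
    func findIndex(dir string) (string, bool) {
        for {
            candidate := filepath.Join(dir, ".csearchindex")
            if _, err := os.Stat(candidate); err == nil {
                return candidate, true
            }
            parent := filepath.Dir(dir)
            if parent == dir { // reached the root without finding one
                return "", false
            }
            dir = parent
        }
    }

    func main() {
        wd, _ := os.Getwd()
        if idx, ok := findIndex(wd); ok {
            fmt.Println("using index:", idx)
        } else {
            fmt.Println("no .csearchindex found")
        }
    }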
Not only does Russ Cox write code and English really well, but he's also on HN. Thanks for the article rsc!<p><a href="http://news.ycombinator.com/user?id=rsc" rel="nofollow">http://news.ycombinator.com/user?id=rsc</a>
I like it. The word splitting is very interesting, though today, in 2012, I would be hard pressed to provide a reason to use that technique. Splitting an index is a classic complexity/resource trade-off (it gives your index a very predictable, compact footprint), but today memory is cheap, wide, uniform, and predictable: indexes are now cheap and highly specialized, and the complexity can be traded away for simplicity. My point is that this design solves a class of very expensive searches with ease, leading-wildcard searches among them. Also, I couldn't really tell from your code (you may be doing this already), but reverse the trigrams in your generated query; ordered properly, your search will be a lot more efficient.
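<p>To spell out one reading of that last suggestion (my interpretation, not code from the article): when ANDing posting lists, visit the rarest trigram first so the candidate set shrinks as fast as possible.

    package main

    import (
        "fmt"
        "sort"
    )

    // intersect ANDs the (sorted) posting lists of the query trigrams,
    // starting with the shortest list so each merge step works on the
    // smallest possible candidate set.
    func intersect(postings map[string][]int, query []string) []int {
        if len(query) == 0 {
            return nil
        }
        sort.Slice(query, func(i, j int) bool {
            return len(postings[query[i]]) < len(postings[query[j]])
        })
        result := postings[query[0]]
        for _, t := range query[1:] {
            result = merge(result, postings[t])
            if len(result) == 0 {
                break // no document can match; stop early
            }
        }
        return result
    }

    // merge returns the sorted intersection of two sorted ID lists.
    func merge(a, b []int) []int {
        var out []int
        i, j := 0, 0
        for i < len(a) && j < len(b) {
            switch {
            case a[i] < b[j]:
                i++
            case a[i] > b[j]:
                j++
            default:
                out = append(out, a[i])
                i++
                j++
            }
        }
        return out
    }

    func main() {
        postings := map[string][]int{
            "Goo": {1, 2, 3, 7, 9},
            "oog": {1, 3, 7},
            "ogl": {3, 7},
        }
        fmt.Println(intersect(postings, []string{"Goo", "oog", "ogl"})) // [3 7]
    }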
What were the criteria for determining that there were too few distinct 2-grams and too many distinct 4-grams? Did it just come down to memory: 2-gram posting lists too long to be selective, a 4-gram index too large to hold, and 3-grams the one size that barely fits?
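<p>A quick sketch I'd use to eyeball that trade-off on a real corpus (hypothetical, not from the article):

    package main

    import (
        "fmt"
        "os"
    )

    // distinctNgrams counts the distinct n-byte substrings in data.
    // Few distinct n-grams means long, unselective posting lists;
    // many distinct n-grams means a large index.
    func distinctNgrams(data []byte, n int) int {
        seen := make(map[string]struct{})
        for i := 0; i+n <= len(data); i++ {
            seen[string(data[i:i+n])] = struct{}{}
        }
        return len(seen)
    }

    func main() {
        data, err := os.ReadFile(os.Args[1])
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        for n := 2; n <= 4; n++ {
            fmt.Printf("%d-grams: %d distinct\n", n, distinctNgrams(data, n))
        }
    }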
tl;dr<p>The original basic and extended RE implementations (when backreferencing is not used) are significantly faster than the engines most programmers traditionally rave about, e.g., Perl's REs.<p>Tell me something I didn't know.<p>He thus used such 30-year-old code as a model and easily topped the speeds of the built-in RE capabilities of today's popular scripting languages.<p>Common sense is underrated.
It's a nice design (the trigram index in particular), but the overall product still failed.<p>My guess is that regular-expression search is just not as useful as the full-text search that general Google Search already does.