Google made a mistake in killing Code Search. Indexing the world's source code and making it searchable is so obviously part of their core mission that I wonder how this decision even got made.<p>Yeah, code search is a niche market numerically speaking, but intellectually and economically (considering the economic impact of software) it is vital. Google was doing so much better a job of it than anybody else that they completely owned the space. How could that not be worth what it cost them to run it? And now we are bereft of anything good.<p>I used to use Google Code Search like this: encounter an unfamiliar language or API construct, go to Code Search, get page after page of real-world usages. Or like this: wonder how to do X, imagine what code that does X might say, and search for snippets of that until hitting a working example of X. It was a powerful learning device that I am sad to lose. I sure hope a startup comes along to get search right for the world's source code. GitHub ought to, but their search sucks.<p>In any case, congratulations to Russ Cox et al. on a brilliant piece of work.
Interesting solution. I did something a little different for searchco.de when I was implementing regex search.<p>Essentially I take a regex such as [cb]at and expand it out to get all of the concrete terms, in this case cat and bat, then do a standard full-text search on those terms to get as close a match as possible. I then loop over the results to keep only those which are exact matches.<p>It's actually more complex than that, with some other trickery (infix and prefix indexing), but it works reasonably well, although I am still ironing out some kinks.
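<p>A stripped-down sketch of the expansion step in Go (not the actual searchco.de code, just the shape of the idea; it only handles a single leading character class):

    package main

    import (
        "fmt"
        "regexp"
    )

    // expandClass expands a pattern with one leading character class,
    // such as "[cb]at", into its concrete search terms ("cat", "bat").
    // A real implementation would walk the parsed regex instead.
    func expandClass(pattern string) []string {
        class := regexp.MustCompile(`^\[([^\]]+)\](.*)$`)
        m := class.FindStringSubmatch(pattern)
        if m == nil {
            return []string{pattern} // no class: already a literal term
        }
        var terms []string
        for _, c := range m[1] {
            terms = append(terms, string(c)+m[2])
        }
        return terms
    }

    func main() {
        fmt.Println(expandClass("[cb]at")) // [cat bat]
        // These terms feed the full-text search; the original regex is
        // then re-run over the candidates to keep only exact matches.
    }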
This is the most awesome kind of solution: built out of a few mostly off-the-shelf moving parts, simple and easy to understand, and a perfect fit for the problem at hand. This write-up would be awesome teaching material for anyone moving from "I know how to write a program" to "how do I build a clean, elegant system to solve a specific problem?"
To be fair, the reason Perl, Python, and PCRE (which all use pretty much the same regex syntax) don't use the linear-time algorithms is that they can't. Features like look-around and matching what you've already matched (/(\w+) \1/ to find repeated words, for example) give you more expressive power than regular languages have, but they also take away the linear-time algorithms.
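<p>A quick way to see the trade-off, assuming Go is handy (its regexp package is built on the linear-time RE2 algorithms, so a backreference is a compile error rather than a feature):

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // Backreferences require remembering what was matched earlier,
        // which the linear-time automaton approach cannot do.
        _, err := regexp.Compile(`(\w+) \1`)
        fmt.Println(err)
        // prints something like:
        // error parsing regexp: invalid escape sequence: `\1`
    }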
This post is very interesting and informative, especially the part about indexing trigrams instead of words:<p>> <i>Regular expression matches do not always line up nicely on word boundaries, so the inverted index cannot be based on words like in the previous example. Instead, we can use an old information retrieval trick and build an index of n-grams, substrings of length n. This sounds more general than it is. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is.</i><p>As he explains, in order to perform a search using regular expressions, the RE is first converted to an "ordinary" query string, which finds a set of candidate documents; the candidate documents are then loaded in memory, and the "real" regular expression run against them, in order to keep only the matching documents.<p>He used Google's retrieval engine to build the trigram index, but he doesn't say how he identified "code" amidst the ocean of ordinary web pages.<p>He does say this regarding an example implementation:<p>> <i>The indexer (...) rejects files that are unlikely to be interesting, such as those that (...) have very long lines, or that have a very large number of distinct trigrams.</i><p>so maybe that's what Code Search did too.<p>What I'm wondering is this: wouldn't it be interesting to have a full web index based on trigrams, one that would let us search not only with regular expressions but also with wildcards (at the end of words or at the beginning)?<p>Maybe it would be too complex to build such an index for the whole web, but for a limited corpus (such as one's mail) it would be very useful.
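<p>To make the indexing side concrete, here is a minimal sketch (mine, not the article's indexer) of the set of trigrams a document contributes; the query side then ANDs the posting lists of the trigrams any match must contain, e.g. the article's /Google.*Search/ example requires Goo, oog, ogl, gle, Sea, ear, arc and rch:

    package main

    import "fmt"

    // trigrams returns the distinct 3-byte substrings of text, the unit
    // the inverted index is built on.
    func trigrams(text string) map[string]bool {
        set := make(map[string]bool)
        for i := 0; i+3 <= len(text); i++ {
            set[text[i:i+3]] = true
        }
        return set
    }

    func main() {
        for t := range trigrams("Google Code Search") {
            fmt.Printf("%q ", t)
        }
        fmt.Println()
    }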
Russ's article is an excellent write-up and explanation.<p>However, finite-state-automaton regex implementations have existed for years (e.g., for Java: <a href="http://cs.au.dk/~amoeller/automaton" rel="nofollow">http://cs.au.dk/~amoeller/automaton</a>), without the backtracking features, of course. Also of interest is the benchmark data at <a href="http://tusker.org/regex/regex_benchmark.html" rel="nofollow">http://tusker.org/regex/regex_benchmark.html</a>
This is a really great tool. If it could take the first .csearchindex found going up the directory tree as the current index (somewhat like git does with .git directories), it could easily top rgrep/ack for searching within projects; just add line numbers and some match coloring.
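<p>Something like this (a hypothetical helper, not part of the codesearch tool) is all the lookup would take:

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // findIndex walks from dir up toward the filesystem root and returns
    // the first .csearchindex it finds, the way git discovers .git.
    func findIndex(dir string) (string, bool) {
        for {
            candidate := filepath.Join(dir, ".csearchindex")
            if _, err := os.Stat(candidate); err == nil {
                return candidate, true
            }
            parent := filepath.Dir(dir)
            if parent == dir { // reached the root without finding one
                return "", false
            }
            dir = parent
        }
    }

    func main() {
        wd, _ := os.Getwd()
        if idx, ok := findIndex(wd); ok {
            fmt.Println("using index:", idx)
        } else {
            fmt.Println("no .csearchindex found")
        }
    }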
Not only does Russ Cox write code and English really well, but he's also on HN. Thanks for the article rsc!<p><a href="http://news.ycombinator.com/user?id=rsc" rel="nofollow">http://news.ycombinator.com/user?id=rsc</a>
I like it. The word splitting is very interesting, though today, in 2012, I would be hard pressed to provide a reason to use that technique. Splitting an index is a classic complexity/resource trade-off (it gives your index a very predictable, compact footprint), but today memory is cheap, wide, uniform, and predictable: indexes are now cheap and highly specialized, and the complexity can be traded away for simplicity. My point is that this design solves a class of very expensive searches with ease, leading-wildcard searches among them. Also, I couldn't really tell from your code (you may be doing this already), but reverse the trigrams in your generated query; ordered properly, your search will be a lot more efficient.
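<p>To spell out one reading of that last suggestion (my interpretation, not code from the article): when ANDing posting lists, visit the rarest trigram first so the candidate set shrinks as fast as possible.

    package main

    import (
        "fmt"
        "sort"
    )

    // intersect ANDs the (sorted) posting lists of the query trigrams,
    // starting with the shortest list so each merge step works on the
    // smallest possible candidate set.
    func intersect(postings map[string][]int, query []string) []int {
        if len(query) == 0 {
            return nil
        }
        sort.Slice(query, func(i, j int) bool {
            return len(postings[query[i]]) < len(postings[query[j]])
        })
        result := postings[query[0]]
        for _, t := range query[1:] {
            result = merge(result, postings[t])
            if len(result) == 0 {
                break // no document can match; stop early
            }
        }
        return result
    }

    // merge returns the sorted intersection of two sorted ID lists.
    func merge(a, b []int) []int {
        var out []int
        i, j := 0, 0
        for i < len(a) && j < len(b) {
            switch {
            case a[i] < b[j]:
                i++
            case a[i] > b[j]:
                j++
            default:
                out = append(out, a[i])
                i++
                j++
            }
        }
        return out
    }

    func main() {
        postings := map[string][]int{
            "Goo": {1, 2, 3, 7, 9},
            "oog": {1, 3, 7},
            "ogl": {3, 7},
        }
        fmt.Println(intersect(postings, []string{"Goo", "oog", "ogl"})) // [3 7]
    }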
What were the criteria for determining that there were too few distinct 2-grams and too many distinct 4-grams? Did it just come down to memory: 2-gram posting lists too long to be selective, a 4-gram index too large to hold, and 3-grams the one size that barely fits?
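<p>A quick sketch I'd use to eyeball that trade-off on a real corpus (hypothetical, not from the article):

    package main

    import (
        "fmt"
        "os"
    )

    // distinctNgrams counts the distinct n-byte substrings in data.
    // Few distinct n-grams means long, unselective posting lists;
    // many distinct n-grams means a large index.
    func distinctNgrams(data []byte, n int) int {
        seen := make(map[string]struct{})
        for i := 0; i+n <= len(data); i++ {
            seen[string(data[i:i+n])] = struct{}{}
        }
        return len(seen)
    }

    func main() {
        data, err := os.ReadFile(os.Args[1])
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        for n := 2; n <= 4; n++ {
            fmt.Printf("%d-grams: %d distinct\n", n, distinctNgrams(data, n))
        }
    }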
tl;dr<p>The original basic and extended RE implementations (when backreferencing is not used) are significantly faster than the engines most programmers traditionally rave about, e.g., Perl's REs.<p>Tell me something I didn't know.<p>He thus used such 30-year-old code as a model and easily topped the speeds of the built-in RE capabilities of today's popular scripting languages.<p>Common sense is underrated.
It's a nice design (the trigram index in particular), but the overall product still failed.<p>My guess is that regular-expression search is just not as useful as the full-text search that general Google Search already does.