Python code to solve xkcd 1313 by Peter Norvig

447 points by weslly, more than 11 years ago

20 comments

temuze, more than 11 years ago

This is a great article. It's pretty fun to play around with this heuristic:
<pre><code>lambda c: 3*len(matches(c, uncovered)) - len(c)</code></pre>
Here's a trivial way to explore it: say we generalize the heuristic to H(a, b).
<pre><code>H(a,b) = lambda c: a*len(matches(c, uncovered)) - b*len(c)</code></pre>
The original heuristic is H(3,1) by this definition. Then we can play around with a and b to see if we get shorter regexes.
<pre><code>def findregex_lambda(winners, losers, a, b):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of candidate components, then pick from them to cover winners.
    # On each iteration, add the best component to 'cover'; finally disjoin them together.
    pool = candidate_components(winners, losers)
    cover = []
    while winners:
        best = max(pool, key=lambda c: a*len(matches(c, winners)) - b*len(c))
        cover.append(best)
        pool.remove(best)
        winners = winners - matches(best, winners)
    return '|'.join(cover)

>>> findregex_lambda(starwars, startrek, 3, 1)
' T|E.P| N'
>>> findregex_lambda(starwars, startrek, 3, 2)
' T|B| N| M'
</code></pre>
Or, to automate this:
<pre><code>def best_H_heuristic(winners, losers):
    d = {(a,b) : len(findregex_lambda(winners, losers, a,b))
         for a in range(0,4) for b in range(0,4)}
    return min(d, key=d.get)

>>> best_H_heuristic(starwars, startrek)
(3,1)
</code></pre>
Looks like H(3,1) is pretty good for this case. What about the NFL teams?
<pre><code>>>> best_H_heuristic(nfl_in, nfl_out)
(3, 2)
>>> findregex_lambda(nfl_in, nfl_out, 3, 1)
'pa|g..s|4|fs|sa|se|lt|os'
>>> findregex_lambda(nfl_in, nfl_out, 3, 2)
'pa|ch|4|e.g|sa|se|lt|os'
</code></pre>
Not the best heuristic there.

H(3,1) wins or ties for the boys/girls set, the left/right set and the drugs/cities set, which just goes to show that picking a heuristic off a gut guess isn't such a bad approach.

You could also explore heuristics of different forms:
<pre><code>M(a,b,d,e) = lambda c: a*len(matches(c, uncovered))**b - d*len(c)**e</code></pre>
Or try completely different formats:
<pre><code>L(a,b) = lambda c: a*log(len(matches(c, uncovered))) - b*len(c)</code></pre>

throwaway_yy2Di, more than 11 years ago

I don't know why Randall's regex incorrectly (?) matches "Fremont", but it's worth noting that Wikipedia's primary spelling has an acute accent, "Frémont": <a href="https://en.wikipedia.org/wiki/John_C._Frémont" rel="nofollow">https://en.wikipedia.org/wiki/John_C._Frémont</a>

j2kun, more than 11 years ago

One thing not mentioned in this article:

1. The greedy algorithm has an O(log(n)) approximation ratio, meaning it produces a regex guaranteed to use a number of terms within a multiplicative O(log(n)) factor of the optimal regex.

2. Unless P = NP, set cover cannot be approximated better than the greedy algorithm. In other words, the only general solutions you'll find (unless you're using some special insight about how regular expressions cover sets of strings) will be no better than a constant-factor improvement in produced regex size over the greedy algorithm.

That being said, regexes (especially disjunctions of small regexes) are not arbitrary sets. So this problem is a special case of set cover, and may well have efficient exact solutions.
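For readers who have not seen it, the greedy set-cover strategy referred to above can be sketched roughly as follows. This is a generic illustration, not code from the article; the function name `greedy_set_cover` and the toy data are made up for the example.

```python
def greedy_set_cover(universe, subsets):
    """Pick subset names greedily until every element of universe is covered.

    subsets maps a (hypothetical) name to the set of elements it covers.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Greedy step: take the subset covering the most still-uncovered elements.
        best = max(subsets, key=lambda name: len(uncovered & subsets[name]))
        if not uncovered & subsets[best]:
            raise ValueError("the subsets cannot cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


toy_subsets = {"a": {1, 2, 3}, "b": {2, 4}, "c": {3, 4}, "d": {4, 5}}
cover = greedy_set_cover({1, 2, 3, 4, 5}, toy_subsets)
print(cover)
```

In the regex setting, the universe is the set of winners and each candidate component plays the role of a subset (the winners it matches); the article's heuristic additionally penalizes component length.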

blt, more than 11 years ago

I love Norvig's Python posts. He really gets the spirit of the language and has fun with it.

ddebernardy, more than 11 years ago

This was posted a few days ago on Code Golf: <a href="http://codegolf.stackexchange.com/questions/17718/meta-regex-golf" rel="nofollow">http://codegolf.stackexchange.com/questions/17718/meta-regex...</a>

That link includes a Perl 10-liner to do the same.

haberman, more than 11 years ago

I thought it was going to be meta-meta-regex golf, and couldn't imagine how that would be possible. But meta-regex golf is an interesting exercise, and is far more tractable. :)

firegrind, more than 11 years ago

When I read 'subtitles', I wondered about the .srt files of the movies.

tlarkworthy, more than 11 years ago

Exercise for the reader: write a regex to distinguish random noise from English.

EDIT: possibly down-voted because someone thought it was sarcastic? I was actually thinking of this problem before the XKCD comic, for detecting hashes on hard drives efficiently...

a3_nm, more than 11 years ago

Interestingly, finding a minimal-size regexp satisfying a set of positive and negative examples (words that should match, and should not match) is NP-hard. Here is a nice discussion: <a href="http://cstheory.blogoverflow.com/2011/08/on-learning-regular-languages/" rel="nofollow">http://cstheory.blogoverflow.com/2011/08/on-learning-regular...</a>

z-e-r-o, more than 11 years ago

Can someone explain what this line means and why he uses it as a heuristic?
<pre><code>key=lambda c: 3*len(matches(c, uncovered)) - len(c)</code></pre>
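One way to read that key function: each candidate component `c` earns 3 points for every still-uncovered winner it matches and loses 1 point per character, so `max` prefers short components that cover many winners. The sketch below assumes `matches(regex, strings)` returns the subset of strings in which the regex is found; that helper is reconstructed here for illustration rather than quoted from the article, and `score` and the sample strings are made up.

```python
import re


def matches(regex, strings):
    """Assumed helper: the subset of strings in which regex is found."""
    return {s for s in strings if re.search(regex, s)}


def score(c, uncovered):
    """The quoted heuristic: 3 points per matched string, minus the component length."""
    return 3 * len(matches(c, uncovered)) - len(c)


uncovered = {"abc", "abd", "xyz"}
for c in ["a", "ab", "xyz"]:
    print(repr(c), score(c, uncovered))
```

Here "a" and "ab" both match two strings, but "a" scores higher because it is shorter, while "xyz" matches only one string and pays for three characters.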

joyofpi, more than 11 years ago

I think it fails for: findregex(set(['abc']), set(['abcd']))
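The difficulty with this input can be shown with plain `re.search`, independent of the article's code: every unanchored substring of "abc" also occurs in "abcd", so only an anchored pattern can tell the two apart. Whether `findregex` actually fails therefore depends on whether its candidate pool contains anchored components, which this sketch does not assume either way. The helper name `separates` is hypothetical.

```python
import re


def separates(regex, winner, loser):
    """True if regex is found in winner but not in loser."""
    return bool(re.search(regex, winner)) and not re.search(regex, loser)


winner, loser = "abc", "abcd"

# Every contiguous substring of the winner, tried as a literal, unanchored pattern.
substrings = {winner[i:j]
              for i in range(len(winner))
              for j in range(i + 1, len(winner) + 1)}
unanchored_ok = [s for s in substrings if separates(re.escape(s), winner, loser)]

# The whole winner wrapped in start and end anchors.
anchored_ok = separates("^" + re.escape(winner) + "$", winner, loser)

print(unanchored_ok, anchored_ok)
```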

donniezazen, more than 11 years ago

Is Python Peter Norvig's preferred language (along with Lisp, I suppose)?

fwenzel, more than 11 years ago

I am not sure why Norvig omits President Obama. That said, "[mtg]a" does match him, so at least Munroe tries.

gwern, more than 11 years ago

It's too bad he didn't try to tackle the optimal regexp problem and settled for approximations. It may be an NP-hard problem, but all the example solutions are short enough that the instances might still be tractable. Would've been nice to know for sure.

josephlord, more than 11 years ago

If you just want to play regex golf, this site appeared before Christmas and there was quite a discussion [1], although there are a few more levels now: <a href="http://regex.alf.nu/" rel="nofollow">http://regex.alf.nu/</a>

I'm still not happy with my 214 on Alphabetical including one false match (I was 202 or something with everything correctly matched).

[1] <a href="http://news.ycombinator.com/item?id=6941231" rel="nofollow">http://news.ycombinator.com/item?id=6941231</a>

j2kun, more than 11 years ago

What tool does Norvig use to create this JSON file? Does IPython have this as a feature (somehow allowing formatted text)?

shdon, more than 11 years ago

With the given set,
<pre><code>/M | [TN]|B/</code></pre>
is suboptimal; it could be
<pre><code>/ [TMN]|B/</code></pre>
But that (and the article) leaves out the subtitle for Star Trek 1: "The Motion Picture". For that, Randall's original expression works.
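The claim above can be checked with a few lines of Python. The title lists here are an assumption: they are typed in from the films' well-known subtitles in upper case, not copied from the article's data, and the helper name `verify` is hypothetical.

```python
import re

# Assumed data: upper-cased subtitles, not taken verbatim from the article.
STAR_WARS = {"THE PHANTOM MENACE", "ATTACK OF THE CLONES", "REVENGE OF THE SITH",
             "A NEW HOPE", "THE EMPIRE STRIKES BACK", "RETURN OF THE JEDI"}
STAR_TREK = {"THE WRATH OF KHAN", "THE SEARCH FOR SPOCK", "THE VOYAGE HOME",
             "THE FINAL FRONTIER", "THE UNDISCOVERED COUNTRY", "GENERATIONS",
             "FIRST CONTACT", "INSURRECTION", "NEMESIS"}


def verify(regex, winners, losers):
    """True if regex matches every winner and no loser."""
    return (all(re.search(regex, w) for w in winners)
            and not any(re.search(regex, l) for l in losers))


randall = "M | [TN]|B"
shorter = " [TMN]|B"
with_first_film = STAR_TREK | {"THE MOTION PICTURE"}

print(verify(randall, STAR_WARS, STAR_TREK), verify(shorter, STAR_WARS, STAR_TREK))
print(verify(randall, STAR_WARS, with_first_film), verify(shorter, STAR_WARS, with_first_film))
```

Under these assumptions both patterns separate the two lists, but once "THE MOTION PICTURE" is added as a loser the shorter pattern matches it through " M", whereas the original only matches "M" followed by a space.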

sushirain, more than 11 years ago

What would be a use for finding a minimal discriminating regex? Perhaps understanding the difference between boys' and girls' names?

thewarrior, more than 11 years ago

Could this be used as an alternative to a Bloom filter?

LambdaAlmighty, more than 11 years ago
