TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Python code to solve xkcd 1313 by Peter Norvig

447 点作者 weslly超过 11 年前

20 条评论

temuze超过 11 年前
This is a great article. It&#x27;s pretty fun to play around with this heuristic:<p><pre><code> lambda c: 3*len(matches(c, uncovered)) - len(c) </code></pre> Here&#x27;s a trivial way to explore it: say we generalize the heuristic to H(a, b).<p><pre><code> H(a,b) = lambda c: a*len(matches(c, uncovered)) - b*len(c) </code></pre> The original heuristic is considered H(3,1) by this definition. Then we can play around with a and b to see if we&#x27;d get smaller results.<p><pre><code> def findregex_lambda(winners, losers, a, b): &quot;Find a regex that matches all winners but no losers (sets of strings).&quot; # Make a pool of candidate components, then pick from them to cover winners. # On each iteration, add the best component to &#x27;cover&#x27;; finally disjoin them together. pool = candidate_components(winners, losers) cover = [] while winners: best = max(pool, key=lambda c: a*len(matches(c, winners)) - b*len(c)) cover.append(best) pool.remove(best) winners = winners - matches(best, winners) return &#x27;|&#x27;.join(cover) &gt;&gt;&gt; findregex_lambda(starwars, startrek, 3, 1) &#x27; T|E.P| N&#x27; &gt;&gt;&gt; findregex_lambda(starwars, startrek, 3, 2) &#x27; T|B| N| M&#x27; </code></pre> Or, to automate this:<p><pre><code> def best_H_heuristic(winners, losers): d = {(a,b) : len(findregex_lambda(winners, losers, a,b)) for a in range(0,4) for b in range(0,4)} return min(d, key=d.get) &gt;&gt;&gt; best_H_heuristic(starwars, startrek): (3,1) </code></pre> Looks like H(3,1) is pretty good for this case. What about the nfl teams?<p><pre><code> &gt;&gt;&gt; best_H_heuristic(nfl_in, nfl_out) (3, 2) &gt;&gt;&gt; findregex_lambda(nfl_in, nfl_out, 3, 1) &#x27;pa|g..s|4|fs|sa|se|lt|os&#x27; &gt;&gt;&gt; findregex_lambda(nfl_in, nfl_out, 3, 2) &#x27;pa|ch|4|e.g|sa|se|lt|os&#x27; </code></pre> Not the best heuristic there. H(3,1) wins or ties for the boys&#x2F;girls set, left&#x2F;right set and drugs&#x2F;cities set, which just goes to show you that picking a heuristic off a gut guess isn&#x27;t such a bad approach.<p>You could also explore heuristics of different forms:<p><pre><code> M(a,b,d,e) = lambda c: a*len(matches(c, uncovered))^b - d*len(c)^e </code></pre> Or trying completely different formats:<p><pre><code> L(a,b) = lambda c: a*log(len(matches(c, uncovered))) - b*len(c)</code></pre>
throwaway_yy2Di超过 11 年前
I don&#x27;t know why Randall&#x27;s regex incorrectly (?) matches &quot;Fremont&quot;, but it&#x27;s worth noting Wikipedia&#x27;s primary spelling has an accent aigu &quot;Frémont&quot;:<p><a href="https://en.wikipedia.org/wiki/John_C._Frémont" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;John_C._Frémont</a>
评论 #7015556 未加载
j2kun超过 11 年前
One thing not mentioned in this article:<p>1. The greedy algorithm has an O(log(n)) approximation ratio, meaning it produces a regex guaranteed to use a number of terms within a multiplicative O(log(n)) factor of the optimal regex.<p>2. Unless P != NP, set cover cannot be approximated better than the greedy algorithm. In other words, the only general solutions you&#x27;ll find (unless you&#x27;re using some special insight about how regular expressions cover sets of strings) will be no better than a constant factor improvement in produced regex size than the greedy algorithm.<p>That being said, regexes (esp disjunctions of small regexes) are not arbitrary sets. So this problem is a subset of set cover, and certainly may have efficient exact solutions.
blt超过 11 年前
I love Norvig&#x27;s Python posts. He really gets the spirit of the language and has fun with it.
评论 #7016089 未加载
评论 #7016420 未加载
ddebernardy超过 11 年前
This was posted a few days ago on Code Golf:<p><a href="http://codegolf.stackexchange.com/questions/17718/meta-regex-golf" rel="nofollow">http:&#x2F;&#x2F;codegolf.stackexchange.com&#x2F;questions&#x2F;17718&#x2F;meta-regex...</a><p>That link includes a perl 10-liner to do the same.
评论 #7016176 未加载
评论 #7015394 未加载
haberman超过 11 年前
I thought it was going to be meta-meta-regex golf, and couldn&#x27;t imagine how that would be possible. But meta-regex golf is an interesting exercise, and is far more tractable. :)
评论 #7015761 未加载
firegrind超过 11 年前
When I read &#x27;subtitles&#x27;, i wondered about the .srt files of the movies.
评论 #7016663 未加载
tlarkworthy超过 11 年前
Exercise for the reader, write a regex to distinguish random noise from English<p>EDIT: possibly down-voted because someone though it was sarcastic???<p>I was actually thinking of this problem before the XKCD comic, for detecting hashes on hardrives efficiently...
评论 #7016821 未加载
评论 #7021546 未加载
a3_nm超过 11 年前
Interestingly, finding a minimal-size regexp satisfying a set of positive and negative examples (words that should match, and should not match) is NP-hard. Here is a nice discussion: <a href="http://cstheory.blogoverflow.com/2011/08/on-learning-regular-languages/" rel="nofollow">http:&#x2F;&#x2F;cstheory.blogoverflow.com&#x2F;2011&#x2F;08&#x2F;on-learning-regular...</a>
z-e-r-o超过 11 年前
Can someone explain what does this line mean and why does he use it as heuristic?<p><pre><code> key=lambda c: 3*len(matches(c, uncovered)) - len(c)</code></pre>
评论 #7016002 未加载
评论 #7016058 未加载
joyofpi超过 11 年前
I think it fails for: findregex(set([&#x27;abc&#x27;]), set([&#x27;abcd&#x27;]))
评论 #7016255 未加载
donniezazen超过 11 年前
Is Python Peter Norvig&#x27;s preferred language (along with Lisp, I suppose)?
评论 #7017852 未加载
评论 #7017074 未加载
fwenzel超过 11 年前
I am not sure why Norvig omits president Obama. That said, &quot;[mtg]a&quot; does match him, so at least Munroe tries.
评论 #7016092 未加载
评论 #7018306 未加载
评论 #7015700 未加载
gwern超过 11 年前
It&#x27;s too bad he didn&#x27;t try to tackle the optimal regexp problem and settled for approximations - it may be a NP-hard problem, but all the example solutions are short enough that the instances might be still tractable. Would&#x27;ve been nice to know for sure.
评论 #7018350 未加载
评论 #7018206 未加载
评论 #7022327 未加载
josephlord超过 11 年前
If you just want to play regex golf this site appeared before Christmas and there was quite a discussion [1] although there are a few more levels now: <a href="http://regex.alf.nu/" rel="nofollow">http:&#x2F;&#x2F;regex.alf.nu&#x2F;</a><p>I&#x27;m still not happy with my 214 on Alphabetical including one false match (I was 202 or something with everything correctly matched).<p>[1] <a href="http://news.ycombinator.com/item?id=6941231" rel="nofollow">http:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=6941231</a>
j2kun超过 11 年前
What tool does Norvig use to create this json file? Does iPython have this as a feature (somehow allowing formatted text)?
评论 #7018498 未加载
shdon超过 11 年前
With the given set,<p><pre><code> &#x2F;M | [TN]|B&#x2F; </code></pre> is suboptimal, but could be<p><pre><code> &#x2F; [TMN]|B&#x2F; </code></pre> But that (and the article) leaves out the subtitle for Star Trek 1: &quot;The Motion Picture&quot;. For that, Randall&#x27;s original expression works.
sushirain超过 11 年前
What would be a use for finding a minimal discriminating regex? Perhaps understanding the difference between boys&#x27; and girls&#x27; names?
评论 #7016810 未加载
评论 #7017024 未加载
评论 #7016664 未加载
thewarrior超过 11 年前
Could this be used as an alternative to a bloom filter ?
评论 #7017811 未加载
LambdaAlmighty超过 11 年前
&lt;meta-comment on meta-golf&gt;<p>Judging by the amount of fawning here, this guy must be a HN celebrity. Interesting post nevertheless!<p>I can only hope, one day, I&#x27;d be writing and publishing joyful little hacks like this, to such general applause, instead of eking out a living. I have to say I&#x27;m a bit envious here!<p>Well done to the dude. An inspirational post, in many ways.