Why you should not parse (X)HTML with a Regexp

13 points by superted, about 14 years ago

4 comments

obtino, about 14 years ago
The technical explanation for this is given in comment 3 of the page and sums it up perfectly:

"I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up."

More info: http://en.wikipedia.org/wiki/Chomsky_hierarchy
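
To make the quoted argument concrete, here is a minimal Python sketch of where a regex goes wrong on nested markup. The sample string and the `DivDepth` class are made up for illustration, and the standard-library `html.parser` is my choice here, not something named in the thread. The regex stops at the first closing tag because it has no way to count nesting, while the parser tracks depth with a counter.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: one <div> nested inside another.
HTML = "<div>outer <div>inner</div> tail</div>"

# Naive non-greedy regex: it ends at the FIRST </div>, so the captured
# "content" of the outer div is cut short and still contains a raw tag.
naive = re.search(r"<div>(.*?)</div>", HTML).group(1)


class DivDepth(HTMLParser):
    """Collects text inside <div> elements while counting nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # current <div> nesting level
        self.max_depth = 0    # deepest level seen so far
        self.text = []        # text chunks found inside any <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth >= 1:
            self.text.append(data)


parser = DivDepth()
parser.feed(HTML)
parser.close()
parsed = "".join(parser.text)

print(repr(naive))       # the truncated regex capture
print(repr(parsed))      # the full text seen by the parser
print(parser.max_depth)  # how deep the nesting went
```

A greedy `.*` would happen to work on this one string, but it would then swallow everything between two sibling `<div>` elements, so neither variant generalizes to arbitrary nesting.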
iwwr, about 14 years ago
Even Jon Skeet cannot parse HTML using regular expressions.
wvl, about 14 years ago
And the previous discussion:

http://news.ycombinator.com/item?id=1487695
d_r, about 14 years ago
Fortunately, BeautifulSoup saves the day for HTML parsing tasks.

(http://www.crummy.com/software/BeautifulSoup/)