Why you should not parse (X)HTML with a Regexp

13 points by superted about 14 years ago

4 comments

obtino about 14 years ago
The technical explanation for this is given in comment 3 of the page and sums it up perfectly:

"I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up."

More info: http://en.wikipedia.org/wiki/Chomsky_hierarchy
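
To make the quoted argument concrete, here is a minimal sketch in Python, assuming a made-up input string and a hypothetical `DepthTracker` class (neither comes from the linked page). It shows a naive regular expression pairing an opening tag with the wrong closing tag once elements nest, while a parser that keeps a depth counter handles the same input. It uses only the standard library's `re` and `html.parser` modules.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input with one <div> nested inside another.
HTML = "<div>outer <div>inner</div> tail</div>"

# Naive regex: the non-greedy match stops at the FIRST closing tag,
# so the outer <div> is paired with the inner </div>.
naive = re.search(r"<div>(.*?)</div>", HTML).group(1)
print(naive)  # outer <div>inner


class DepthTracker(HTMLParser):
    """Illustrative parser that counts how deeply <div> elements nest."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # current <div> nesting level
        self.max_depth = 0       # deepest level seen so far
        self.text_by_depth = []  # (depth, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        self.text_by_depth.append((self.depth, data))


tracker = DepthTracker()
tracker.feed(HTML)
tracker.close()
print(tracker.max_depth)      # 2
print(tracker.text_by_depth)  # [(1, 'outer '), (2, 'inner'), (1, ' tail')]
```

A greedy pattern happens to work on this particular string but then fails on sibling elements, which is the "some will claim success and others will find the fault" situation the quote describes. The counter is the piece of state a classical regular expression has no way to keep.
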
iwwr about 14 years ago
Even Jon Skeet cannot parse HTML using regular expressions.
wvl about 14 years ago
And the previous discussion:

http://news.ycombinator.com/item?id=1487695
d_r about 14 years ago
Fortunately, BeautifulSoup saves the day for HTML parsing tasks.

(http://www.crummy.com/software/BeautifulSoup/)