TechEcho

The technical explanation is given in comment 3 of the page, which sums it up perfectly: "I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up." More info: http://en.wikipedia.org/wiki/Chomsky_hierarchy
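To make the Type 2 vs. Type 3 point concrete, here is a minimal sketch (not from the linked page) of what goes wrong with nesting. The pattern `DIV_PATTERN` and the helper `outer_div_content` are hypothetical names made up for illustration, and the sample strings are toy input with bare `<div>` tags only.

```python
import re

# Hypothetical pattern: capture the contents of a <div>, non-greedily.
DIV_PATTERN = re.compile(r"<div>(.*?)</div>", re.DOTALL)

flat = "<div>hello</div>"
nested = "<div>outer <div>inner</div> tail</div>"

# On flat input the pattern looks like it works.
flat_match = DIV_PATTERN.search(flat).group(1)

# On nested input it stops at the first closing tag, because a plain
# regular expression has no way to count how many <div>s are still open.
nested_match = DIV_PATTERN.search(nested).group(1)


def outer_div_content(text):
    """Return the contents of the first top-level <div>, tracking depth.

    The depth counter is the extra memory a regular expression lacks.
    Assumption: tags are exactly "<div>" and "</div>" (no attributes,
    comments, or malformed markup), so this is not a real HTML parser.
    """
    depth = 0
    start = None
    for tok in re.finditer(r"</?div>", text):
        if tok.group() == "<div>":
            if depth == 0:
                start = tok.end()
            depth += 1
        else:
            depth -= 1
            if depth == 0 and start is not None:
                return text[start:tok.start()]
    return None


print(flat_match)                 # hello
print(nested_match)               # outer <div>inner
print(outer_div_content(nested))  # outer <div>inner</div> tail
```

The counter version only survives this toy input. Real HTML adds attributes, comments, unclosed tags and so on, which is why a proper parser is the usual answer.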

Even Jon Skeet cannot parse HTML using regular expressions.

And the previous discussion: http://news.ycombinator.com/item?id=1487695

Fortunately, BeautifulSoup saves the day for HTML parsing tasks. (http://www.crummy.com/software/BeautifulSoup/)
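For a rough idea of what the parser-based approach looks like, here is a small sketch that pulls link targets out of a snippet. To keep it dependency-free it uses Python's standard-library `html.parser` rather than BeautifulSoup itself, so treat it as an illustration of the idea and not of BeautifulSoup's API. `LinkCollector`, `extract_links` and the sample markup are made-up names and data.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample: mixed case, single quotes, and an anchor without href.
sample = (
    '<p>See <a href="http://example.com/a" rel="nofollow">A</a> and '
    "<A HREF='http://example.com/b'>B</a>, <a name=x>no href</a></p>"
)

print(extract_links(sample))
```

The parser deals with quoting styles and tag case on its own, which is exactly the kind of detail a hand-written regexp tends to get wrong.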

Why you should not parse (X)HTML with a Regexp

4 comments
