Parsing HTML Using Regular Expressions

22 points by yammesicka almost 8 years ago

6 comments

jlhawn almost 8 years ago
Maybe I'm misunderstanding the question, but it sounds like the question is not asking how to parse HTML with a regex, but how to match HTML open tags specifically.

While you obviously can't match arbitrary HTML with a regex (because arbitrary levels of nested elements require a stack-based parser), can you not match HTML tags with a regex? It seems to me that it should be possible, since you always have the pattern '<' followed by the name of the tag, followed by zero or more "key=quoted-val" attributes, and finally a '>' token.

So, if the question is limited to just how to parse a single open token, then it seems like all of the answers have just decided to echo what they've heard in the past, which is "don't use regular expressions to parse HTML", when the truth is that a real HTML lexer/parser does use regular expressions for creating these "open" and "close" element tokens for the parser.
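The pattern described in this comment ('<', a tag name, zero or more key=quoted-val attributes, then '>') can be sketched in Python roughly as follows. This is only an illustration under simplifying assumptions: the names `OPEN_TAG`, `ATTR_RE` and `match_open_tag` are made up for this example, attribute values must be quoted, and unquoted values, valueless attributes and self-closing tags are deliberately not handled.

```python
import re

# Building blocks (all assumptions for this sketch, not a full HTML grammar).
NAME = r"[A-Za-z][A-Za-z0-9]*"
ATTR_NAME = r"[A-Za-z_:][-\w:.]*"
ATTR_VALUE = r"""(?:"[^"]*"|'[^']*')"""  # quoted values only

# '<' name, zero or more key=quoted-val attributes, '>'
OPEN_TAG = re.compile(
    rf"<({NAME})((?:\s+{ATTR_NAME}\s*=\s*{ATTR_VALUE})*)\s*>"
)
# Used to pull the individual attributes out of the matched attribute run.
ATTR_RE = re.compile(rf"({ATTR_NAME})\s*=\s*({ATTR_VALUE})")


def match_open_tag(text):
    """Return (tag_name, attrs) if text is exactly one open tag, else None."""
    m = OPEN_TAG.fullmatch(text)
    if m is None:
        return None
    attrs = {}
    for a in ATTR_RE.finditer(m.group(2)):
        attrs[a.group(1)] = a.group(2)[1:-1]  # strip the surrounding quotes
    return m.group(1), attrs


print(match_open_tag('<a href="x.html" class=\'big\'>'))
print(match_open_tag('<a title="a > b">'))  # '>' inside a quoted value is fine
print(match_open_tag('</a>'))               # a close tag is not an open tag
```

Nothing here tracks nesting, which is the part a regex cannot do; it only recognizes one token at a time.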
Tloewald almost 8 years ago
This is a fun (and classic) thread and it's worth reading the pro and con arguments.

It really falls under the old joke "you have a problem and you decide to solve it with regex, now you have two problems". HTML is very gnarly, and regex is very gnarly. Doesn't mean you can't get shit done if you're aware of the pitfalls.
bryanrasmussen almost 8 years ago
You know, when you first read that you think - damn straight you can't parse HTML with regex, but as it goes on the idea gets strangely enticing. I mean maybe, with the correct rituals, and a gun and a willingness to fight with ancient evils, you could maybe parse some HTML with regex. A sort of Lovecraft/Action flick.
BrandoElFollito almost 8 years ago
I am a moderately active user of SE (~25k of flair) and I find the contrast between the regular channel (say, Stack Overflow) and the Meta one (SO Meta) horrifying.

The SO Meta community is such a bunch of bullies that I now hardly go there (even though I recently found two bugs which I did not bother to post). In contrast, the regular channels are pragmatically helpful (pragmatically, because you still need to make some sacrificial offerings to the gods, called "what effort have you put in the question", and suffer some psychotic downvoters). It is interesting to see that both populations are composed of the same individuals, who seem to have a personality flip when switching channels.

I would be interested someday to learn about the dynamics of such groups. There are plenty of places on the Internet populated by mentally deranged participants (cowards hiding behind the Internet) but the SE Meta ones are, I believe, more educated / intelligent on average and, sometimes, more traceable.
dukoid almost 8 years ago
It should be possible to *tokenize* HTML with regular expressions, and that's all he seems to be asking for...
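To make the tokenize-versus-parse distinction concrete, here is a rough Python sketch of a regex-based tokenizer. It is an assumption-laden toy, not how any real browser or library does it: `TOKEN` and `tokenize` are hypothetical names, and it ignores doctype, CDATA and the special handling real tokenizers need for the contents of elements like script and style.

```python
import re

# One alternative per token kind; order matters (comments before tags,
# plain text and stray '<' characters last).
TOKEN = re.compile(
    r"""(?P<comment><!--.*?-->)"""
    r"""|(?P<close></[A-Za-z][A-Za-z0-9]*\s*>)"""
    r"""|(?P<open><[A-Za-z][A-Za-z0-9]*(?:\s+[^\s"'<>/=]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'<>=`]+))?)*\s*/?>)"""
    r"""|(?P<text>[^<]+|<)""",
    re.DOTALL,
)


def tokenize(html):
    """Split html into (kind, lexeme) pairs without checking any nesting."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(html)]


for kind, lexeme in tokenize('<p class="x">Hi <b>there</b></p>'):
    print(kind, repr(lexeme))

# Mismatched nesting still tokenizes; deciding that it is wrong is the
# parser's job, not the tokenizer's.
print([kind for kind, _ in tokenize("<b><i>x</b>")])
```

The token stream is lossless (joining the lexemes gives back the input), which is about all a lexer promises; whether the tags balance is left to whatever consumes the tokens.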
krallja almost 8 years ago
(2009)