Solving the regex of madness, and snarky answers on StackOverflow (2019)

117 points by monort about 4 years ago

21 comments

inopinatus about 4 years ago

> The question is about finding opening tags in XHTML using a regular expression

Bzzzt, wrong, sorry! The question is about finding open tags in the presence of XHTML self-closing tags. That difference alone places these interpretations gulfs apart. But there’s more: it does not specify that the input document is even XHTML, only that XHTML-style self-closing elements may be present. In fact the original question was barely minutes old and tagged merely “regex” when that famous answer was written in 2009; the question was not tagged with “xhtml” until 2012, and not by the original author either.

Revealingly, then, if we review the broader context (i.e. the question history) of the original question author, it’s clear that yes indeed they were trying to fix a malformed document, and in particular to normalise it into XHTML, with focus on fixing up any so-called “dangling tags”. For this task, the suggestion of “use a parser” is indeed sound advice.

The real moral here is: don’t be a jerk about the precise semantics of a question, look at what the person needs, and help them ask better questions.

Otherwise, you’re just gonna discover that there’s always a bigger jerk, and they’re on Stack Overflow, moderating your stuff.


shmageggy about 4 years ago

It's not explicitly stated, but I believe the author's point is that the original question didn't require a recursive solution (because it's only asking about individual tags, not matching opening tags with their closing partners).

Edit: yes, looking at the answers, someone pointed this out in a comment responding to the "Chomsky" answer:

> The OP is asking to parse a very limited subset of XHTML: start tags. What makes (X)HTML a CFG is its potential to have elements between the start and end tags of other elements (as in a grammar rule A -> s A e). (X)HTML does not have this property within a start tag: a start tag cannot contain other start tags. The subset that the OP is trying to parse is not a CFG. – LarsH Mar 2 '12 at 8:43
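The point in the quoted comment (a start tag cannot contain another start tag, so no recursion is needed) can be sketched in a few lines of Python. This is a simplified illustration, not the regex from the article: the pattern `START_TAG` and the helper `open_tags` are hypothetical names, the tag-name rule is restricted to ASCII letters and digits, and comments, CDATA and script content are ignored.

```python
import re

# Hypothetical, simplified pattern: "<", a tag name, then any mix of quoted
# strings and other characters, then ">". Nothing here is recursive, because
# a start tag cannot contain another start tag.
START_TAG = re.compile(
    r"""<(?P<tag>[A-Za-z][A-Za-z0-9]*)    # tag name
        (?:"[^"]*"|'[^']*'|[^'">/])*      # attributes; quoted values may hold ">"
        >                                 # a "/" before this would mean self-closing
    """,
    re.VERBOSE,
)


def open_tags(markup: str) -> list[str]:
    """Return the names of start tags, skipping end tags and self-closing tags."""
    return [m.group("tag") for m in START_TAG.finditer(markup)]


print(open_tags('<p><a href="x>y">link</a><br /></p>'))
```

Whether such a pattern is good enough depends on the input; the objections raised elsewhere in the thread (script content, browser error recovery, whitespace rules) still apply.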


NtrllyIntrstd about 4 years ago

The author seems to be missing the point, in my opinion. While it is certainly true that one can often solve simple, seemingly innocent sub-problems within more general languages, the transition from "I see I can solve this simple problem with regexes!" to "Then I can probably solve this other, almost identical problem as well!" is subtle (almost imperceptible to a novice), and that is when the problem explodes right in your face. It would be a more robust solution to go for the right tool (i.e. an (X)HTML parser), as well as a good learning example.

On a side note: regular expressions cannot, by definition, parse recursive languages. A regular expression matcher that does is not a regular expression parser but an ugly duckling in the family of context-free grammar matchers. People should learn when and how to use those.


dataflow about 4 years ago

Try that regex on

<pre><code> < script> console.log("<script2>"); </script> </code></pre>

Edit 1: I'm unsure if the inner <script2> is valid (X)HTML, so it might not be an issue of being unable to parse correct (X)HTML, but rather an issue of being unable to detect invalid (X)HTML. (Can someone verify?)

Edit 2: It seems Chrome chokes on the space... does anyone know if the initial space is valid? I'm pretty sure I've seen parsers that accept it...


chubot about 4 years ago

This conversation would be a lot clearer with a distinction between "regexes" and "regular languages". The former is what Perl/Python/etc. have, and the latter is a mathematical concept (and automata-based non-backtracking engines like RE2, re2c, and rust/regex are closer to this set-based definition).

https://www.oilshell.org/blog/2020/07/eggex-theory.html

With those definitions, this part of the snarky answer is wrong:

> HTML is not a regular language and hence cannot be parsed by regular expressions

That is, regular expressions as found in the wild can parse more than regular languages. (And that does happen to be useful in the HTML case!)

This answer is also irrelevant, since the poster is asking for a solution with regexes, NOT regular languages:

> I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar).

In this post, the example given IS a regex, but it IS NOT a regular language:

<pre><code> <!-- .*? --> # comment </code></pre>

The nongreedy match of .*? isn't a mathematical construct; it implies a backtracking engine.

I gave my analysis here and listed 3 or 4 caveats: https://news.ycombinator.com/item?id=26359556

I prefer to use regular languages and an explicit stack, but this is not really what the original question was asking.
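The gap between "regexes" and "regular languages" is easy to see in any backtracking engine. A minimal sketch with Python's `re` module (the names `SAME_COUNT` and `same_count` are made up for this illustration): the language a^n b a^n is not regular, yet a backreference recognises it.

```python
import re

# a^n b a^n is not a regular language: recognising it means remembering n.
# The backreference \1 does exactly that, so this pattern is a "regex" in the
# Perl/Python sense but not a regular expression in the formal sense.
SAME_COUNT = re.compile(r"(a+)b\1")


def same_count(text: str) -> bool:
    """True if text is some number of a's, one b, then the same number of a's."""
    return SAME_COUNT.fullmatch(text) is not None


print(same_count("aaabaaa"), same_count("aaabaa"))
```

Automata-based engines such as RE2 do not support backreferences, which is consistent with them staying closer to the set-based definition.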


jancsika about 4 years ago

I love seeing the weirdo CDATA thingy in there! CDATA ftw!

E.g., you've got this enormous spec for SVG which includes CSS, but that CSS has syntax inside a style tag which could break XHTML parsers.

Amateurs out there are probably thinking, "Well, why not just compromise in the spec and tell implementers to do the same thing that HTML does to parse style tags?" Well, professionals know that cannot work for myriad reasons you can read about if you take out a college loan and remain sedentary for the required duration.

The right approach is to throw the CSS stuff inside CDATA tags to tell the parser not to parse it so things don't break. That is the way sensible, educated professionals solve this problem.

I'm only kidding!

For inline SVGs the HTML5 parser simply says, "Parse this gunk as HTML5, and use sane defaults to interpret the parsed junk in the correct svg namespace so that all the child thingies in that namespace just work."

Which it does.

Unless you're going to grab the innerHTML of the inline SVG and shove it into a file to be used later as an SVG image.

In that case you cross the invisible county line into XHTML territory where the sheriff is waiting to throw you in jail for violating the CDATA rule. In that case the XHTML parser hidden in the guts of the browser doles out the justice of an error in place of your image. Because that is the way sensible, educated professionals solve this problem. :)

My holy grail-- how do I use DOM methods to create a CDATA element to shove my style into? If I could know this then I can jump my Dodge Charger back and forth into XHTML without ever getting caught.


lifthrasiir about 4 years ago

The only part of this writing I agree with is that you don't need to be snarky to be correct. (I'd like to introduce the XY problem of the second kind, where the answerer is so confident that it is the answerer who has missed the actual question.)

Some regexes can recognize languages beyond the regular languages. They are typically available in two flavors: recursive references (Perl, Ruby, PCRE) and stackable captures (.NET). They are obscure enough that I would not recommend them, but it is patently false that regular expressions (EDIT: of practical interest) cannot be recursive.

It is possible to match individual HTML tags with regexes, but it is difficult. It cannot use a bare `\w` or `\s` because both XML/XHTML and HTML5 parsers have peculiar definitions for tag name characters and space characters. For example your `\s` will typically match various Unicode space characters, while only ASCII whitespace is recognized in tags. There are also several notable exceptions in the parser (and external states termed the "tree construction"), so missing any of them would result in an immediate XSS. If you think you can write a correct regex for HTML tags, my quizzes [1] should make you concerned. Limiting the question to XHTML does alleviate some but not all concerns.

The distinction between recognition and parsing is correct, but parsing doesn't necessarily mean the reconstruction of a parse tree. Parsing means access to the constituent nonterminals, which can be used to reconstruct a parse tree but can also be used directly on their own (e.g. calculators). Indeed in most regex implementations you can't extract two or more strings out of each capture (Raku is a notable exception), so you can match against e.g. `(\w+)(?:,(\w+))*` but can't extract a list of comma-separated words with it. Practically speaking this means you can't extract a list of attributes with a single regex, making it unsuitable for HTML parsing anyway.

[1] https://news.ycombinator.com/item?id=26355451
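Two of the claims above are quick to check in Python's `re`, used here only as one example of a common regex implementation. The variable names are illustrative, and `re.ASCII` is shown as a way to narrow `\s`, not as an HTML-correct whitespace class.

```python
import re

# Claim 1: \s is Unicode-aware by default for str patterns in Python 3,
# so it accepts a no-break space that an HTML tokenizer would not treat
# as whitespace inside a tag.
NBSP = "\u00a0"
unicode_ws = re.fullmatch(r"\s", NBSP) is not None           # True
ascii_ws = re.fullmatch(r"\s", NBSP, re.ASCII) is not None   # False

# Claim 2: a repeated capture group keeps only its last repetition,
# so the middle word "b" is matched but cannot be extracted.
match = re.fullmatch(r"(\w+)(?:,(\w+))*", "a,b,c")
groups = match.groups()                                      # ('a', 'c')

print(unicode_ws, ascii_ws, groups)
```

Getting every word back needs a second step, such as `re.findall(r"\w+", text)` or `str.split(",")`, which is the practical limitation the comment describes for attribute lists.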


jll29 about 4 years ago

> I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar).

Note that regarding formal language and complexity theory, while it is correct that in general, arbitrary nested structures require a context free grammar (type 2 in the Chomsky hierarchy) and are thus beyond regular (type 3) [1], this statement is NOT true _if_ you limit the nesting depth with a finite constant k.

For example, if you agree to an HTML tag maximum nesting depth of, say, 100, then it can be modeled with a regular (type 3) grammar, including correct required matching of opening and closing tags, and hence you can write a regular expression that matches it as well.

This debate is well-documented in the theoretical linguistics literature, where some say human languages are not regular because you can always embed yet another additional relative clause in any sentence in principle without adversely affecting grammaticality, whereas others say while you could you won't find natural examples in human-written text documents where extreme nesting depth is actually found. At that point psycholinguists and theoretical linguists usually start a fight about whether memory limits are important or "just performance as opposed to competence".

(Goes to show how practical solid theory is.)

[1] https://www.sciencedirect.com/science/article/pii/S0019995859903626/pdf?md5=9d466f851651bd592afa5ee561b7a0b0&pid=1-s2.0-S0019995859903626-main.pdf
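The bounded-depth argument can be made concrete with a small sketch. It uses balanced parentheses as a stand-in for matched open/close tags (an assumption made to keep the pattern readable), and the function name `bounded_nesting` is hypothetical. The generated pattern uses only grouping, concatenation and `*`, so it describes a regular language for any fixed depth.

```python
import re


def bounded_nesting(depth: int) -> re.Pattern:
    """Build a pattern for balanced parentheses nested at most `depth` deep."""
    pattern = ""  # depth 0: only the empty string is balanced
    for _ in range(depth):
        # One more level: any number of "(" + shallower content + ")" blocks.
        pattern = r"(?:\(" + pattern + r"\))*"
    return re.compile(pattern)


print(bool(bounded_nesting(3).fullmatch("((()))()")))  # within the depth limit
print(bool(bounded_nesting(2).fullmatch("((()))")))    # exceeds the depth limit
```

For real tag names the same construction would have to be repeated per element name at every level, so the pattern grows with both the depth bound and the vocabulary; it is finite, but not something one would want to maintain by hand.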

nooyurrsdey about 4 years ago

This is one of those things that people will debate about endlessly and ultimately it feels so silly.

The poster asked how to do it, and this person provided a practical regex to cover most (if not all) cases.

Everything else is just pedantic debate.


adament about 4 years ago

Does the proposed regular expression really handle embedded script content correctly? From my limited understanding of HTML, pretty much only </script> counts as closing the script contents and everything else is treated as part of the script.
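One way to explore the script question is to compare a deliberately naive pattern with a real tokenizer. The sketch below uses Python's standard `html.parser.HTMLParser`, which treats the content of `script` elements as raw text; `NAIVE_START_TAG`, `StartTagCollector` and the sample string are made up for this illustration, and the naive pattern is not the regex from the article.

```python
import re
from html.parser import HTMLParser

# Deliberately naive pattern (an assumption for this demo).
NAIVE_START_TAG = re.compile(r"<(\w+)[^>]*>")


class StartTagCollector(HTMLParser):
    """Records the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parser_start_tags(markup):
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def naive_start_tags(markup):
    return NAIVE_START_TAG.findall(markup)


SAMPLE = '<script>console.log("<script2>");</script>'
print(parser_start_tags(SAMPLE))  # only the real element
print(naive_start_tags(SAMPLE))   # also reports the text inside the script
```

The difference comes from the parser switching modes after `<script>`, which is state that a single tag-matching pattern does not carry.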


cyberdelica about 4 years ago

It goes to show how few people are able to think for themselves.

The question originally asked is "how to match HTML tags". Not how to parse. Not how to scrape. Simply "match". To which I would say, regex is perfectly suited to the task.

Furthermore, if one simply needs to scrape content, regex is, again, perfectly suited. Scraping is not parsing, and has no real need for a full-blown DOM parsing library.

Cargo cult parrots like to say that if the HTML content changes, then one's regex will fail. Well, so will one's DOM parser.

motoboi about 4 years ago

The problem is: you cannot parse malformed (real, everyday) HTML with regexes.

But if you need to parse HTML (even malformed) generated by the same template (like a scraping situation), the whole file becomes regular, which can be parsed by a regular expression.

But if you try to parse HTML in general, too bad, because then you’ll need to take HTML into consideration and will need a recursive descent parser, not a regex.

This question popped up so many times in forums in the 2000s that people got mad at that.

math-dev about 4 years ago

Great article (assuming the solution provided works).

I do a lot of parsing in my projects; I find natural text-based input vital for power users who don’t always want to point and click.

What are some good parsing algorithms, theoretical articles etc. to help me become more professional in the parsing tools I write?


03b17999-4268 about 4 years ago

I haven't read the rest of the thread, but the article is even wronger than the glib replies there. HTML needn't be well formed. There are ad hoc rules which let major browsers parse broken HTML. If you do not follow the spec to the letter you will have your tooling break on input that every browser thinks is acceptable.

Which is why you always use whatever HTML parsing library comes with your language. There is no simple answer in the thread because there is no simple answer in the real world.

That said, anyone who says:

> It is quite possible and not even that difficult:

<pre><code>
( # match all tags in XHTML but capture only opening tags
  <!-- .*? --> # comment
| <!\[CDATA\[ .*? \]\]> # CData section
| <!DOCTYPE ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* >
| <\? .*? \?> # xml declaration or processing instruction
| < \w+ ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* /> # self-closing tag
| < (?<tag> \w+ ) ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* > # opening tag - captured
| </ \w+ \s* > # end tag
)
</code></pre>

should be seriously mentored by someone.


asddubs about 4 years ago

Problem is that it doesn't reflect how browsers parse things. If you were using this in a security context, e.g., here's an example it won't detect (granted this is not technically valid, but does it matter?):

<div "> put arbitrary html here as you please (using single quotes for attributes)<div ">

SavantIdiot about 4 years ago

Oh I've seen this many times in different forms. Especially with regexes.

You know what this is a great example of? A case where hacking makes a mess, and thinking before coding solves the problem.

The madness comes from using the wrong tool for the problem. Yes, you can hack a regex to parse XHTML, and this might be "good enough", but it is more robust, cleaner and easier to explain if you use a lexical tokenizer and a grammar.

The lure is an illusion that comes from an initial effort assessment, where the effort to hack a quick-and-dirty regex (call this Ehack) vs. an "oh, man, you mean I gotta think about the problem" approach (call this Ethink) appears as "Ehack <<< Ethink." However, it soon evolves to the scenario where "Ehack >>> Ethink," driven by the thought process, "I'm almost there, this regex just needs one more tweak." Aka the gambler's fallacy: it comes into play and the sunk costs are ignored.

TL;DR - Use the right tool for the problem, even if it means a slightly larger up-front effort investment.


amelius about 4 years ago

1. determine size of XHTML input
2. build regex that works up to the size determined in step 1.
3. apply regex

funyunpowder about 4 years ago

based on the stackoverflow thread, and then the comments here, an interesting research paper topic would be 'why do people get so passionate about regex'

AzzieElbab about 4 years ago

Long regexes are the root of all evil


BiteCode_dev about 4 years ago

Weird article that basically says people are wrong, then proves they are right.

tester756 about 4 years ago

uhh?

just because you can doesn't mean you should

just take a look at the proposed regex

> (
> # match all tags in XHTML but capture only opening tags
> <!-- .*? --> # comment
> | <!\[CDATA\[ .*? \]\]> # CData section
> | <!DOCTYPE ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* >
> | <\? .*? \?> # xml declaration or processing instruction
> | < \w+ ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* /> # self-closing tag
> | < (?<tag> \w+ ) ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* > # opening tag - captured
> | </ \w+ \s* > # end tag
> )

it's ugly as hell

> Parsing typically uses (at least) two steps: Tokenization which uses regular expressions to split the input string into a sequence of syntax elements

I don't use regex for tokenization; am I doing something wrong?

But overall I think this is an important post, even though I believe that regex is the best example of "good idea, shitty API".
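For readers wondering what regex-based tokenization looks like in practice, here is a minimal sketch in Python. It tokenizes a toy arithmetic language rather than HTML, and the token names and the `tokenize` helper are invented for this example; hand-written scanners without regexes are just as legitimate.

```python
import re

# One named alternative per token kind; SKIP is dropped from the output.
TOKEN = re.compile(
    r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[-+*/()=])
  | (?P<SKIP>\s+)
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into (kind, value) pairs, failing loudly on unknown input."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("x = 3 + 42"))
```

A separate parsing step would then consume this token list, which is the two-stage split the quoted sentence describes.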