The section of the WHATWG HTML spec about parsing XHTML begins with this note:<p><i>> An XML parser, for the purposes of this specification, is a construct that follows the rules given in XML to map a string of bytes or characters into a Document object.</i><p><i>> Note: At the time of writing, no such rules actually exist.</i><p>What do the authors of HTML mean by this? Isn't there a spec for XML? There is -- here's what it has to say about comments (<a href="https://www.w3.org/TR/xml/#sec-comments" rel="nofollow">https://www.w3.org/TR/xml/#sec-comments</a>):<p><pre><code> Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
</code></pre>
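To make concrete what that production does and does not give you, here is a minimal sketch that approximates it with a regular expression. The names `XML_COMMENT` and `is_xml_comment` are mine, not the XML spec's, and I simplify `Char` to "any character", which is looser than the real production. The grammar answers one question only: is this string a comment or not?

```python
import re

# Rough translation of:
#   Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Assumption: "Char" is simplified to "any character" in this sketch.
XML_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_xml_comment(text: str) -> bool:
    """Return True if the whole string matches the comment production."""
    return XML_COMMENT.fullmatch(text) is not None


print(is_xml_comment("<!-- a - b -->"))   # a single dash is allowed
print(is_xml_comment("<!-- a -- b -->"))  # a double dash inside is not
```

A matcher like this can accept or reject an input, but it contains no step-by-step instructions for what a parser should do character by character.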
The HTML spec, on the other hand, writes out the tokenizer state machine explicitly. There are ten states involved in parsing comments; here's one (<a href="https://html.spec.whatwg.org/multipage/parsing.html#comment-state" rel="nofollow">https://html.spec.whatwg.org/multipage/parsing.html#comment-...</a>):<p><pre><code> 12.2.5.45 Comment state
Consume the next input character:
U+003C LESS-THAN SIGN (<)
Append the current input character to
the comment token's data. Switch to
the comment less-than sign state.
U+002D HYPHEN-MINUS (-)
Switch to the comment end dash state.
U+0000 NULL
This is an unexpected-null-character
parse error. Append a
U+FFFD REPLACEMENT CHARACTER
character to the comment token's data.
EOF
This is an eof-in-comment parse error.
Emit the comment token. Emit an end-
of-file token.
Anything else
Append the current input character to
the comment token's data.
</code></pre>
The spec defines what to do for every input character, including characters that should never appear in valid HTML. As a result, any two conforming HTML parsers should produce the same output for the same input, malformed or not.<p>You can see the success of this approach on the real web: inconsistent HTML parsing between browsers is no longer the issue it was 15 years ago. It may be more work to write, but I wish HTML's precise, step-by-step format were more common. Writing a spec as a list of rules makes it easier to implement (as a first pass, you can just go line by line and translate it to code) and reduces the chance of inconsistencies like the one in the article (and their associated security implications).
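<p>As a rough illustration of that line-by-line translation, here is a sketch of the Comment state quoted above in Python. The `Tokenizer` class, its field names, the `EOF` sentinel, and the tuple shapes for emitted tokens are all assumptions of mine, not anything the spec defines, and only this one state is handled.

```python
from dataclasses import dataclass, field
from typing import Optional

EOF = None  # hypothetical sentinel for end of input


@dataclass
class Tokenizer:
    """Hypothetical, minimal tokenizer state for this sketch."""
    state: str = "comment"
    comment_data: str = ""
    errors: list = field(default_factory=list)
    emitted: list = field(default_factory=list)


def comment_state(tok: Tokenizer, ch: Optional[str]) -> None:
    """Handle one input character while in the Comment state."""
    if ch == "<":
        # Append the character, then switch state.
        tok.comment_data += ch
        tok.state = "comment less-than sign"
    elif ch == "-":
        tok.state = "comment end dash"
    elif ch == "\x00":
        # Parse error: record it and substitute U+FFFD.
        tok.errors.append("unexpected-null-character")
        tok.comment_data += "\ufffd"
    elif ch is EOF:
        # Parse error: emit the comment, then an end-of-file token.
        tok.errors.append("eof-in-comment")
        tok.emitted.append(("comment", tok.comment_data))
        tok.emitted.append(("eof",))
    else:
        # Anything else is appended unchanged.
        tok.comment_data += ch
```

Each branch corresponds to one entry in the spec text, which is why this first pass is so mechanical. A full tokenizer would need a function like this for every state, plus a loop that dispatches on `tok.state`.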