Oh Yes You Can Use Regexes to Parse HTML

38 points by luuuzeta, almost 2 years ago

11 comments

BiteCode_dev, almost 2 years ago
So he is using a full-blown parser, but some part of the tokenisation is done with regexes.

I call BS.

Also, I'm pretty sure it will miss some nesting of "<" somewhere (in an attribute, CDATA, JS, etc.) that is not a tag but will confuse the parser.

I have used regexes to parse HTML; it works fine for quick and dirty scripts that need a small chunk of data from a limited sample of pages. Which I believe is the message he is trying to convey.

But I'd rather keep the legend of the infamous SO post against parsing HTML because:

- it will help the people that need it the most to avoid making mistakes
- it's fun, and part of our culture.
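A minimal sketch of the failure mode described in this comment, assuming a hypothetical one-line sample whose attribute value contains "<" and ">" characters that are not tag delimiters. The naive pattern and the helper names are illustrative only; the comparison uses Python's standard-library `html.parser`, not the article's code.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: the title attribute contains "<" and ">" that are not tag delimiters.
SAMPLE = '<a title="x < y > z" href="/x">link</a>'

# Naive tag pattern: "<", then anything that is not ">", then ">".
NAIVE_TAG = re.compile(r"<[^>]*>")


def naive_tags(markup):
    """Return every substring the naive pattern believes is a tag."""
    return NAIVE_TAG.findall(markup)


class StartTagCollector(HTMLParser):
    """Collect (tag, attributes) pairs for each start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def parsed_start_tags(markup):
    """Return the start tags found by the standard-library HTML parser."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(naive_tags(SAMPLE))         # the first "tag" is cut off inside the attribute value
    print(parsed_start_tags(SAMPLE))  # one <a> tag with both attributes intact
```

The naive pattern stops at the first ">" it sees, so it truncates the opening tag in the middle of the quoted value, while the parser tracks quoting state and keeps the attribute whole.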
Name_Chawps, almost 2 years ago
This "uses" regexes to parse HTML in the same way that Sunny D is "made with" 100% orange juice.
egberts1, almost 2 years ago
I know of no regex pattern that can handle all the old and new HTML as well as HTML5. Believe me: as someone who is looking to put an HTML parser on an FPGA/ASIC for higher speed, I've gone down this rabbit hole a few times in the fruitless pursuit of this elusive pure-regex pattern for HTML et al. The problem is regex's lack of support for multiple state machines and the interactions needed between those state machines.

The language Perl came closest to the smallest HTML parser.

Before doing simplistic regex matching on HTML, multiple regex passes are probably required, probably in this order (my 20-year-old memory is failing here):

- de-CDATA
- de-pairing of quotes
- de-symbolization of HTML symbols, entities, and codes (de-escaping)
- lone unterminated </> (i.e. <p>)

Only then can you even attempt pairing <XXX> with </XXX> and getting to the HTML tags and attributes.

In short, additional scripting is required to apply multiple regex patterns before one can even get into properly parsing the HTML.

The simplest I've gotten uses both bash logic and regex, but it fails on certain HTML codes.

Federico Tomassetti, a well-renowned expert on domain-specific languages and transpilers, covers nearly all the valid libraries in many modern languages for just the parsing of HTML.

Federico makes it easier for a first-time HTML parser coder to take that first step: selecting an HTML parser library.

https://tomassetti.me/parsing-html/
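The multi-pass idea in this comment can be sketched roughly as follows. This is an illustration under assumptions, not the commenter's actual recipe: the pass order, the patterns, the sample string, and every function name are hypothetical, and entity de-escaping is moved to the end so that a decoded "<" is not mistaken for a tag.

```python
import html
import re

# Hypothetical input: a quoted ">" in an attribute, an escaped "<" in text, and a CDATA section.
SAMPLE = '<p title="a > b">1 &lt; 2</p><![CDATA[<p>ignored</p>]]>'

CDATA_SECTION = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)
QUOTED_VALUE = re.compile(r'"[^"]*"')
TAG = re.compile(r"<[^>]*>")


def drop_cdata(text):
    """Pass 1: remove CDATA sections so their contents are never seen as tags."""
    return CDATA_SECTION.sub("", text)


def blank_quoted_values(text):
    """Pass 2: empty out double-quoted values so a quoted ">" cannot end a tag early.

    Assumption: double quotes only appear around attribute values. Text such as
    'she said "hi"' would be damaged, which is exactly the fragility of this approach.
    """
    return QUOTED_VALUE.sub('""', text)


def strip_tags(text):
    """Pass 3: remove anything that now looks like a tag."""
    return TAG.sub("", text)


def text_content(markup):
    """Chain the passes, then de-escape entities in what is left."""
    for regex_pass in (drop_cdata, blank_quoted_values, strip_tags):
        markup = regex_pass(markup)
    return html.unescape(markup)


if __name__ == "__main__":
    print(text_content(SAMPLE))
```

Each pass exists only to protect the next one from input it cannot handle, which is the "additional scripting" the comment refers to; none of it deals with unterminated tags or tag pairing.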
jerf, almost 2 years ago
First, there's the obvious problem of failing to distinguish between "parsing" and merely "tokenizing". The latter was generally possible. In fact, IIRC, the famous Zalgo rant (linked in the post), while fun and true in a sense, is actually posted to a bad question for it, as the question asked is actually perfectly solvable by regular expressions, even conventional ones without backwards matches or any other fancy PCRE additions.

However, I'm not even sure that you can tokenize HTML with regular expressions any longer, because one of the most important aspects of HTML5 was to formalize a strict definition of how to sloppily parse HTML. Yes, that may sound like a contradiction, but it isn't; check the sentence again. It formalized what the browsers were already doing and harmonized how to handle the broken HTML that people actually produce. As one might expect from something that is the harmonization of the decade+ accumulation of the heuristics developed by at least three major streams of browsers (more depending on how you count), it is not exactly simple.

I guess I can't guarantee you couldn't embed all this into a regular expression: https://html.spec.whatwg.org/multipage/parsing.html#parse-state but the result would not be worth it. Use a standard HTML parser.

Now, obviously, I'm taking a strict view of the term "HTML" in this case. Regular expressions can certainly be used to extract things from documents that you choose to view as a particular approximation of HTML. I've done it before and I'll probably do it again. But when I do, I'm not actually envisioning myself as "parsing HTML". What I'm doing is parsing a byte stream that happens to be HTML; I'm just hacking around and getting something that works for the exact format this particular document happens to be in, which is a highly, *highly* restricted subset of HTML, especially since I probably only care about a very small part of it. But it's also an unspecified subset of HTML and may change without warning at any time, and I need to deal with that.

If I care about a lot of it, I find myself an HTML parser and an XPath implementation. If you do this a lot, it's worth learning, as it's very, very powerful and faster to develop with than regexes once you know what you're doing. If it's anything beyond the most trivial thing, I preferentially reach for this now that I've learned it. But there is a non-trivial learning curve to it. If you're just grabbing a particular price out of a page once, by all means use regexes.
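A minimal sketch of the two approaches contrasted in this comment, assuming a hypothetical, well-formed page snippet. To stay within the standard library it uses `xml.etree.ElementTree` and its limited XPath subset, which only works because the sample is well-formed; real-world HTML would need an actual HTML parser with an XPath implementation. All names and the markup are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page; assumed well-formed so the XML parser accepts it.
PAGE = """<html><body>
<div class="item"><span class="name">Widget</span><span class="price">$19.99</span></div>
<div class="item"><span class="name">Gadget</span><span class="price">$5.00</span></div>
</body></html>"""


def first_price_regex(page):
    """Quick hack: grab the first price, tied to this exact markup."""
    match = re.search(r'<span class="price">\$([0-9.]+)</span>', page)
    return match.group(1) if match else None


def prices_by_name(page):
    """Tree-based extraction: pair each item's name with its price via XPath-style queries."""
    root = ET.fromstring(page)
    result = {}
    for item in root.findall(".//div[@class='item']"):
        name = item.findtext("span[@class='name']")
        price = item.findtext("span[@class='price']")
        result[name] = price
    return result


if __name__ == "__main__":
    print(first_price_regex(PAGE))
    print(prices_by_name(PAGE))
```

The regex version breaks as soon as the page adds an attribute or reorders one; the tree version keeps working as long as the structure it queries is still there, at the cost of learning the query language.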
gigel82, almost 2 years ago
That's not really parsing HTML. Well, I guess it is, technically speaking, parsing it, but most people understand "parsing HTML" to mean building a tree (DOM), and that's not what those regex programs do.
Tainnor, almost 2 years ago
HTML is not regular, so it can't be recognised by a "theoretical" regular expression such as the ones introduced in a theoretical CS class. Modern regex engines, however, are more powerful and can recognise non-regular languages too.

Then there's a distinction to be made between recognising a language and parsing it.

This article goes into more detail: https://www.npopov.com/2012/06/15/The-true-power-of-regular-expressions.html
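A small illustration of the point about modern engines, using a backreference in Python's standard `re` module. The language chosen here (some number of "a", one "b", then the same number of "a") is a textbook non-regular language picked for the example; it is not taken from the linked article, and the function name is hypothetical.

```python
import re

# Strings of the form a^n b a^n: the backreference \1 forces both runs of "a" to be equal.
# No finite automaton can recognise this language, since it would have to count n.
MIRRORED = re.compile(r"(a*)b\1")


def is_mirrored(text):
    """Recognise a^n b a^n. This only answers yes or no; it builds no parse tree."""
    return MIRRORED.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("b", "aba", "aabaa", "aaba", "abaa"):
        print(candidate, is_mirrored(candidate))
```

The function returns a boolean and nothing more, which is the recognising-versus-parsing distinction drawn above: accepting a string is a weaker result than producing its structure.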
jove_, almost 2 years ago
As everyone has pointed out, this does not count. Note that the claim that regex can't parse HTML is specific and proven. What it means is that you can't write an expression that matches an opening tag together with its corresponding closing tag at arbitrary depth. There's no way to handle nested tags within a single regex; it's only possible to write a regex that matches up to a finite nesting limit.
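The finite-nesting limit can be sketched as follows, assuming a toy language made only of `<div>` and `</div>` tags with no text or attributes. The helper names are hypothetical. The first function builds a backreference-free regex for a fixed maximum depth; the second uses a regex only to tokenize and keeps an explicit depth counter, which is state a plain regular expression does not have.

```python
import re


def nested_div_pattern(max_depth):
    """Build a regex for balanced <div>...</div> sequences nested at most max_depth deep."""
    pattern = ""
    for _ in range(max_depth):
        # Each round wraps the previous pattern in one more level of <div>...</div>.
        pattern = f"(?:<div>{pattern}</div>)*"
    return re.compile(pattern)


DIV_TOKEN = re.compile(r"</?div>")


def is_balanced(markup):
    """Check balance at any depth: the regex tokenizes, the counter tracks nesting."""
    depth = 0
    position = 0
    for token in DIV_TOKEN.finditer(markup):
        if token.start() != position:
            return False  # something other than a div tag sits between tokens
        position = token.end()
        depth += -1 if token.group().startswith("</") else 1
        if depth < 0:
            return False  # a closing tag with no matching opening tag
    return position == len(markup) and depth == 0


if __name__ == "__main__":
    two_deep = "<div><div></div></div>"
    three_deep = "<div>" * 3 + "</div>" * 3
    limit_two = nested_div_pattern(2)
    print(bool(limit_two.fullmatch(two_deep)))    # within the built-in limit
    print(bool(limit_two.fullmatch(three_deep)))  # one level too deep for this regex
    print(is_balanced("<div>" * 50 + "</div>" * 50))
```

Raising `max_depth` only moves the limit; the pattern grows with every level and there is always an input one level deeper that it rejects.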
valbaca, almost 2 years ago
"You cannot make an alcoholic drink with water."

OP: "Oh Yes You Can Use Water to Make A Hard Drink. AH! But if I freeze water and pour in whiskey, I've *used* water to make an alcoholic drink."

-.-
wantguns, almost 2 years ago
See also: https://stackoverflow.com/a/1732454
wodenokoto, almost 2 years ago
The default engine in Beautiful Soup is/was a "regex engine". Just saying.
warrenm, almost 2 years ago
*Can?*

Yep

*Should?*

Most likely .. no :)