Ask HN: A fast, Rust HTML parser that works?

2 points by thatxliner, about 2 years ago

So I'm doing some web scraping in Rust, which means I need to parse HTML. [scraper](https://docs.rs/scraper/latest/scraper/) (which uses [html5ever](https://github.com/servo/html5ever)) works fine, except that it's the bottleneck of my application.

So I need a faster parser. I've tried [tl](https://docs.rs/tl/latest/tl/), which would've been perfect except that *it doesn't actually work* on the HTML I have. When I try to `query_selector` the elements I need, it returns nothing.

[Kuchiki](https://docs.rs/kuchiki/latest/kuchiki/) is abandoned.

I couldn't figure out how to get [lol-html](https://github.com/cloudflare/lol-html) to work for me (it's designed for *rewriting HTML*, whatever that means). It doesn't seem to have an API to extract the inner text of an element.

[html5gum](https://github.com/untitaker/html5gum) seems to be just an HTML tokenizer, or otherwise just too low-level. I *have not* yet tried [quick-xml](https://github.com/tafia/quick-xml/) but, judging from the README, it's pretty low-level too. If these are the only options left, I will try them. Otherwise, I would love to use a parser that's faster but as ergonomic as `scraper` or `tl`.

At this point, I would be happy with an lxml bridge/port of some sort. I don't need to mutate HTML, just parse and read data from it.
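To make "too low-level" concrete: with a tokenizer-style parser, extracting inner text usually means walking the token stream yourself and tracking element depth. The sketch below illustrates that bookkeeping using only the standard library. The `Token` enum and the `inner_texts` helper are hypothetical names made up for this example; they are not the actual types or API of html5gum, quick-xml, or lol-html, and a real tokenizer would also have to deal with attributes, void elements, and malformed markup.

```rust
// Hypothetical token type for illustration only; NOT the API of any real crate.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Start(String), // opening tag, e.g. <div>
    End(String),   // closing tag, e.g. </div>
    Text(String),  // character data between tags
}

/// Collects the inner text of every outermost element named `tag`.
/// Text inside nested child elements is included; text outside is skipped.
fn inner_texts(tokens: &[Token], tag: &str) -> Vec<String> {
    let mut results = Vec::new();
    let mut depth = 0usize; // how many `tag` elements are currently open
    let mut current = String::new();

    for token in tokens {
        match token {
            Token::Start(name) if name == tag => depth += 1,
            Token::End(name) if name == tag && depth > 0 => {
                depth -= 1;
                if depth == 0 {
                    // Outermost element closed: emit the accumulated text.
                    results.push(std::mem::take(&mut current));
                }
            }
            Token::Text(text) if depth > 0 => current.push_str(text),
            _ => {} // other tags, or text outside the target element
        }
    }
    results
}

fn main() {
    // Token stream for: <div>Hello, <b>world</b></div>ignored
    let tokens = vec![
        Token::Start("div".into()),
        Token::Text("Hello, ".into()),
        Token::Start("b".into()),
        Token::Text("world".into()),
        Token::End("b".into()),
        Token::End("div".into()),
        Token::Text("ignored".into()),
    ];
    for text in inner_texts(&tokens, "div") {
        println!("{text}");
    }
}
```

Running this prints `Hello, world`. Selector-based libraries such as `scraper` do this kind of tracking for you, which is the ergonomics gap the post describes.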

1 comment

necovek, about 2 years ago

lxml is a wrapper around libxml2.

If you are after libxml2 performance, you can always make use of https://docs.rs/libxml/latest/libxml/