
Use Node.js to Extract Data from the Web

80 points · by johnrobinsn · almost 12 years ago

16 comments

STRML · almost 12 years ago

Don't forget streams, the more `node.js` way to parse HTML:

```javascript
var trumpet = require('trumpet');
var request = require('request');

var tr = trumpet();

// Matched elements stream out of createReadStream;
// the page itself gets piped into the trumpet instance.
tr.createReadStream('article > span').pipe(process.stdout);
request.get('http://www.echojs.com').pipe(tr);
```

That's it! See https://github.com/substack/node-trumpet and their tests for more.
zenocon · almost 12 years ago

I've done a considerable amount of scraping; if you're poking around at nicely designed web pages, node/cheerio will be nice, but if you need to scrape data out of a DOM mess with quirks and iframes within iframes and forms buried 6 posts deep (inside iframes with quirks), I'd use PhantomJS + CasperJS. Having a real browser sometimes makes a difference.
nodesocket · almost 12 years ago

Have you played around with node.io? https://github.com/chriso/node.io

It encapsulates all this functionality in an easy-to-use interface.
nostrademons · over 11 years ago

There are also Node.js bindings for Gumbo if folks want HTML5 compliance:

https://github.com/karlwestin/node-gumbo-parser

It might be interesting if someone were to implement a Cheerio-like API on top of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.
aroman · almost 12 years ago

Cheerio is really, really awesome. I've used it to build a considerably sophisticated web-scraping backend to wrap my school's homework website and re-expose/augment it via node/mongo/backbone/websockets.

There are definitely some bugs in cheerio if you're looking to do some really fancy selector queries, but for the most part it's extremely performant and pleasant to use.

If anyone is interested in seeing what a sophisticated, parallelized usage of cheerio looks like, feel free to browse through the app I mentioned above -- it's open source: https://github.com/aroman/keeba/blob/master/jbha.coffee
victorhooi · over 11 years ago

Hmm, interesting.

I'm also looking at doing a web-scraping project with Node.js.

I was going to go with CasperJS (http://casperjs.org/), which seems fairly active and is based on PhantomJS.

Their quickstart guide is actually creating a scraper:

http://docs.casperjs.org/en/latest/quickstart.html

However, I'm wondering how this (Cheerio) compares - anybody have any experiences?
premasagar · over 11 years ago

See also http://noodlejs.com for a Node-based web scraper that also handles JSON and other file formats.

It was initially built as a hack project to replace a core subset of YQL. (I helped to guide an intern at my company Dharmafly, Aaron Acerboni, when he built it.)
dfrodriguez143 · over 11 years ago

I like to use the Readability API so I don't need to see the HTML of every single site. I did an example here: http://danielfrg.github.io/blog/2013/08/20/relevant-content-blog-crawler/
chatman · almost 12 years ago

Isn't scrapy easier to use than this?
mholt · almost 12 years ago

This is cool... if the content is structured. (Ever tried finding addresses in arbitrary text? Much harder: http://smartystreets.com/products/liveaddress-api/extract)
greenido · over 11 years ago

Similar to what I wrote a week ago: http://greenido.wordpress.com/2013/08/21/yahoo-finance-api-with-nodejs/ :)
tommoor · almost 12 years ago

I run an API that can help with this type of thing where the page includes microformats (a surprising amount do): http://pagemunch.com
shospes · over 11 years ago

We also used cheerio and Node.js and built a click & extract interface around it: http://www.site2mobile.com/.
level09 · almost 12 years ago

Here is how I like to do it:

```python
from pyquery import PyQuery as pq
doc = pq('http://google.com')
print doc('#hplogo')
```
tectonic · over 11 years ago

Remember to use SelectorGadget (http://selectorgadget.com) to help generate your CSS selectors.
zerni · almost 12 years ago

Nice!

I did a webcrawler with node.js myself last year. It's only a quick try, but you can find the worker class here: https://gist.github.com/zerni/6337067

Unfortunately jsdom had a memory leak, so the crawler died after a while...