
Use Node.js to Extract Data from the Web

80 points · by johnrobinsn · almost 12 years ago

16 comments

STRML · almost 12 years ago

Don't forget streams, the more `node.js` way to parse HTML:

```javascript
var trumpet = require('trumpet');
var request = require('request');

var tr = trumpet();

// Matched elements stream out of createReadStream;
// the page itself gets piped into the trumpet instance.
tr.createReadStream('article > span').pipe(process.stdout);
request.get('http://www.echojs.com').pipe(tr);
```

That's it! See https://github.com/substack/node-trumpet and their tests for more.
zenocon · almost 12 years ago

I've done a considerable amount of scraping; if you're poking around at nicely designed web pages, node/cheerio will be nice, but if you need to scrape data out of a DOM mess with quirks and iframes within iframes and forms buried 6 posts deep (inside iframes with quirks), I'd use PhantomJS + CasperJS. Having a real browser sometimes makes a difference.
nodesocket · almost 12 years ago

Have you played around with node.io? https://github.com/chriso/node.io

It encapsulates all this functionality in an easy-to-use interface.
nostrademons · over 11 years ago

There are also Node.js bindings for Gumbo if folks want HTML5 compliance:

https://github.com/karlwestin/node-gumbo-parser

It might be interesting if someone were to implement a Cheerio-like API on top of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.
aroman · almost 12 years ago

Cheerio is really, really awesome. I've used it to build a considerably sophisticated web-scraping backend to wrap my school's homework website and re-expose/augment it via node/mongo/backbone/websockets.

There are definitely some bugs in cheerio if you're looking to do some really fancy selector queries, but for the most part it's extremely performant and pleasant to use.

If anyone is interested in seeing what a sophisticated, parallelized usage of cheerio looks like, feel free to browse through the app I mentioned above -- it's open source: https://github.com/aroman/keeba/blob/master/jbha.coffee
victorhooi · over 11 years ago

Hmm, interesting.

I'm also looking at doing a web-scraping project with Node.js.

I was going to go with CasperJS (http://casperjs.org/), which seems fairly active and is based on PhantomJS.

Their quickstart guide is actually creating a scraper:

http://docs.casperjs.org/en/latest/quickstart.html

However, I'm wondering how this (Cheerio) compares - anybody have any experiences?
premasagar · over 11 years ago

See also http://noodlejs.com for a Node-based web scraper that also handles JSON and other file formats.

It was initially built as a hack project to replace a core subset of YQL. (I helped to guide an intern at my company Dharmafly, Aaron Acerboni, when he built it.)
dfrodriguez143 · over 11 years ago

I like to use the Readability API so I don't need to see the HTML of every single site. I did an example here: http://danielfrg.github.io/blog/2013/08/20/relevant-content-blog-crawler/
chatman · almost 12 years ago

Isn't scrapy easier to use than this?
mholt · almost 12 years ago

This is cool... if the content is structured. (Ever tried finding addresses in arbitrary text? Much harder: http://smartystreets.com/products/liveaddress-api/extract)
greenido · over 11 years ago

Similar to what I wrote a week ago: http://greenido.wordpress.com/2013/08/21/yahoo-finance-api-with-nodejs/ :)
tommoor · almost 12 years ago

I run an API that can help with this type of thing where the page includes microformats (a surprising amount do): http://pagemunch.com
shospes · over 11 years ago

We also used cheerio and Node.js and built a click & extract interface around it: http://www.site2mobile.com/.
level09 · almost 12 years ago

Here is how I like to do it:

```python
from pyquery import PyQuery as pq
doc = pq('http://google.com')
print doc('#hplogo')
```
tectonic · over 11 years ago

Remember to use SelectorGadget (http://selectorgadget.com) to help generate your CSS selectors.
zerni · almost 12 years ago

Nice!

I did a webcrawler with node.js myself last year. It's only a quick try, but you can find the worker class here: https://gist.github.com/zerni/6337067

Unfortunately jsdom had a memory leak, so the crawler died after a while...