Pro scraping with Node.JS

60 points by chrisohara over 14 years ago

4 comments

simonw over 14 years ago
I was intrigued to see what CSS selector engine it was using...

https://github.com/chriso/node.io uses https://github.com/harryf/node-soupselect

https://github.com/harryf/node-soupselect is a port of my https://github.com/simonw/soupselect library for Python

https://github.com/simonw/soupselect is a port of my getElementsBySelector function for JavaScript: http://simonwillison.net/2003/Mar/25/getElementsBySelector/

I'm always surprised to see that code still being used - it's the least complete selector library out there by a long way.
marcusramberg over 14 years ago
http://mojolicio.us is way better for this kind of stuff. Here's the synopsis example redone using Mojo:

    $ perl -Mojo -e'g("reddit.com")->dom("a.title")->each(sub { warn shift->text })'
thibaut_barrere over 14 years ago
Really interesting, thanks! This will probably be the first thing I use for real projects in node.js.

Does anyone know how it compares to, say, Nokogiri or Hpricot, both in terms of speed and in terms of ability to handle crappy HTML?
chrisohara over 14 years ago
This is in response to all the node/jsdom/jquery scraping posts that are popular lately. JSDom is hopeless for scraping - try parsing some slightly malformed HTML.
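
To make "slightly malformed HTML" concrete, here is a rough sketch of the same `a.title` extraction used in the Mojo one-liner earlier in the thread. It is written in Python with only the standard library's `html.parser`, so it says nothing about how node.io, node-soupselect, or JSDom behave internally. The sample markup, the `TitleLinkCollector` class, and the `extract_titles` helper are all hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TitleLinkCollector(HTMLParser):
    """Collect the text of every <a> element whose class list contains 'title'."""

    def __init__(self):
        super().__init__()
        self.titles = []      # finished link texts
        self._buffer = None   # text pieces of the matching <a> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Assumption: a new <a> implicitly ends a previous unclosed one.
        self._flush()
        classes = (dict(attrs).get("class") or "").split()
        if "title" in classes:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        # Let the parser emit any pending text, then finish an unclosed link.
        super().close()
        self._flush()

    def _flush(self):
        if self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.titles.append(text)
            self._buffer = None


def extract_titles(html):
    """Return the text of all a.title links found in the given markup."""
    parser = TitleLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical sample: an unquoted attribute, an unclosed <a>, and a missing </ul>.
MALFORMED = """
<ul>
  <li><a class=title href="/one">First story
  <li><a class="title external" href="/two">Second story</a>
  <li><a class="other" href="/three">Not a title</a>
"""

print(extract_titles(MALFORMED))  # ['First story', 'Second story']
```

The event-based parser never builds a strict tree, which is why the unclosed link and the missing `</ul>` do not stop it. A real selector engine does far more than match one tag and one class, so treat this only as a picture of the failure mode being discussed.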