I was intrigued to see what CSS selector engine it was using...<p><a href="https://github.com/chriso/node.io" rel="nofollow">https://github.com/chriso/node.io</a> uses <a href="https://github.com/harryf/node-soupselect" rel="nofollow">https://github.com/harryf/node-soupselect</a><p><a href="https://github.com/harryf/node-soupselect" rel="nofollow">https://github.com/harryf/node-soupselect</a> is a port of my <a href="https://github.com/simonw/soupselect" rel="nofollow">https://github.com/simonw/soupselect</a> library for Python<p><a href="https://github.com/simonw/soupselect" rel="nofollow">https://github.com/simonw/soupselect</a> is a port of my getElementsBySelector function for JavaScript: <a href="http://simonwillison.net/2003/Mar/25/getElementsBySelector/" rel="nofollow">http://simonwillison.net/2003/Mar/25/getElementsBySelector/</a><p>I'm always surprised to see that code still being used - it's the least complete selector library out there by a long way.
<a href="http://mojolicio.us" rel="nofollow">http://mojolicio.us</a> is way better for this kind of stuff. Here's the synopsis example redone using Mojo:<p><pre><code> $ perl -Mojo -e'g("reddit.com")->dom("a.title")->each(sub { warn shift->text })'</code></pre>
Really interesting, thanks! This will probably be the first thing I use for real projects in node.js.<p>Does anyone know how it compares to, say, Nokogiri or Hpricot, both in terms of speed and in its ability to handle crappy HTML?
This is in response to all the node/jsdom/jquery scraping posts that have been popular lately. JSDom is hopeless for scraping: try parsing some slightly malformed HTML.
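To make "slightly malformed HTML" concrete, here is a minimal sketch in Python (not node.js, and not what node.io or JSDom do internally). It assumes a made-up input string with unclosed `<li>` and `<p>` tags, an unquoted attribute, and a stray `</div>`, and uses the standard library's event-based `html.parser`, which reports tags as it sees them instead of requiring a well-formed tree. The `TitleLinkCollector` class, the `extract_titles` helper, and the `MALFORMED` sample are all hypothetical names for illustration.

```python
from html.parser import HTMLParser


class TitleLinkCollector(HTMLParser):
    """Collect the text of every <a> whose class list contains "title".

    No DOM tree is built, so unclosed or stray tags elsewhere in the
    input do not stop the scan.
    """

    def __init__(self):
        super().__init__()
        self.in_title_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "title" in classes:
            self.in_title_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title_link = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self.in_title_link:
            self.titles[-1] += data


# Hypothetical sample: unclosed <li>/<p>, unquoted class value, stray </div>.
MALFORMED = (
    '<ul><li><p><a class=title href="/a">First</a>'
    '<li><a class="title big" href="/b">Second</a></div>'
    '<a href="/c">Skip</a>'
)


def extract_titles(html):
    parser = TitleLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


print(extract_titles(MALFORMED))
```

This only shows that a tolerant, streaming scan can still pull out `a.title` links from sloppy markup; it says nothing about how any of the libraries mentioned in this thread compare in speed or robustness.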