HTML/XML Parsing with Node & jQuery

41 points by mjijackson, over 13 years ago

6 comments

pshc, over 13 years ago
I was scraping with jQuery for a while but it felt like an awful lot of overhead. In the case of simpler scraping tasks that happen a lot I've actually gone back to nuts and bolts with HTML5[1]'s tokenizer and a custom state machine that only accumulates the data I want. At no time is any DOM node actually created in memory, let alone the entire DOM tree. It means I feel safer running many of these in parallel on a VPS. It also means I can write a nice streaming API where you start emitting data the moment you get enough input. Buffering input just feels wrong in node.js.

But jQuery is a great scraper if your transformation is complex and non-streamable.

[1] https://github.com/aredridel/html5
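A minimal sketch of the streaming approach described above. The commenter uses the tokenizer from the html5 package linked in [1]; since that module's exact API isn't shown here, this sketch substitutes the htmlparser2 package, which offers the same SAX-style, no-DOM token callbacks:

    // Sketch only: SAX-style callbacks plus a tiny state machine; no DOM tree is built.
    var htmlparser2 = require('htmlparser2');

    var inHeading = false;   // state: are we currently inside an <h2>?
    var headings = [];       // accumulate only the data we want

    var parser = new htmlparser2.Parser({
      onopentag: function (name) {
        if (name === 'h2') inHeading = true;
      },
      ontext: function (text) {
        if (inHeading) headings.push(text);
      },
      onclosetag: function (name) {
        if (name === 'h2') inHeading = false;
      }
    });

    // Feed chunks as they arrive from the socket; results can be emitted as soon
    // as each heading closes, so nothing needs to be buffered.
    parser.write('<h2>First headline</h2><p>body text</p><h2>Second headline</h2>');
    parser.end();

    console.log(headings); // -> [ 'First headline', 'Second headline' ]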
ricardobeat, over 13 years ago
    doc.find('h2:gt(0)').before('<hr />')
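For context, a rough sketch of how a call like that might be run server-side; the setup below assumes the jsdom and jquery npm packages, which the comment itself doesn't specify:

    // Sketch only: jsdom supplies the DOM, and the jquery package is bound to its window.
    var JSDOM = require('jsdom').JSDOM;
    var dom = new JSDOM('<h2>One</h2><h2>Two</h2><h2>Three</h2>');
    var $ = require('jquery')(dom.window);

    // Insert an <hr /> before every <h2> except the first.
    var doc = $(dom.window.document);
    doc.find('h2:gt(0)').before('<hr />');

    console.log(dom.serialize());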
peteretep, over 13 years ago
Actually, I'm doing this for my SUPER SECRET startup at the moment. Originally the front-end would just send the back-end the whole HTML of a user's page when they executed the browser plugin, and the back-end would intercept it and knock it up in Perl.

Wasn't sure how well that was going to scale, and was worried people would get weird about sending the entire contents of the page they're on. I have a 90% working solution now where it's all done in-browser, with a bunch of classes I've been working on, plus a node.js set of testing tools.
bialecki, over 13 years ago
One of my biggest pet peeves with crawling the web is using XPath. Not because I have strong feelings about XPath, just that I use CSS selector syntax so much that it's a pain I can't leverage that knowledge in this domain as well. Something like this is really awesome and is going to make crawling the web more accessible.
Comment #3236146 not loaded
Comment #3235816 not loaded
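The kind of CSS-selector-based crawling asked for in the comment above is available in node; a small sketch, assuming the cheerio and request packages (neither is named in the thread):

    // Sketch only: cheerio gives jQuery-like CSS selectors over static HTML, no XPath.
    var request = require('request');
    var cheerio = require('cheerio');

    request('https://example.com/', function (err, res, body) {
      if (err) throw err;
      var $ = cheerio.load(body);
      // The same selector knowledge you'd use in the browser.
      $('a').each(function () {
        console.log($(this).text(), '->', $(this).attr('href'));
      });
    });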
orc, over 13 years ago
Wow, I was just thinking this morning how awesome it would be to make a desktop app that could crawl websites with jQuery. And since node.js has a Windows installer, it sounds like a much better solution than the C# HtmlAgilityPack I've been using.
Comment #3237205 not loaded
slashclee, over 13 years ago
Apparently node.js doesn't implement the DOMParser object, which means that you can't actually use jQuery's parseXML method. That's a bummer :(
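One common workaround, not mentioned in the thread, is to pull in a standalone DOMParser implementation such as the xmldom package and skip jQuery's parseXML entirely; a sketch assuming that package:

    // Sketch only: xmldom supplies the DOMParser that node lacks natively.
    var DOMParser = require('xmldom').DOMParser;

    var xml = '<feed><item id="1">hello</item></feed>';
    var doc = new DOMParser().parseFromString(xml, 'text/xml');

    console.log(doc.getElementsByTagName('item').item(0).getAttribute('id')); // -> "1"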