
Scraping Web Pages With jQuery, Node.js and Jsdom

88 points by liamk, about 13 years ago

9 comments

sophacles, about 13 years ago
Somewhat tangential: here is something I have long thought would be a very useful project, but unfortunately haven't had the time to build.

It is a scraping/crawling tool suite. Base it on WebKit, with good scriptable plugin support (not just JS, but expose the DOM to other languages too). It would consist of a few main parts.

1) What I call the Scrape-builder. This is essentially a fancy web browser, but it has a rich UI that can be used to select portions of a web page, expose the appropriate DOM elements, and record how to find those elements in the page. By expose, I mean put into some sort of editor/IDE; it could be raw HTML, or some sort of description language. In the editor, the elements one would want to scrape can then be selected and put into some sort of object for later processing. This can include some form of scripting to mangle the data as needed. It can also include interactions with the JavaScript on the page, recording click macros (well, event firing and such). The point of this component is to allow content experts and novice programmers to easily arrange for the "interesting" data to be selected for scraping.

2) The second component of the suite is a scraping engine. It uses the description + macros + scripts from the Scrape-builder to actually pull data from the pages and turn them into data objects. These objects can then be put on a queue for later processing by backend systems/code. The scraping engine is basically a stripped-down WebKit without the rendering/layout/display bits compiled in. It just builds the DOM and executes the page's JavaScript to ultimately scrape the bits selected. This is driven by the spidering engine.

3) The spidering engine is what determines which pages to point the scraping engine at. It can be fed by external code, or by scripts from the scraping engine as a feedback mechanism (some links on a page may be part of a single scraping; some may just be fodder for a later scraping). It can be thought of as a work queue for the scraping engines.

The main use cases I see for this are specialized search and aggregation engines that want to get at the interesting bits of sites which don't expose a good API, or where the data may be formatted but hard to semantically infer without human intervention. Sure, it wouldn't be as efficient from a code-execution point of view as, say, custom scraping scripts, but it would allow for much faster response times to page changes, and better use of programmer time, by taking care of a lot of boilerplate or almost-boilerplate parts of scraping scenarios.
Comment #3681760 not loaded
Comment #3680685 not loaded
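A minimal sketch of the feedback loop described above, in the article's Node.js + jsdom idiom; the work queue, seed URL, and selectors here are illustrative assumptions, not anything from the proposed suite:

```js
// Hypothetical sketch only: a work queue ("spidering engine") feeding a
// jsdom-based scraper, with discovered links pushed back onto the queue.
var jsdom = require('jsdom');

var queue = ['http://example.com/'];   // seeded by external code
var seen = {};

function spider() {
    var url = queue.shift();
    if (!url) return;                  // queue drained
    seen[url] = true;

    jsdom.env(url, ['http://code.jquery.com/jquery-1.7.1.min.js'],
              function (errors, window) {
        if (!errors) {
            var $ = window.$;

            // "Scraping engine": turn the selected bits into a data object
            // and hand it off (here, just print it).
            console.log(JSON.stringify({ url: url, title: $('title').text() }));

            // Feedback mechanism: links found here become future work items.
            $('a[href^="http"]').each(function () {
                var href = $(this).attr('href');
                if (!seen[href] && queue.indexOf(href) === -1) queue.push(href);
            });
        }
        spider();                      // move on to the next queued page
    });
}

spider();
```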
lancefisher, about 13 years ago
The problem I've had with using jsdom in scraping web pages is that it is not very forgiving about bad HTML. There are so many pages in the wild that have malformed HTML, and jsdom just pukes on it. I started using Apricot [1], which uses HtmlParser [2], which has been better. I'd like to hear what others are using to scrape bad web pages.

[1] https://github.com/silentrob/Apricot

[2] https://github.com/tautologistics/node-htmlparser
Comment #3681478 not loaded
Comment #3681433 not loaded
Comment #3680830 not loaded
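For reference, a minimal sketch of parsing malformed HTML with node-htmlparser, as recalled from that project's README of the era (the DefaultHandler/Parser usage is from that README; treat the details as approximate):

```js
var htmlparser = require('htmlparser');
var util = require('util');

// Deliberately malformed markup: unclosed tags that trip up stricter parsers.
var rawHtml = '<ul><li>first item<li>second item</ul>';

var handler = new htmlparser.DefaultHandler(function (err, dom) {
    if (err) {
        console.error('Parse error:', err);
    } else {
        // dom is a plain JS tree; walk or query it from here.
        console.log(util.inspect(dom, false, null));
    }
});

new htmlparser.Parser(handler).parseComplete(rawHtml);
```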
lopatin, about 13 years ago
Good overview. I've been a fan of node.io for Node scraping for a while. Lots of stuff built in, and you don't lose your jQuery selectors.
Comment #3680077 not loaded
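For context, a sketch of a node.io scraping job in the style of its README examples from this period; the Job signature and element properties are recalled from memory, so treat them as approximate:

```js
var nodeio = require('node.io');

var methods = {
    input: false,                      // single pass; no per-line input
    run: function () {
        this.getHtml('http://news.ycombinator.com/', function (err, $) {
            if (err) this.exit(err);
            var titles = [];
            // jQuery-style selectors survive the move to node.io
            $('td.title a').each(function (a) {
                titles.push(a.text);   // elements expose .text as a property
            });
            this.emit(titles);
        });
    }
};

exports.job = new nodeio.Job({ timeout: 10 }, methods);
```

Saved as hn.js, this would run through node.io's own CLI (node.io hn.js), if memory of the tool serves.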
nchuhoai, about 13 years ago
<a href="http://nokogiri.org/" rel="nofollow">http://nokogiri.org/</a><p>Perfectly fine use of css selectors and more
tcarnell, about 13 years ago
Thanks for sharing. This is a bit off-topic, but if you are interested in scraping web pages, you might find http://cQuery.com an interesting solution; it uses CSS selectors (much like jQuery) as its mechanism to extract content from live web pages.
prestonparris, about 13 years ago
I created a dumb little script using this technique that lets you read Hacker News in the terminal, then opens up the story in your browser.

https://github.com/prestonparris/node-hackernews
Comment #3680650 not loaded
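In the same spirit, a minimal jsdom + jQuery sketch that prints front-page titles to the terminal. This is not the linked repo's actual code; the td.title selector matches HN's table markup of the time and may no longer apply:

```js
var jsdom = require('jsdom');

// Load the page, inject jQuery, then use familiar selectors on the result.
jsdom.env('http://news.ycombinator.com/',
          ['http://code.jquery.com/jquery-1.7.1.min.js'],
          function (errors, window) {
    if (errors) {
        console.error(errors);
        return;
    }
    var $ = window.$;
    // Each story title lived in a td.title cell on the old HN front page.
    $('td.title a').each(function () {
        console.log($(this).text());
    });
});
```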
MatthewPhillips, about 13 years ago
I'm confused: is the window object from jsdom a live object? Can I scrape interactive sites?

Isn't PhantomJS already perfect for scraping? What is the advantage of this, exactly?
Comment #3680192 not loaded
Comment #3680202 not loaded
mistercow, about 13 years ago
I just did this recently. It works great except when the pages you're scraping have JS errors on them.
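One hedge against broken page JS, assuming jsdom's feature flags of that era (FetchExternalResources / ProcessExternalResources, recalled from its docs and possibly inexact): turn script processing off entirely, so the page's errors never fire. This only works if you don't need the page's own JS to run:

```js
var jsdom = require('jsdom');

jsdom.env({
    // A page whose own script would blow up if executed:
    html: '<html><body><script>throw new Error("boom");</script>' +
          '<h1>Still scrapable</h1></body></html>',
    features: {
        FetchExternalResources: false,    // don't download linked scripts
        ProcessExternalResources: false   // don't execute page scripts
    },
    done: function (errors, window) {
        // The broken script never runs, so plain DOM scraping still works.
        console.log(window.document.getElementsByTagName('h1')[0].innerHTML);
    }
});
```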
danso, about 13 years ago
OK, one thing I'm confused about: what's the advantage of scraping with node/jQuery over a traditional scripting language like Ruby + Nokogiri or Mechanize?

It's true that this process won't render the page with AJAX as your browser will, but I've found that if you do some web inspection of the page to determine the address and parameters of the backend scripts, then you don't even have to pull HTML at all. You just hit up the scripts and feed them parameters (or use Mechanize, if cookie/state tracking is involved).
Comment #3680470 not loaded
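A sketch of that approach: skip the HTML entirely and hit the backend endpoint directly. The endpoint and parameters here are hypothetical; in practice they come from watching the page in the browser's network inspector:

```js
var http = require('http');
var url = require('url');
var querystring = require('querystring');

// Hypothetical JSON endpoint and parameters discovered via web inspection.
var params = querystring.stringify({ page: 1, per_page: 50 });
var endpoint = url.parse('http://example.com/api/items?' + params);

http.get(endpoint, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
        // Structured data straight from the backend: no HTML parsing at all.
        var items = JSON.parse(body);
        console.log(items.length + ' items fetched directly');
    });
});
```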