TechEcho

This makes me just a bit nervous. You're scraping bank websites using a headless WebKit browser, which is presumably vulnerable to future exploits. You have my username and password (and probably verification questions) either stored on or accessible from that same server. Who's to say that one of the sites you crawl won't get compromised and used as a vector to compromise your crawler box and--potentially--your customers' banking credentials?

I found casperjs (http://casperjs.org/) to be a pleasant framework to work with.

This article should have mentioned node.io (https://github.com/chriso/node.io) for completeness. It hasn't been updated in a while and I'm not sure whether other frameworks have popped up since, but it has been a pleasure to use for some big scraping tasks.

I wonder how they get around the two-level authentication problem. Even if I give my password to the scraper, an extra credential would still be required. How do you work around that?

Isn't it almost always against the terms of service to scrape content off of websites?

> There’s no way to download resources with phantomjs – the only thing you can do is create a snapshot of the page as a png or pdf. That’s useful but meant we had to resort back to request() for the PDF download.

That's not a "problem", you shouldn't be using WebKit to download files.
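For what it's worth, here is a rough sketch of that split: let the headless browser handle the login, then fetch the file with a plain HTTP request that reuses the session cookies. It uses only Node's built-in `https` and `fs` modules rather than the request() library the article mentions. The helper names (`cookieHeader`, `downloadFile`) and the assumption that the cookies have already been exported from the browser as `{ name, value }` objects are mine, not the article's.

```javascript
const https = require('https');
const fs = require('fs');

// Build a Cookie request header from an array of { name, value } objects.
// Assumption: the cookies were exported from the headless browser session.
function cookieHeader(cookies) {
  return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}

// Hypothetical helper: stream `url` to `destPath`, sending the session cookies.
// Resolves with destPath once the file has been fully written.
function downloadFile(url, cookies, destPath) {
  return new Promise((resolve, reject) => {
    const options = { headers: { Cookie: cookieHeader(cookies) } };
    const req = https.get(url, options, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error(`Unexpected status ${res.statusCode}`));
        return;
      }
      const out = fs.createWriteStream(destPath);
      res.pipe(out);
      out.on('finish', () => resolve(destPath));
      out.on('error', reject);
    });
    req.on('error', reject);
  });
}
```

Something like `downloadFile(pdfUrl, sessionCookies, 'statement.pdf')` would then replace the snapshot step. The sketch does not follow redirects, and it will not help if the site ties the session to more than cookies.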

Web scraping with Node.js

6 comments
