Web scraping with Node.js

34 points by dandrewsen over 12 years ago

6 comments

chrissnell over 12 years ago
This makes me just a bit nervous. You're scraping bank websites using a headless WebKit browser, which is presumably vulnerable to future exploits. You have my username and password (and probably verification questions) either stored on or accessible from that same server. Who's to say that one of the sites you crawl won't get compromised and used as a vector to compromise your crawler box and--potentially--your customers' banking credentials?
niggler over 12 years ago
I found casperjs (http://casperjs.org/) to be a pleasant framework to work with.
lopatin over 12 years ago
This article should have mentioned node.io (https://github.com/chriso/node.io) for completeness. It hasn't been updated in a while and I'm not sure if other frameworks have popped up, but it has been a pleasure to use for some big scraping tasks.
runningbread over 12 years ago
I wonder how they get around the two-level authentication problem? Even if I give my password to the scraper, an extra credential would be required. How do you work around that?
ilaksh over 12 years ago
Isn't it almost always against the terms of service to scrape content off of websites?
salmanapk over 12 years ago
> There’s no way to download resources with phantomjs – the only thing you can do is create a snapshot of the page as a png or pdf. That’s useful but meant we had to resort back to request() for the PDF download.

That's not a "problem"; you shouldn't be using WebKit to download files.
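
For illustration, here is a minimal sketch of fetching a file over plain HTTP instead of through the headless browser. It is not the article's code: the quoted passage uses request(), while this sketch swaps in the `fetch` built into Node.js 18+. The names `downloadFile`, `url`, `destPath`, and `headers` are hypothetical, and passing session cookies through `headers` is an assumption about how an authenticated download might be wired up.

```javascript
// Sketch only: download a file with a plain HTTP client rather than WebKit.
// Assumes Node.js 18+ (global fetch). All names here are illustrative.
const { writeFile } = require('node:fs/promises');

async function downloadFile(url, destPath, headers = {}) {
  // headers may carry cookies captured from the browser session (assumption).
  const response = await fetch(url, { headers });
  if (!response.ok) {
    throw new Error(`Download failed: HTTP ${response.status}`);
  }
  // Buffer the whole body in memory; fine for small documents such as statements.
  const body = Buffer.from(await response.arrayBuffer());
  await writeFile(destPath, body);
  return body.length; // number of bytes written
}
```

A call might look like `await downloadFile(pdfUrl, 'statement.pdf', { cookie: sessionCookie })`, where `pdfUrl` and `sessionCookie` are placeholders for values obtained during the scraping session.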