I Don’t Need No Stinking API: Web Scraping For Fun and Profit

279 points by hartleybrody, over 12 years ago

42 comments

bdcravens, over 12 years ago

I've done a ton of scraping (mostly legal: on behalf of end users of an app, on sites they have legit access to). This article misses something that affects several sites: JavaScript-driven content. Faking headers and even setting cookies doesn't get around this. It is, of course, easy to get around using something like PhantomJS or Selenium. Selenium is great because, unlike all the whiz-bang scraping techniques, you're driving a real browser and your requests look real (if you make 10000 requests to index.php and never pull down a single image, you might look a bit suspicious). There's a bit more overhead, but micro instances on EC2 can easily run 2 or 3 Selenium sessions at the same time, and at 0.3 cents per hour for spot instances, you can have 200-300 browsers going for 30-50 cents/hour.
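The cost estimate in the comment above can be checked with a few lines of arithmetic. This is only a sketch: the helper name and its parameters are made up for illustration, and the 0.3 cents/hour spot price and 2-3 sessions per instance are the commenter's 2012 figures, not current pricing.

```python
import math
from decimal import Decimal


def hourly_cost_cents(browsers, sessions_per_instance,
                      cents_per_instance_hour=Decimal("0.3")):
    """Cents per hour to keep `browsers` sessions running (hypothetical helper)."""
    # Each instance hosts a fixed number of browser sessions; round up to whole instances.
    instances = math.ceil(browsers / sessions_per_instance)
    return instances * cents_per_instance_hour


print(hourly_cost_cents(200, 2))  # 100 instances
print(hourly_cost_cents(300, 2))  # 150 instances
```

At 2 sessions per instance, 200 to 300 browsers need 100 to 150 instances, which is 30 to 45 cents per hour at the quoted price and roughly matches the range given in the comment.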

derrida, over 12 years ago

(shameless plug) I can scrape asynchronously, anonymously, with JS wizardry, and feed it into your defined models in your MVC (e.g. Django). But! I need to get to a hacker conference on the other side of the world (29c3). Any other time of year, I'd just drop a tutorial. See profile if you'd like to help me with a consulting gig.

EDIT: Knowledge isn't zero-sum. Here's an overview of a kick-ass way to spider/scrape:

I use Scrapy to spider asynchronously. When I define the crawler bot as an object, if the site contains complicated stuff (stateful forms or JavaScript) I usually create methods that involve importing either Mechanize or QtWebKit. XPath selectors are also useful because you don't have to specify the entire XML tree from trunk to leaf. I then import pre-existing Django models from the site I want the data to go into and write to the DB. At this point you usually have to convert some types.

I find Scrapy cleaner and more like a pipeline, so it seems to produce less 'side effect kludge' than other scraping methods (if anybody has seen a complex Beautiful Soup + Mechanize scraper you know what I mean by 'side effect kludge'). It can also act as a server to return JSON.

Being asynchronous, you can do crazy req/s.

I will leave out how to do all this through Tor because I don't want the Tor network being abused, but I am happy to talk about it one on one if your interest is beyond spamming the web.

Through this + a couple of unmentioned tricks, it's possible to get insane data, so much so that it crosses over into security research & could be used for pen-testing.
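The remark about XPath selectors not needing the full path "from trunk to leaf" can be illustrated with a small sketch. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree` rather than Scrapy's own selectors, and it assumes well-formed markup, which real pages often are not. The sample page and the `extract_items` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real HTML usually needs a lenient parser.
PAGE = """
<html><body>
  <div id="wrap"><div class="listing">
    <ul>
      <li class="item"><span class="title">Widget</span><span class="price">9.99</span></li>
      <li class="item"><span class="title">Gadget</span><span class="price">19.50</span></li>
    </ul>
  </div></div>
</body></html>
"""


def extract_items(markup):
    """Return (title, price) pairs without spelling out the full element path."""
    root = ET.fromstring(markup.strip())
    items = []
    # ".//" matches <li class="item"> at any depth below the root,
    # so the html/body/div/div/ul chain never has to be written out.
    for li in root.findall(".//li[@class='item']"):
        title = li.find("span[@class='title']").text
        price = float(li.find("span[@class='price']").text)
        items.append((title, price))
    return items


print(extract_items(PAGE))
```

Because the selector is anchored on the `item` class rather than the surrounding layout, wrapping the list in another `div` would not break it.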

toyg, over 12 years ago

And this is why we can't have nice things.

Web scraping, as fun as it is (and btw, this title again abuses "Fun and Profit"), is not a practice we should encourage. Yes, it's the sort of dirty practice many people engage in at one point or another, but it shouldn't be glorified.

rsingel, over 12 years ago

There are some recent federal cases (Weev, http://www.wired.com/opinion/2012/11/att-ipad-hacker-when-embarassment-becomes-a-crime/; Aaron Swartz, http://www.wired.com/threatlevel/2012/09/aaron-swartz-felony/; and a prosecution of scalpers, http://www.wired.com/threatlevel/2010/07/ticketmaster/) that view scraping as a felony hacking offense. The feds think that an attempt to evade CAPTCHAs or IP and MAC blocks is a felony worthy of years in prison.

In fact, the feds might think that clearing your cookies or switching browsers to get another 10 free articles from the NYTimes is also felony hacking.

Which is to say, be careful what you admit to in this forum AND how you characterize what you are doing in your private conversations and e-mails.

Weev now faces a decade or more in prison because he drummed up publicity by sending emails to journalists that used the verb "stole".

kaffeinecoma, over 12 years ago

From the article:

"Since the third party service conducted rate-limiting based on IP address (stated in their docs), my solution was to put the code that hit their service into some client-side Javascript, and then send the results back to my server from each of the clients. This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit."

Pretty sure the browser's Same-Origin Policy forbids this. Think about it: if this worked, you'd be able to scrape inside corporate firewalls simply by having users visit your website from behind the firewall.

kevinpfab, over 12 years ago

The issue with web scraping is that it relies on the scraper to keep up with changes made to the site.

If a site owner changes the layout or implements a new feature, the programs depending on the scraper immediately fail. This is much less likely to happen when working with official APIs.

cynwoody, over 12 years ago

Great read!

In the past, I have successfully used HtmlUnit to fulfill my admittedly limited scraping needs.

It runs headless, but it has a virtual head designed to pretend it's a user visiting a web application to be tested for QA purposes. You just program it to go through the motions of a human visiting a site to be tested (or scraped). E.g., click here, get some response. For each whatever in the response, click and aggregate the results in your output (to whatever granularity).

Alas, it's in Java. But, if you use JRuby, you can avoid most of the nastiness that implies. (You do need to know Java, but at least you don't have to write Java.)

Hartley, what is your recommended toolkit?

I note you mentioned the problem of dynamically generated content. You develop your plan of attack using the browser plus Chrome Inspector or Firebug. So far, so good. But what if you want to be headless? Then you need something that will generate a DOM as if presenting a real user interface but instead simply returns a reference to the DOM tree that you are free to scan and react to.

RaSoJo, over 12 years ago

I love HTML scraping. But JavaScript???...

The juiciest data sets these days are increasingly in JS. For the life of me, I can't get around scraping JS :(

I do know that Selenium can be used for this... but I am yet to see a decent example of it. Does anyone have any good resources/examples on JS scraping that they could share? I would be eternally grateful.

bdcravens, over 12 years ago

Another issue not covered: file downloads. Let's say you have a process that creates a dynamic image, or logs in and downloads dynamic PDFs. Even Selenium can't handle this (the download dialog is an OS-level feature). At one point I was able to get Chrome to auto-download in Selenium, but had zero control over the filename and where it was saving. I ended up using iMacros (the paid version) to drive this (using Windows instances: their Linux version is very immature by comparison).

mmastrac, over 12 years ago

I'm surprised that no one has attempted to write a Twitter client based solely on scraping to get around the token limits.

lazyjones, over 12 years ago

Scraping could be made a lot harder by website publishers, but they all depend on the biggest scraper accessing their content so it can bring traffic: Google...

The biggest downside of scraping is that it often takes a long time for very little content (e.g. scraping online stores with extremely bloated HTML and 10-25 products per page).

joe_the_user, over 12 years ago

An important topic.

The main caveat is that this may violate a site's terms of use, and thus website owners may feel called upon to sue you. Depending on circumstances, the legal situation here can be a long story.

zarino, over 12 years ago

Related: If you fancy writing scrapers for fun and profit, ScraperWiki (a Liverpool, UK-based data startup) is currently hiring full-time data scientists. Check us out!

http://scraperwiki.com/jobs/#swjob5

jbranchaud, over 12 years ago

The title makes it sound as if there is going to be some discussion of how the OP has made web scraping profitable, but this seems to have been left to the reader's imagination.

Otherwise, great article! I agree that BeautifulSoup is a great tool for this.

mcgwiz, over 12 years ago

It's pointless to think of it as "wrong" for third parties to web-scrape. Entities will do as they must to survive. The onus of mitigating web scraping, if in the interests of the publisher, is on the publisher.

As a startup developer, third-party scraping is something I need to be aware of, and that I need to defend against if doing so suits my interests. A little bit of research shows that this is not impractical. Dynamic IP restrictions (or slowbanning), rudimentary data watermarking, and caching of anonymous request output all mitigate this. Spot-checking popular content by running it through Google Search requires all of five minutes per week. At that point, the specific situation can be addressed holistically (a simple attribution license might make everyone happy). With enough research, one might consider hellbanning the offender (serving bogus content to requests satisfying some certain heuristic) as a deterrent. A legal pursuit, with its cost, would likely be a last resort.

Accept the possibility of being scraped and prepare accordingly.
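One way to read "dynamic IP restrictions" is a per-IP sliding-window limit. The sketch below is an assumption about what that could look like, not the commenter's implementation: the class name, the thresholds, and the choice to pass timestamps in explicitly (which keeps the logic easy to test) are all illustrative.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per IP within any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # caller could slow down, block, or serve cached output
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)
decisions = [limiter.allow("203.0.113.5", t) for t in (0, 1, 2, 3)]
print(decisions)
```

In a real deployment `now` would come from a monotonic clock, and a rejected request could be delayed rather than refused, which is closer to the slowbanning the comment mentions.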

im3w1l, over 12 years ago

People seem to wonder how to handle Ajax.

The answer is HttpFox. It records all HTTP requests.

1. Start recording.
2. Do some action that causes data to be fetched.
3. Stop recording.

You will find the URL, the returned data, and a nice table of GET and POST variables.

https://addons.mozilla.org/en-us/firefox/addon/httpfox/
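Once a recorder has revealed the URL and query parameters behind an Ajax call, the endpoint can often be requested directly and its JSON parsed without touching the HTML. The sketch below assumes a GET endpoint that returns UTF-8 JSON; the URL, parameter names, and response shape are invented, and a stub opener stands in for the network so the example runs offline.

```python
import io
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(base_url, params, opener=urlopen):
    """GET base_url with query params and decode the JSON body."""
    url = f"{base_url}?{urlencode(params)}"
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fake_opener(url):
    """Offline stand-in for urlopen; returns a canned, made-up response."""
    return io.BytesIO(b'{"items": ["a", "b"], "next_page": 3}')


# Hypothetical endpoint and parameters, as they might appear in a recorded request.
data = fetch_json("http://example.com/ajax/list", {"page": 2}, opener=fake_opener)
print(data["items"])
```

Dropping the `opener` argument makes the same function hit the real network through `urllib.request.urlopen`.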

metalruler, over 12 years ago

From a site owner's perspective: if you have a LOT of data then scraping can be very disruptive. I've had someone scraping my site for literally months, using hundreds of different open proxies, plus multiple faked user-agents, in order to defeat scraping detection. At one point they were accessing my site over 300,000 times per day (3.5/sec), which exceeded the level of the next busiest (and welcome) agent... Googlebot. In total I estimate this person has made more than 30 million fetch attempts over the past few months. I eventually figured out a unique signature for their bot and blocked 95%+ of their attempts, but they still kept trying. I managed to find a contact for their network administrator and the constant door-knocking finally stopped today.

mbustamante, over 12 years ago

When I need to scrape a webpage, I use phpQuery (http://code.google.com/p/phpquery/). It's dead simple if you have experience with jQuery, and I get all the benefits of a server-side programming language.

SiVal, over 12 years ago

What I wish I could do is capture Flash audio (or any audio) streams with my Mac. All I want is to listen to the audio-only content with an audio player when I'm out driving or jogging, etc. Audio-only content that has to be played off a web page usually runs into the contradiction that if I'm in a position to click buttons on my web browser (not driving, for example), I'm in a position to do my REAL work and have no time to listen to the audio. I'll go to the web page, see whatever ads they may have, but then I'd like to be able to "scrape" the audio stream into a file so I don't have to sit there staring at a static web page the whole time I'm listening.

SG-, over 12 years ago

When scraping HTML where data gets populated with JS/Ajax, you can use a web inspector to look at where that data is coming from and manually GET it; it will likely be in some nice JSON.

Scraping used to be the way to get data back in the day, but back then websites didn't change their layout/structure on a weekly basis and were much more static in their structure.

Having recently written a small app that was forced to scrape HTML, and having to update it every month to keep it working, I can't imagine doing this for a larger project and maintaining it.

alhenaadams, over 12 years ago

To all HN: All this being said, how do we prevent our sites from being scraped in this way? What can you not get around, and what are the potential uses for an 'unscrapeable' site to your mind.

thomasrambaud, over 12 years ago

I think the author completely missed the point of API vs. screen scraping. The API allows for accessing structured data. Even if the website changes, the data would be accessible the same way through the API, whereas the author would have to rewrite his code each time an update is made to the front-end code of the website.

A simple API providing a simple JSON response with HTTP basic auth is far more efficient than a screen-scraping program where you have to parse the response using HTML/XML parsers.

6ren, over 12 years ago

This illustrates the significant difference between the use cases of "web APIs" and conventional APIs: the former are more like database CRUD (including REST) than a request for computation. They are (usually) an alternative interface to a website (a GUI), and that's how most websites are used. E.g. an API for HN would allow story/comment retrieval, voting, submission, and commenting.

They could be used for computation, but (mostly) aren't.

treelovinhippie, over 12 years ago

Not every site. There is data I would really love to access on Facebook without having to gain specific authorization from the user. It's odd that for most user profiles the most you can extract via the Graph API (with no access token) is their name and sex, whereas I can visit their profile page in the browser and see all sorts of info and latest updates (without even being friends with them).

Tried scraping Facebook. They have IP blocks and the like.

kuhn, over 12 years ago

This is a shameless plug, but I've created a service that aims to help with a lot of the issues that the OP describes, such as rate limiting, JS, and scaling. It's a bit like Heroku for web scraping and automation. It's still in beta, but if anyone is interested, check out http://tubes.io.

senthilnayagam, over 12 years ago

I have done a bit of scraping with Ruby Mechanize; when we hit limits, we circumvented them with proxies and Tor.

Google, as a search engine, crawls almost all sites, but offers very little that is usable to other bots:

http://www.google.com/robots.txt

Disallow: 247, Allow: 41
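Tallies like the Disallow/Allow counts above can be reproduced by counting directive lines in a robots.txt file. This is a sketch under assumptions: the sample text is invented (it is not Google's file), the helper name is hypothetical, and the counts for any live site will have changed since the comment was written.

```python
# Invented sample; a real file would be downloaded from the site's /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /search/about   # comments after a rule are ignored
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
"""


def count_rules(robots_txt):
    """Count Allow and Disallow directives, ignoring comments and other fields."""
    counts = {"allow": 0, "disallow": 0}
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, _value = line.partition(":")
        field = field.strip().lower()  # field names are case-insensitive
        if sep and field in counts:
            counts[field] += 1
    return counts


print(count_rules(SAMPLE_ROBOTS_TXT))
```

The counts cover every user-agent group in the file, so they describe how restrictive the file is overall rather than what a particular bot may fetch.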

kragen, over 12 years ago

Be careful. I got banned from Google for scraping. I did a few hundred thousand searches one day, and that night, they banned my office IP address for a week. This was in 2001, so I estimate I cost them a few hundred dollars, which is now impossible to repay. :(

clark-kent, over 12 years ago

The problem with scraping instead of using the API is that when the website makes even a slight change to its markup, it breaks your code. I have had that experience and it's a living hell. I can say it's not worth it to scrape when there is an API available.

aleprok, over 12 years ago

There is just one major problem with not needing a stinking API: you cannot POST on behalf of a client without requiring them to give you their password, which would give you full access to their account instead of the limited access an API provides.

thenomad, over 12 years ago

I had to do some scraping of a rather JavaScript-heavy site last year, and I found the entire process was made almost trivial using Ruby and Nokogiri. Particularly relevant for a non-uber-programmer like me: it's simple to use as well as powerful.

jmgunn87, over 12 years ago

So bloody true. A web page is a resource just like an XML doc; there's no reason public-facing URLs and web content can't be treated as such, and I regularly take advantage of that fact as well. Great post.

pknerd, over 12 years ago

If the job doesn't need to be automated and only runs a few times, I prefer iMacros to perform tasks on my behalf. The best part is that you can integrate a DB to record your desired data.

reledi, over 12 years ago

Automated web testing tools, such as Watir and Selenium, are also pretty good options. I'm especially surprised Watir hasn't been mentioned yet in the comments.

tectonic, over 12 years ago

Check out http://selectorgadget.com as a useful tool for coming up with CSS selectors.

opminion, over 12 years ago

How about publicly available web scraping tools as a way to encourage sites to provide good APIs? Everybody wants efficiency, after all.

bconway, over 12 years ago

"No Rate-Limiting"

Clearly someone's never spent time diagnosing the fun that is scraping HN (yes, an unofficial API is available).

shocks, over 12 years ago

Node.js is excellent for web scraping, especially if you're scraping large amounts very often.

ComputerGuru, over 12 years ago

What is it with all the headlines this week abusing the classic "for fun and profit" title?

eranation, over 12 years ago

Relevant: http://www.codinghorror.com/blog/2009/02/rate-limiting-and-velocity-checking.html

yayitswei, over 12 years ago

I've found diffbot to be quite useful for scraping.

buster, over 12 years ago

I so do not agree with that article; it makes me sick. And this guy is basically some "marketer", so no wonder he gets quite some stuff wrong, imo. :p

thisisnotatest, over 12 years ago

Craigslist, anyone?
