
Web Scraping in 2016

852 points by franciskim, over 8 years ago

33 comments

minimaxir over 8 years ago
Keep in mind that companies have *sued* over scraping done outside their API. LinkedIn, for example, explicitly prohibits scraping in its ToS: http://www.informationweek.com/software/social/linkedin-sues-after-scraping-of-user-data/d/d-id/1113362

OKCupid issued a DMCA takedown against researchers who released scraped data: https://www.engadget.com/2016/05/17/publicly-released-okcupid-profiles-taken-down-dmca-claim/

Since both of these incidents, I now only scrape if a) it's through the API, following rate limits, or b) there is no API and the data is explicitly meant to be shared publicly (e.g. blogs), in which case I follow robots.txt. Of course, most companies have a do-not-scrape clause in their ToS anyway, to my personal frustration.

(Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)

mack73 over 8 years ago
Corporations will abuse your personal integrity whenever they get a chance, while abiding by the law. Corporations will cry like babies when their publicly available data (their livelihood) gets scraped. They will take you to court.

They consider their data to be theirs, even though they published it on the internet. They consider your data (your personal integrity) to be theirs as well, because how can you expect personal integrity when you are surfing the internet?

I have high hopes that the judicial system, some time not too far from now, will realize that since the law should be a reflection of current moral standards it will always lag behind, trying to catch up with us, and that those who break the law without breaching those moral standards are still "good citizens" unworthy of prison or fines.

I guess Google won this iteration of the internet because of the double standards site owners stand by: they allow Google to scrape anything while hindering any competitor from doing the same. There will only be a true competitor to Google when, in the next iteration of the internet, we realize that searching vast amounts of data (the internet) is a solved problem, that anyone can do as good a job as Google, and we move on to the next quirk, around which there will be competition. In the end that quirk will be solved too, we'll have a winner, and it will be time to move on to the next iteration.

fake-name over 8 years ago
I do a significant amount of scraping for hobby projects, albeit mostly open websites. As a result, I've gotten pretty good at circumventing rate-limiting and most other controls.

I suspect I'm one of those bad people your parents tell you to avoid; by that I mean I completely ignore robots.txt.

At this point, my architecture has settled on a distributed RPC system with a rotating swarm of clients. I use RabbitMQ for message-passing middleware, SaltStack for automated VM provisioning, and Python everywhere for everything else. Using some randomization and a list of the top n user agents, I can randomly generate about ~800K unique but valid-looking UAs. Selenium+PhantomJS gets you through non-CAPTCHA Cloudflare. Backing storage is Postgres.

Database triggers do row versioning, and I wind up with what is basically a mini internet-archive of my own, with periodic snapshots of a site over time. Additionally, I have a readability-like processing layer that rewrites the page content in hopes of making the resulting layout actually pleasant to read, with pluggable rulesets that determine page element decomposition.

At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is I actually pay for the hosts.

---

Scaling something like this up to high volume is really an interesting challenge. My hosts are physically distributed, and just maintaining the RabbitMQ socket links is hard. I've actually had to do some hacking on the RabbitMQ library to let it handle the various ways I've seen a socket get wedged, and I still have some reliability issues in the SaltStack-DigitalOcean interface where VM creation gets stuck in an infinite loop, leading to me bleeding all my hosts. I also had to implement my own message fragmentation on top of RabbitMQ, because literally no AMQP library I found could *reliably* handle large (>100K) messages without eventually wedging.

There are other fun problems too, like the fact that I have a Postgres database that's ~700 GB in size, which means you have to spend time considering your DB design and doing query optimization too. I apparently have big data problems in my bedroom (my home servers are in my bedroom closet).

---

It's all on GitHub, FWIW:

Manager: https://github.com/fake-name/ReadableWebProxy

Agent and Salt scheduler: https://github.com/fake-name/AutoTriever

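For reference, a rough sketch of the user-agent randomization idea described above. The commenter's own tooling is Python; this JavaScript version, including its templates and version ranges, is an illustrative assumption rather than anything taken from the linked repos:

```javascript
// Illustrative only: combine a few base UA templates with randomized but
// plausible version strings to get a large space of valid-looking user agents.
const templates = [
  (v, os) => `Mozilla/5.0 (${os}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/${v}.0.2840.71 Safari/537.36`,
  (v, os) => `Mozilla/5.0 (${os}; rv:${v}.0) Gecko/20100101 Firefox/${v}.0`,
];
const oses = [
  'Windows NT 10.0; Win64; x64',
  'Macintosh; Intel Mac OS X 10_11_6',
  'X11; Linux x86_64',
];

function randomUserAgent() {
  const pick = arr => arr[Math.floor(Math.random() * arr.length)];
  const major = 40 + Math.floor(Math.random() * 15); // made-up but plausible major version
  return pick(templates)(major, pick(oses));
}

console.log(randomUserAgent());
```
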
prashnts over 8 years ago
A neat trick I sometimes use to "scrape" data from sites that use jQuery AJAX to load data is to plug a middleware into jQuery's XHR:

```javascript
$.ajaxSetup({
  dataFilter: function (data, type) {
    if (this.url === 'some url that you want to watch!') {
      // Do anything with the data here
      awesomeMethod(this.data)
    }
    return data
  }
})
```

I remember last using it with an infinite-scroll page, with a periodic callback that scrolled the page down every 2 seconds, and the `awesomeMethod` just initiated the download. Pasted it all into the dev-tools console, and the cheap "scraper" was ready!

danso over 8 years ago
This good list of tactics underscores, for me, how the state of the Web has made it a lot more difficult to teach web scraping as a fun exercise for newbie programmers. It used to be you could get by with the assumption that what you see in the browser is what you get when you download the raw HTML... but that's increasingly less often the case. So now you have to teach how to debug via the console and network panel, on top of basic HTTP concepts (such as query parameters).

(Even more problematic is that college kids today seem to have a decaying understanding of what a URL is, given how much web navigation we do through the omnibar or apps, particularly on mobile, but that's another issue.)

I've been archiving a few government sites to preserve them for web scraping exercises [0] (the Texas death penalty site is a classic, both for being relatively simple at first and for being incredibly convoluted depending on what level of detail you want to scrape [1]). But I imagine even government sites will move more toward AJAX/app-like designs, if the trend at the federal level means anything.

That said, I think the analytics.usa.gov site is a great place to demonstrate the difference between server-generated HTML and client-rendered HTML.

But as someone who just likes doing web scraping, I feel the tools have mostly kept up with the changes to the web. It's been relatively easy, for example, to run Selenium through Python to mimic user action [2]. Same with PhantomJS through Node, which has vastly improved how accurately it renders pages for screenshots compared to what I remember a few years back.

[0] https://github.com/wgetsnaps

[1] https://github.com/wgetsnaps/tdcj-state-tx-us--death_row

[2] https://gist.github.com/dannguyen/8a6fa49253c1d6a0eb92

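The gist in [2] is Python; a roughly equivalent sketch using the selenium-webdriver package from Node might look like the following. The element id is a guess used purely for illustration, so adjust it to the page's actual markup:

```javascript
// Sketch: drive a real browser and wait for client-rendered content before
// reading the DOM, which a plain HTTP fetch of the raw HTML would miss.
const { Builder, By, until } = require('selenium-webdriver');

(async () => {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://analytics.usa.gov/');
    // 'current_visitors' is a hypothetical id used here for illustration.
    await driver.wait(until.elementLocated(By.id('current_visitors')), 10000);
    const visitors = await driver.findElement(By.id('current_visitors')).getText();
    console.log('visitors right now:', visitors);
  } finally {
    await driver.quit();
  }
})();
```
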
XCSme over 8 years ago
Tbh I didn't enjoy the article; it just seems like someone who has just learned about Node.js tried to explain (and mostly failed) how to use some packages to scrape a page. I was expecting to learn some new techniques, but all it explained was how to make a few API calls in order to solve a very specific problem. Also, there was the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails"; this just shows a lot of immaturity.

Jake232 over 8 years ago
Not wanting to thread-hijack, but just going to post an article I wrote a few years back, as it covers a few other things that are still relevant and often still gets referenced. Maybe it'll help some people out in combination with OP's post.

http://jakeaustwick.me/python-web-scraping-resource/

stupidcar over 8 years ago
I wrote a fairly complex spidering and scraping script in Node a few months ago. I found downcache[1] to be absolutely invaluable, particularly while debugging my parsing scripts, as I was able to rerun them relatively quickly over the cached responses.

However, when the network was no longer a bottleneck, I found that the speed and single-threaded nature of Node became one. It wasn't really that slow, relatively speaking, but I had a few hundred gigs of HTML to chew through every time I made a correction, so it was important to keep the turnaround as fast as possible.

I eventually managed to manually partition the task so I could launch separate Node scripts to handle different parts of it, but it wasn't a perfect split, and there was a fair bit of duplicated work, where a shared cache would have helped a great deal.

In retrospect, I should have thrown my JS away and started again in something with easy threading like Java or C#. But -- familiar story -- I'd underestimated the complexity of the task to begin with, and by the time I understood, I'd sunk a lot of time into writing my JS parsing code and didn't fancy converting it all to another language, particularly when it always seemed like "just one more" correction to the parsing would make everything work right. In the end, what was supposed to take a weekend took about three months of work, off and on, to finish.

[1] https://www.npmjs.com/package/downcache

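A minimal sketch of the kind of manual partitioning described above, using Node's built-in cluster module; the static shard-by-index split and file names are illustrative assumptions, not the commenter's actual setup:

```javascript
// Fan the CPU-bound parsing work out across worker processes.
const cluster = require('cluster');
const os = require('os');

const files = ['a.html', 'b.html', 'c.html', 'd.html']; // stand-in for the cached responses

if (cluster.isMaster) {
  const shards = os.cpus().length;
  for (let i = 0; i < shards; i++) {
    cluster.fork({ SHARD: i, SHARDS: shards });          // pass the partition via env vars
  }
  cluster.on('exit', w => console.log(`worker ${w.id} finished`));
} else {
  const shard = Number(process.env.SHARD);
  const shards = Number(process.env.SHARDS);
  files
    .filter((_, i) => i % shards === shard)              // simple static partition
    .forEach(f => console.log(`shard ${shard} would parse ${f}`)); // real parsing goes here
  process.exit(0);
}
```
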
dchuk over 8 years ago
Scraping with Selenium in Docker is pretty great, especially because you can use the Docker API itself to spin up/shut down containers at will. So you can spin up a container to hit a specific URL in a second, scrape whatever you're looking for, then kill the container. This can be done via a job queue (Sidekiq if you're using Ruby) to do all sorts of fun stuff.

That aside, hitting Insta like this is playing with fire, because you're really dealing with Facebook and their legal team.

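A rough sketch of the spin-up / scrape / tear-down pattern described above, shelling out to the Docker CLI from Node rather than using a Docker API client; the image, port, and placeholder scrape step are assumptions:

```javascript
const { execSync } = require('child_process');

function scrapeWithThrowawayBrowser(url) {
  // Start a disposable Selenium container and capture its ID.
  const id = execSync('docker run -d -p 4444:4444 selenium/standalone-chrome')
    .toString()
    .trim();
  try {
    // ...point a WebDriver client at http://localhost:4444/wd/hub,
    // load `url`, and pull out whatever you're looking for...
    console.log(`container ${id.slice(0, 12)} up, would scrape ${url}`);
  } finally {
    // Kill and remove the container as soon as the job is done.
    execSync(`docker rm -f ${id}`);
  }
}

scrapeWithThrowawayBrowser('https://example.com');
```
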
mosburger over 8 years ago
> AngelList even detects PhamtomJS (have not seen other sites do this).

I run a site that aggregates/crawls job boards for remote job postings, and AngelList has been VERY difficult to crawl for various reasons, but you can easily get PhantomJS to work (I have). Having said that, I've never felt very good about the fact that I'm defeating their attempts to block me (even though I feel like I'm doing them a favor) and will likely retire that bot soon.

It kinda sucks, because I'm just grabbing publicly available content in a very low-bandwidth way, but I really can't convince myself that what I'm doing is very ethical.

My to-do list includes making my crawler into a more well-behaved bot, and that will have to go.

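AngelList's actual checks aren't public; a common style of PhantomJS detection at the time looked for the artifacts the headless browser leaves in the page, roughly like this sketch:

```javascript
// Runs in the page: PhantomJS exposes a couple of telltale globals and a
// default user-agent string unless the client goes out of its way to hide them.
function looksLikePhantomJS() {
  return Boolean(
    window.callPhantom ||                    // callback hook injected by PhantomJS
    window._phantom ||                       // legacy PhantomJS global
    /PhantomJS/i.test(navigator.userAgent)   // default UA, unless overridden
  );
}

if (looksLikePhantomJS()) {
  // e.g. serve a CAPTCHA or an empty shell instead of the real content
  console.log('automation suspected');
}
```
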
pault over 8 years ago
I don't know why more people don't use Chrome extensions for scraping. Using a boilerplate[1], you can get a scraper up and running in minutes. Start a Node server that serves up URLs and stores parsed data, and run the scraper in the browser. Best of all, you can watch it running and debug if something goes wrong. I know it doesn't scale well if you're running a SaaS, but for personal projects and research/data normalization it's the lowest barrier to entry, in my opinion.

[1] http://extensionizr.com

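A minimal sketch of the pattern described above: a content script scrapes the open page with plain DOM calls and hands the result to a local Node server. The selectors, endpoints, and port are hypothetical, and a manifest granting the relevant host permissions is assumed:

```javascript
// content-script.js (illustrative): runs inside the page the browser has loaded.
(async () => {
  // Scrape with ordinary DOM APIs; the page is already fully rendered.
  const items = Array.from(document.querySelectorAll('.listing')).map(el => ({
    title: (el.querySelector('h2') || {}).textContent,
    url: (el.querySelector('a') || {}).href,
  }));

  // Hand the parsed data to the local coordination server.
  await fetch('http://localhost:3000/results', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ page: location.href, items }),
  });

  // Ask the server for the next URL to visit and navigate there.
  const { next } = await (await fetch('http://localhost:3000/next')).json();
  if (next) location.href = next;
})();
```
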
franciskim over 8 years ago
Sorry guys, hit by traffic; just scaling up my EC2 instance at the moment.

jgmmo over 8 years ago
Good stuff.

I do a good bit of scraping, and made RubyRetriever[1] to make my life easier, but it seems like I'm getting roadblocked on occasion, probably due to some of the things you mention in your article.

Is there any way for a site to verify that only their JS and CSS files are linked? Like preventing injection?

[1]: https://github.com/joenorton/rubyretriever

nreece over 8 years ago
At Feedity (https://feedity.com), we "index" webpages to generate custom feeds. Over the years, we've designed our system to use a mix of technologies like .NET (C#) and Node.js, and implemented a bunch of tweaks and optimizations for seamless & scalable access to public content.

IANAD over 8 years ago
> But if you are automating your exact actions that happen via a browser, can this be blocked?

Yes, by checking the times between actions and the number of actions in a time period, and blocking atypical activity. I was IP-banned from a site once for a few months, after trying to scrape it too much and hitting links on the site that were hidden from humans.

The random wait settings specified in the post are better than nothing, but still too flimsy. You would need to put hours between requests, only request during a certain 15-hour period, take days off, and eventually you aren't scraping regularly enough to do much good.

Scraping is not an API, and I should know; I used to do it for a living. It's unreliable. It requires constant maintenance. APIs can break too, but they are meant for the sort of consumption you are trying for.

If you scrape for a living, only do it as a side job.

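A small sketch of the kind of server-side timing check described above; the thresholds and data structure are illustrative assumptions, not any particular site's rules:

```javascript
// Flag clients that request too often or with suspiciously regular gaps.
const history = new Map(); // ip -> recent request timestamps (ms)

function looksAutomated(ip, now = Date.now()) {
  const WINDOW_MS = 60 * 1000;
  const MAX_REQUESTS = 30;   // more than this per minute is suspicious
  const MIN_JITTER_MS = 150; // humans rarely click with near-constant gaps

  const ts = (history.get(ip) || []).filter(t => now - t < WINDOW_MS);
  ts.push(now);
  history.set(ip, ts);

  if (ts.length > MAX_REQUESTS) return true;

  // Near-identical gaps between requests suggest a scripted loop.
  const gaps = ts.slice(1).map((t, i) => t - ts[i]);
  if (gaps.length >= 5) {
    const spread = Math.max(...gaps) - Math.min(...gaps);
    if (spread < MIN_JITTER_MS) return true;
  }
  return false;
}
```
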
headmelted over 8 years ago
I actually love Selenium for this purpose, for much the same reasons the author mentions here.

It's almost impossible for a website to reliably detect that a client web browser is being automated, and I find I can make Selenium scripts much more adaptable to breaking changes in websites, when they occur, than I can when hooking up my code directly.

I actually disagree with the contention that Selenium is slower than directly scraping, though. The Firefox driver has always been lightning fast for me, and the bottleneck is almost always server requests that would have been necessary either way.

lamby over 8 years ago
Whilst they mean well, I find this fundamentally deceptive: the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these "scrape the web with X" articles.

The difficulties are invariably in "post-processing": working around incomplete data on the page, handling errors gracefully and retrying in some (but not all) situations, keeping on top of layout/URL/data changes to the target site, not hitting the target site too often, logging into the target site if necessary and rotating credentials and IP addresses, respecting robots.txt, the target site being utterly braindead, keeping users meaningfully informed of scraping progress if they are waiting on it, the target site adding and removing data resulting in a null-leaning database schema, sane parallelisation in the presence of prioritisation of important requests, difficulties in monitoring a scraping system due to its implicitly non-deterministic nature, and general problems associated with long-running background processes in web stacks.

Et cetera.

In other words, extracting the right text from the page is by far the easiest and most trivial part, with little practical difference between an admittedly cute jQuery-esque parsing library and even just a blunt regular expression.

It would be quixotic to simply retort that sites should provide "proper" APIs, but I would love to see more attempts at solutions that go beyond the superficial.

Twisell over 8 years ago
What bothers me the most is that recently I wanted to extract an archive of all the threads I participated in on an Internet forum. The webmaster told me that the BBS software he uses doesn't provide such a function and that I just had to download each thread manually... (300+ threads in my case).

He then said that it doesn't bother him if I scrape these threads. And I'm currently figuring out how to manage his site's cookie-protected search feature, so that my painstaking effort (I'm not a dev, more a DB guy) could be reproduced more easily by other users of this service.

But this shouldn't happen in the first place, because all posts on this service are stored in a cleanly organized MySQL DB. Yet since no export method is provided, the only way to get structured data back is by scraping (the webmaster told me that no, he won't run custom SQL because "he doesn't want to mess with his DB").

So even though all the data is publicly available through the forum, only a geek can download a personal archive... or Google, because Google scrapes and stores everything.

KennyCason over 8 years ago
As someone who does a lot of scraping, I was happy to learn about Antigate :)

kingkool68 over 8 years ago
It's trivial to scrape public Instagram URLs...

https://github.com/kingkool68/zadieheimlich/blob/master/functions/instagram.php#L421-L428

etatoby over 8 years ago
Does anybody know what the author means by "lead" (noun)?

I don't think it's any of the regular meanings: http://www.ldoceonline.com/search/?q=Lead

But it doesn't seem to be any of these slang terms either: http://www.urbandictionary.com/define.php?term=lead

writeslowly over 8 years ago
Have you run into any issues from running all of your scrapers off of AWS, or just from sites detecting that you're accessing large numbers of pages in some sort of obvious pattern? I guess I was hoping there would be sites with more interesting ways to screw with web scrapers (rearranging certain page elements or something) than just throwing up a CAPTCHA.

zzzcpan over 8 years ago
> But if you are automating your exact actions that happen via a browser, can this be blocked?

Of course it can! You won't be able to defeat even the simplest anti-scraping measure based on statistical data. Even something as simple as the site keeping a list of individual rate limits for the /16 subnets of actual visiting users will get you in trouble.

kevindeasis over 8 years ago
Does cheerio account for single-page apps? In any case, thanks for the tutorial!

Anyway, I added your stuff here along with other data-mining resources:

https://github.com/kevindeasis/awesome-fullstack#web-scraping

elchief over 8 years ago
To fight scrapers, we show some values as images that look like text (but not all the time).

And we insert random (non-visible) HTML elements and CSS classes into our site to screw with 'em, and use randomized CSS class names. This fucks with XPaths and CSS selectors.

You can't stop them, but you can make their lives painful.

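A minimal sketch of the randomized-markup idea described above; the names and markup here are illustrative only, not the commenter's actual implementation:

```javascript
// Randomize class names per response and sprinkle hidden decoy nodes so that
// hard-coded CSS selectors and XPaths stop matching reliably.
const crypto = require('crypto');

const randomClass = () => 'c' + crypto.randomBytes(4).toString('hex');

function renderPrices(rows) {
  const priceClass = randomClass(); // different on every request
  const decoy = () =>
    `<span class="${randomClass()}" style="display:none">${(Math.random() * 100).toFixed(2)}</span>`;

  return rows
    .map(r =>
      `<div class="${randomClass()}">${decoy()}` +
      `<span class="${priceClass}">${r.price}</span>${decoy()}</div>`)
    .join('\n');
}

console.log(renderPrices([{ price: '19.99' }, { price: '4.50' }]));
```
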
skeletonjelly over 8 years ago
Hooray Melbourne! Would be interested in seeing this at a meetup group if you were thinking of presenting.

frostymarvelous over 8 years ago
While everyone is busy debating whether scraping is bad or legal, I just can't stop thinking about Antigate, and about the sweatshops that must have been set up to deliver this service. That, to me, is the true horror of this story.

slig over 8 years ago
I wonder how effective the Cloudflare anti-scraper protection is against this approach of breaking CAPTCHAs.

Also, I find it interesting that big websites don't just block all traffic from AWS IPs as they do with Tor.

unixhero over 8 years ago
And from the trenches:

- Rails application
- scraping with the Nokogiri gem in Ruby
- simple models doing the scraping in the Rails app
- some scraping parsed with CSS selectors (Nokogiri)
- some scraping parsed with regex (Nokogiri)
- persisting to DB, text, even Google Docs
- presentation on web, text, PDF, XLS

Boom

ge95 over 8 years ago
How do you push a button, like hitting "next" on a paginated page?

rch over 8 years ago
There is so much that's missing from this. What about gathering tokens from customers vs. paying for social data feeds? How about canned services like 80legs?

ben_jones over 8 years ago
Currently getting a 502 Bad Gateway. Guessing this post is also trending on Reddit and we hugged it to death :(

rezashirazian over 8 years ago
When I was building liisted.com, I scraped using Selenium and it worked great.