Ask HN: How Much Can I Scrape?

20 points by brentr over 16 years ago

I am working on a financial software project. I have written Python code to fetch all of the historical price data for each stock in the S&P 500. I have tested the code with an input file of five ticker symbols, and it runs perfectly. I would like to get data for all 500 stocks in the S&P 500; however, I don't know whether Yahoo would be happy with me collecting this much data. My program only sends out one request per minute, but I am still worried about turning my system loose.

Has anyone else done something similar? For the people who own their own sites, how do you view scraping? Should I contact someone at Yahoo first?
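For reference, a fixed-spacing loop like the one described above might look like the sketch below. It is not the poster's code: `fetch_all`, the `fetch` callable, and the injectable `sleep` are hypothetical names chosen for illustration, and no real Yahoo endpoint is assumed. At one request per minute, 500 symbols take a little over 8 hours.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all(
    symbols: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Call fetch(symbol) for each symbol, pausing delay_seconds between calls.

    `fetch` is whatever function actually downloads the data (an assumption
    here); `sleep` is injectable so the loop can be tested without waiting.
    """
    results: Dict[str, str] = {}
    for index, symbol in enumerate(symbols):
        if index > 0:
            # Wait between requests, but not before the first one.
            sleep(delay_seconds)
        results[symbol] = fetch(symbol)
    return results
```

Passing `fetch` in as a parameter keeps the throttling logic separate from whichever data source ends up being used.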

16 comments

m0nty over 16 years ago

Disguise your scrapes as a browser, so include an Explorer or Firefox browser ID string. Randomize the times between scrapes, so it looks more like a human being doing it. Make sure the scraper takes "coffee breaks" every now and then. Run the service from several servers at once, if you have them. I would guess your program is fairly low overhead (mine always have been) so contact friends and ask to use their server or home PCs. Extra credit for designing a cloud-like infrastructure where PCs could come-and-go without missing any data :)
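The randomized spacing and "coffee breaks" suggested above could be sketched roughly as follows. Every number here (60-second base, 30-second jitter, a 15-minute break every 25 requests) is an arbitrary assumption for illustration, and `next_delay` is a hypothetical helper, not something from the thread.

```python
import random


def next_delay(
    request_index: int,
    rng: random.Random,
    base: float = 60.0,
    jitter: float = 30.0,
    break_every: int = 25,
    break_seconds: float = 900.0,
) -> float:
    """Return how many seconds to wait before request number request_index."""
    # Spread the wait uniformly around the base interval.
    delay = base + rng.uniform(-jitter, jitter)
    # Every break_every requests, add a long "coffee break" on top.
    if request_index > 0 and request_index % break_every == 0:
        delay += break_seconds
    return delay
```

Passing in a `random.Random` instance (optionally seeded) keeps the schedule reproducible when debugging.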

spc476 over 16 years ago

How far back? EODData (http://eoddata.com/) has 15 years of pricing information for $20 for a number of exchanges, which would certainly save time and isn't horribly expensive for what you get.

enomar over 16 years ago

Hate to be obvious, but reading their API terms of service might be a good place to start...

lacker over 16 years ago

If you contact someone at Yahoo, the response will be: do not scrape us in any way.

The problem with financial data is that Yahoo (like most other sites where you might find this data) doesn't generate this data themselves. They license it from other companies, and the licensing agreement typically prohibits or greatly restricts Yahoo's ability to provide the data to third parties.

That said, if Yahoo is not aware that you are scraping them, they cannot stop you. They certainly do have anti-scraper algorithms (you will start getting HTTP 999 errors), but they will not kick in until you cross some invisible threshold. You can probably use Tor with no problem.

However, if you get large enough that someone notices, you will probably get some sort of cease & desist letter. Depending on your goals, that might not be a problem for you.

oakmac over 16 years ago

I did this exact same thing a few years ago, only pulling the data from nasdaq.com using Perl. I used to hit their site once every 2 seconds for roughly 2,500 stocks, every day for about 6 months. I would only grab as much data as I needed. They never contacted me or blocked my IP address. I also had a friend who was doing the same thing for a longer period of time.

From experience, I would not recommend getting your data from Yahoo. I looked at them first, but their data is just not as good as the source.

If you would like more information or my notes on how I reverse-engineered the nasdaq.com URL scheme, please send me an email.

kaens over 16 years ago

If I had a site that was scrape-worthy, I wouldn't care about it if the people were respectful about it (wait a second or two in between requests, don't hammer my server).

From the business side, I could see them getting a bit grumpy about it, but if it's publicly available information, and there's nothing in their TOS about it, I don't see how they could do anything about it. Again, unless you're being a dick with your scraper.

Does anyone know off the top of their head if there are any relevant court cases dealing with scraping?

mikkom over 16 years ago

I've downloaded all the data they have for the S&P 500 many times (I did it with processes, spawning one download about every 0.1 seconds). They block your connections if you download too fast.

If they give out CSV exports, as they do, there is no reason why someone wouldn't download them and use them for personal use.

I guess you already know about the CSV download, but if you don't, here is a link about it: http://www.diytraders.com/content/view/25/43/

I would, however, never ever use them in a commercial product, if that's what you are asking.
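Once a CSV export like that is on disk, parsing it takes only a few lines. The sketch below assumes a header of `Date,Open,High,Low,Close,Volume`; that column layout and the two sample rows are invented for illustration and should be checked against the real file.

```python
import csv
import io
from typing import Dict, List, Union

# Invented sample data; the column names are an assumption about the export.
SAMPLE = """Date,Open,High,Low,Close,Volume
2008-11-10,10.00,10.50,9.80,10.20,1000
2008-11-11,10.20,10.40,10.00,10.10,1200
"""


def parse_prices(csv_text: str) -> List[Dict[str, Union[str, float, int]]]:
    """Turn CSV text into a list of {date, close, volume} records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "date": row["Date"],
            "close": float(row["Close"]),
            "volume": int(row["Volume"]),
        }
        for row in reader
    ]


rows = parse_prices(SAMPLE)
print(len(rows), rows[0]["date"], rows[0]["close"])
```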

xefyr over 16 years ago

If you're really concerned about it you can go through an anonymizing proxy service. But, as has been said, if you have the time, spacing out your requests should work fine too.

qhoxie over 16 years ago

I got blocked during RailsRumble for pulling too much from Y! Finance. We did not have the time to throttle it.

You should at least try to email them and see if your restrictions can be loosened.

dpmorel over 16 years ago

We scraped Yahoo Mail quite heavily for about 6 months. We had to keep it at a >5 minute timer; otherwise we got captchas during the auth process, or we got locked out for 24 hours with error 999.

We now have a formal agreement with Yahoo, but during the process Yahoo indicated they had an informal open policy on scraping. Note that they have an initiative to open up all services within the next year or so (google Yahoo Open Strategy to read more about timelines).

redorb over 16 years ago

At 1 request per minute, I don't think Yahoo would even notice.

bgtony over 16 years ago

eBay v. Bidder's Edge is particularly interesting, as is verticalone.com, now yodlee.com.

Trespass to chattels is an old common-law doctrine that dictates what should be done if you trespass on my land and hurt one of my cattle, and it is used as the core of most cases involving scraping in unauthenticated environments (like Y! Finance). They can come get you, not for taking data, but for costing them money to support the response volumes you demand. The magic number is $5,000, at which point it becomes a felony (or at least that was the threatening rhetoric, which is a different story altogether). You scrape, hurt their cow for 5k, and it is not a question of restitution, but of punishment. And in each case the scraper is typically viewed as a "thief"... not a label that inspires lighter punishments. See www.biddersedge.com: yeah, exactly. Nuked from orbit... and there, in a nutshell, is the risk inherent in scraping. All a scrapee has to do is wait for you to pass $5k... while they consider the PR ramifications of the whole thing... how much bandwidth and resources need to be used before the public will sympathize? 10k? 20k? 30k before they are lauded as a hero for removing the thieving vermin?

Insidious really, scraping and scrapers are being "set up the bomb" here... to not be viewed as enabling the liberation of data, but rather as thieves of the resources necessary to deliver that data to the general public. Using trespass to chattels as a precedent is therefore a brilliant stroke... apparently, they can be taught. Or, to put it another way, scrapers aren't Napster users in dorm rooms, they are felony thieves of public resources.

Yeah, we all know that cease and desist and all other legal remedies are jurisdictionally challenged: the net doesn't stop at international borders. And, historically, it seems that other countries turn a deaf ear to most cyber crime, except, of course, for credit card fraud.

Also, limit scraping via Tor. Tor has a legitimate use which scraper volumes would impact. Of course, there are Tor nets set up for "illegitimate" use... and they let anybody in, including folks like me, who then map all Tor exit nodes used by scrapers and interdict 'em all...

And don't forget steganography... you take data (even through Tor or rotating proxies) and redisplay it, Google can find it, and I can ask Google to tell me where it is. Scrapers, even as very clever data middlemen, will get the squeeze from both sides as scrapees discover where their data is being displayed and utilize legal means to go after those storefronts, who will of course first provide name, rank and serial number of the scraper that provided them the data...

And what about copyrights? Lots of legal precedent here; be careful with image redisplay. Mine field here...

toddcw over 16 years ago

This might help: http://blog.screen-scraper.com/2007/03/01/how-to-surf-and-screen-scrape-anonymously/

ca98am79 over 16 years ago

What kind of data do you want? Real-time, or end-of-day? If you just want end-of-day data, it is simple. Just write a script that collects it in the middle of the night and stores it in your database. I don't think they mind at all if you just do it once a day for all of the stocks. I know people who have been doing it for years. They use this: http://www.gummy-stuff.org/Yahoo-data.htm

If you want real-time data, good luck. It will cost you.
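The "store it in your database" step could be as small as the sketch below, which uses Python's built-in `sqlite3`. The table name `eod`, its columns, and the helper names are all hypothetical choices for illustration; keying on (symbol, date) means a re-run on the same night overwrites a row instead of duplicating it.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the price database; defaults to in-memory for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eod ("
        " symbol TEXT NOT NULL,"
        " date TEXT NOT NULL,"
        " close REAL NOT NULL,"
        " PRIMARY KEY (symbol, date))"
    )
    return conn


def save_close(conn: sqlite3.Connection, symbol: str, date: str, close: float) -> None:
    """Insert one end-of-day close, replacing any existing row for that day."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO eod (symbol, date, close) VALUES (?, ?, ?)",
            (symbol, date, close),
        )
```

A nightly cron job would call `open_store` with a file path and then `save_close` once per symbol.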

hotpockets over 16 years ago

I don't think you have anything to worry about. I've scraped Yahoo Finance before at a rate of about one request per second, using Perl's YahooFinance module.

yawl over 16 years ago

Do not crawl Yahoo too fast, otherwise you will get a 999 error, which means you will be banned temporarily. Search 'yahoo 999' for details.
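One way a scraper might react to that 999 response is to back off for progressively longer periods rather than retrying immediately. The sketch below is only an illustration: the 5-minute starting wait and 24-hour cap are assumptions loosely based on the timings mentioned in this thread, and `wait_before_retry` is a hypothetical helper.

```python
from typing import Optional

BLOCKED_STATUS = 999  # non-standard status code the thread reports from Yahoo


def wait_before_retry(
    status: int,
    attempt: int,
    base: float = 300.0,
    cap: float = 86400.0,
) -> Optional[float]:
    """Return seconds to pause after a blocked response, or None if not blocked.

    The wait doubles with each consecutive blocked attempt (0, 1, 2, ...)
    and never exceeds cap.
    """
    if status != BLOCKED_STATUS:
        return None
    return min(cap, base * 2 ** attempt)
```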