I had to use a similar approach when creating a cluster analysis of the amendments in the Italian Senate [0].<p>The Italian Senate offers a SPARQL endpoint [1], which unfortunately doesn't offer access to the texts of the amendments. So I had to roll my own and create a small spider for them using Scrapy [2].<p>[0]: <a href="https://github.com/jacquerie/senato.py/blob/master/analysis.ipynb" rel="nofollow">https://github.com/jacquerie/senato.py/blob/master/analysis....</a><p>[1]: <a href="http://dati.senato.it/23" rel="nofollow">http://dati.senato.it/23</a><p>[2]: <a href="https://github.com/jacquerie/senato.py/blob/master/senato/spiders/senato_spider.py" rel="nofollow">https://github.com/jacquerie/senato.py/blob/master/senato/sp...</a>
So what is the legality of this? Apart from the risk of someone pulling the plug on the channel you use to get the information out, when can data published without an explicit license actually be used?
I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, that those websites are <i>jerks</i> for not providing one, and that scraping is therefore the <i>moral</i> thing to do.<p>I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest it, since data is what drives the Internet ecosystem.<p>(For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.)
"Web scraping to create Open Data" is the exact reason why weboob (<a href="http://weboob.org/" rel="nofollow">http://weboob.org/</a>) was created and still thrives today. CityBikes already seems to be doing a big part of the job, and in Python no less, so it should be easy to integrate its data and use it with Boobsize (<a href="http://weboob.org/applications/boobsize.html" rel="nofollow">http://weboob.org/applications/boobsize.html</a>)
Can anyone tell me which cloud provider they are using? I want to make sure that scrapinghub are on the list. I block the IP addresses of all the major cloud providers to prevent parasites such as this.
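For anyone wondering what that kind of range-based blocking looks like in practice, here is a minimal sketch using Python's standard `ipaddress` module. It is not the commenter's actual setup. The `BLOCKED_RANGES` list and the `is_blocked` helper are hypothetical names, and the CIDR blocks are reserved documentation ranges standing in for whatever ranges a cloud provider actually publishes.

```python
import ipaddress

# Hypothetical block list. These are reserved documentation ranges
# (RFC 5737 and RFC 3849), used here as placeholders for real provider ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked CIDR range."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed input is treated as "not blocked" in this sketch;
        # a real deployment might prefer to reject it outright.
        return False
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to scan with a single loop.
    return any(address in network for network in BLOCKED_RANGES)
```

A request handler or firewall hook would call `is_blocked(remote_addr)` before serving the page. For large lists, a linear scan like this gets slow, and a prefix tree or the firewall's own set type is probably a better fit.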
Heh. I do this with my Student Government data [1].<p>[1] <a href="https://umbc.lin.anticlack.com/finance/" rel="nofollow">https://umbc.lin.anticlack.com/finance/</a>