TechEcho

6 comments

jnotarstefano about 9 years ago

I had to use a similar approach when creating a cluster analysis of the amendments in the Italian Senate [0].

The Italian Senate offers a SPARQL endpoint [1], which unfortunately doesn't offer access to the texts of the amendments. So I had to roll my own and create a small spider for them using Scrapy [2].

[0]: https://github.com/jacquerie/senato.py/blob/master/analysis.ipynb
[1]: http://dati.senato.it/23
[2]: https://github.com/jacquerie/senato.py/blob/master/senato/spiders/senato_spider.py
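For readers unfamiliar with what a "small spider" does, here is a minimal sketch of the link-collection step. It is not the spider linked above and it does not use Scrapy: it swaps in Python's standard-library `html.parser` so it runs without dependencies. The class name `LinkCollector`, the sample HTML, and the `example.org` URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links a spider would follow.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative paths against the page URL.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical listing page; a real spider would download this HTML.
SAMPLE_PAGE = """
<ul>
  <li><a href="/amendments/1.html">Amendment 1</a></li>
  <li><a href="/amendments/2.html">Amendment 2</a></li>
</ul>
"""

collector = LinkCollector("http://example.org/")
collector.feed(SAMPLE_PAGE)
print(collector.links)
```

A framework such as Scrapy adds the parts this sketch leaves out: downloading pages, scheduling requests, and following the collected links.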

harperlee about 9 years ago

So what is the legality of this? Apart from the risk of someone pulling the plug on the way you get the information out, when can something without a proper license be used?

minimaxir about 9 years ago

I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are jerks for not doing so, and therefore scraping is the moral thing to do.

I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.

(For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.)

rakoo about 9 years ago

"Web scraping to create Open Data" is the exact reason why weboob (http://weboob.org/) was created and still thrives today. CityBikes already seems to be doing a big part of the job, and in Python no less, so it should be easy to integrate its data and use it with Boobsize (http://weboob.org/applications/boobsize.html).

PlzSnow about 9 years ago

Can anyone tell me which cloud provider they are using? I want to make sure that scrapinghub are on the list. I block the IP addresses of all the major cloud providers to prevent parasites such as this.
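The kind of IP-range blocking described here can be sketched with Python's standard-library `ipaddress` module. This is only an illustration, not the commenter's setup: the function name `is_blocked` is hypothetical, and the listed networks are reserved documentation ranges (RFC 5737) standing in for whatever ranges a provider actually publishes.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks (RFC 5737),
# NOT real cloud-provider ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip, networks=BLOCKED_NETWORKS):
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


print(is_blocked("192.0.2.17"))   # inside the first placeholder range
print(is_blocked("203.0.113.5"))  # outside both placeholder ranges
```

In practice this check usually lives in a firewall or reverse proxy rather than in application code, and the range list has to be refreshed as providers change their allocations.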

l1n about 9 years ago

Heh. I do this with my Student Government data [1].

[1]: https://umbc.lin.anticlack.com/finance/


Web Scraping to Create Open Data
