Web Scraping to Create Open Data

176 points by stummjr about 9 years ago

6 comments

jnotarstefano about 9 years ago
I had to use a similar approach when creating a cluster analysis of the amendments in the Italian Senate [0].

The Italian Senate offers a SPARQL endpoint [1], which unfortunately doesn't offer access to the texts of the amendments. So I had to roll my own and create a small spider for them using Scrapy [2].

[0]: https://github.com/jacquerie/senato.py/blob/master/analysis.ipynb
[1]: http://dati.senato.it/23
[2]: https://github.com/jacquerie/senato.py/blob/master/senato/spiders/senato_spider.py
harperlee about 9 years ago
So what is the legality of this? Apart from the risk of having someone pull the plug on the way one takes the information out, when is something without a proper license able to be used?
minimaxir about 9 years ago
I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are *jerks* for not doing so, and therefore scraping is the *moral* thing to do.

I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.

(For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.)
rakoo about 9 years ago
"Web scraping to create Open Data" is the exact reason why weboob (http://weboob.org/) was created and still thrives today. CityBikes already seems to be doing a big part of the job, and in Python no less, so it should be easy to integrate its data and use it with Boobsize (http://weboob.org/applications/boobsize.html).
PlzSnow about 9 years ago
Can anyone tell me which cloud provider they are using? I want to make sure that scrapinghub are on the list. I block the IP addresses of all the major cloud providers to prevent parasites such as this.
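
For readers wondering what that kind of blocking looks like in practice, here is a minimal sketch using Python's standard `ipaddress` module. It is not the commenter's actual setup: the `BLOCKED_RANGES` list and the `is_blocked` helper are hypothetical names, and the CIDR ranges are RFC 5737 documentation ranges standing in for whatever ranges a cloud provider publishes.

```python
import ipaddress

# Hypothetical blocklist. These are RFC 5737 documentation ranges, used only
# as placeholders for ranges a cloud provider might publish.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any network in BLOCKED_RANGES."""
    addr = ipaddress.ip_address(client_ip)
    # Membership is False when the address and network versions differ,
    # so an IPv6 client is simply not matched by these IPv4 ranges.
    return any(addr in net for net in BLOCKED_RANGES)
```

A web server or middleware would call something like `is_blocked(request_ip)` before serving a page. In a real deployment this check usually lives in the firewall or reverse proxy rather than in application code.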
l1n about 9 years ago
Heh. I do this with my Student Government data [1].

[1] https://umbc.lin.anticlack.com/finance/