If you're concerned about using a hosted scraping platform because it might disappear, check out <a href="https://morph.io" rel="nofollow">https://morph.io</a>. It's open source as well: <a href="https://github.com/openaustralia/morph" rel="nofollow">https://github.com/openaustralia/morph</a>
As a PoC, I would be willing to "turn the web into data", i.e., produce one of the formats offered by these "services": CSV.<p>I will use only standard UNIX utilities, no Python, etc. As such, you "own" the code. No SaaS. The result will be portable and run on any UNIX.<p>I believe I can deliver in fewer words of code and that the result will be easier to modify when sites change.<p>You pay nothing. Post your scraping "challenges" to HN.<p>I enjoy turning the web into data.<p>Some people enjoy working with HTML, CSS, Javascript, etc. I prefer working with raw data.<p>It is interesting to hear that some people are willing to pay to have the HTML, CSS, Javascript, etc. stripped out.
There are a bunch of comments about rolling your own scraper instead of relying upon a possibly unreliable SaaS app.<p>That makes me think -- would it be viable to run a service that, instead of running the scraping on their own servers, simply gave you a custom binary to run?<p>Assuming that you trusted the executable, you would never have to worry about the company failing. It'd just be a one-time fee, and yours to use in perpetuity. Presumably updates would be free.
Great to see these guys are now out of Beta!<p>While their real-time Extractors aren't quite as quick as doing it yourself, we've found them to be particularly useful for sites requiring JavaScript and/or cookies to use.<p>It's also worth mentioning that it's quick to get started. You can start playing around with real data without having to dig into a site's URL structure, and then write your own scraper later if needed.
Isn't it illegal to scrape without permission? How would import.io handle the case where a large site comes back with legal threats because one of their users has scraped the wrong site? Can they claim non-responsibility?<p>Also, what happens when sites start blocking their IPs due to repeated scraping? Or is this unlikely to happen?
They presented last year at Yahoo!'s Hack Europe: London hackathon. It's an interesting concept; they've come far since their initial presentation, and while the app has its quirks, I have come to use it occasionally for some tasks.<p>I hope that they'll manage to properly monetize this. I don't see why I should pay for using a scraping rule if I can just write the scraper myself, which doesn't cost me that much more time.
What kind of legitimate uses are there for something like this? This is not a sarcastic question. It seems like an obvious spam magnet, but if people are using it legitimately, wouldn't their sources already be providing an API or RSS key?
I suspect the real, top-secret business behind import.io is training a system to crawl the web and see structured data, gathering over time a very rich crowd-sourced database of structured data, or both.
We've posted answers to some of your questions on our blog: <a href="http://blog.import.io/post/you-ask-we-answer" rel="nofollow">http://blog.import.io/post/you-ask-we-answer</a>