TechEcho

15 comments

mlandauer about 11 years ago

If you're concerned about using a hosted scraping platform because it might disappear, check out https://morph.io - it's open source as well: https://github.com/openaustralia/morph

uuid_to_string about 11 years ago

As a PoC, I would be willing to "turn the web into data", i.e., produce one of the formats offered by these "services": CSV.

I will use only standard UNIX utilities, no Python, etc. As such, you "own" the code. No SaaS. The result will be portable and run on any UNIX.

I believe I can deliver in fewer words of code and that the result will be easier to modify when sites change.

You pay nothing. Post your scraping "challenges" to HN.

I enjoy turning the web into data.

Some people enjoy working with HTML, CSS, JavaScript, etc. I prefer working with raw data.

It is interesting to hear that some people are willing to pay to have the HTML, CSS, JavaScript, etc. stripped out.

ycmike about 11 years ago

HN, so who do you guys use more: Import.io or Kimono? I have heard good things about both.

RaphiePS about 11 years ago

There are a bunch of comments about rolling your own scraper instead of relying upon a possibly unreliable SaaS app.

That makes me think -- would it be viable to run a service that, instead of running the scraping on their own servers, simply gave you a custom binary to run?

Assuming that you trusted the executable, you would never have to worry about the company failing. It'd just be a one-time fee, and yours to use in perpetuity. Presumably updates would be free.

robotfelix about 11 years ago

Great to see these guys are now out of beta!

While their real-time Extractors aren't quite as quick as doing it yourself, we've found them to be particularly useful for sites requiring JavaScript and/or cookies to use.

It's also worth mentioning that it's quick to get started. You can start playing around with real data without having to dig into a site's URL structure, and then write your own scraper later if needed.

chrisherring about 11 years ago

Isn't it illegal to scrape without permission? How would import.io handle the case when a large site comes back with legal threats because a user of their service has scraped the wrong site? Can they claim non-responsibility?

Also, what happens when sites start blocking their IPs due to repeated scraping, or is this unlikely to happen?

seivan about 11 years ago

Heads up, the application is placed in ~/Desktop and not /Applications

th0br0 about 11 years ago

They presented last year at Yahoo!'s Hack Europe: London hackathon. It's an interesting concept, they've come far since their initial presentation, and while the app has its quirks I have come to use it occasionally for some tasks.

I hope that they'll manage to properly monetize this - I don't see why I should pay for using a scraping rule if I can just write the scraper myself, which doesn't cost me that much more time.

fibertera about 11 years ago

What kind of legitimate uses are there for something like this? This is not a sarcastic question. It seems like an obvious spam magnet, but if people are using it legitimately wouldn't their sources already be providing an API or RSS key?

thom about 11 years ago

I suspect the real, top-secret business behind import.io is training a system to crawl the web and see structured data, and/or gathering over time a very rich crowd-sourced database of structured data.

jmethvin about 11 years ago

We've posted answers to some of your questions on our blog: http://blog.import.io/post/you-ask-we-answer

pmtarantino about 11 years ago

Can someone tell me more about the law and scraping websites?

late2part about 11 years ago

Unfortunately, this doesn't seem to work too well on my Mac. And why do you want to know who my friends on Facebook are?

notduncansmith about 11 years ago

Reminds me of https://www.kimonolabs.com/

notastartup about 11 years ago

I wrote http://scrape.ly if you wanna have a look; it's a URL-based API for web scraping.

Import.io – Structured Web Data Scraping
