<p><pre><code> if you can see or submit data using the website, it means the website does
have some kind of API ... For example, if you do a search in Yahoo, you can
see the search page sent to Yahoo’s servers has the following url ...
https://search.yahoo.com/search?p=search+term
</code></pre>
no. no no no. this is not an API. this is about as far from an application programming INTERFACE as it can get. an API means an agreed format, where there's a contract (social or otherwise) to provide stability to APPLICATION clients. there's no contract here other than 'a human types something into the box, presses some buttons and some results appear on the website'.<p>/search?p=search+term is an implementation detail hidden from the humans the site is built for. they can, and most likely will, change this at any time. the HTML returned (and being scraped) is an implementation detail. today, HTML. tomorrow, AJAX. next week? who knows, maybe Flash.<p>fine, it's a scraper builder. but don't call what it's using an API, and don't imply it's anything more than a fragile house of cards built on the shaky foundation of 'if you get noticed you're going to get banned or a cease and desist'.
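to make that concrete, here's a minimal sketch (Python, purely illustrative) of the kind of 'API' being described: fetch the public search page and parse whatever markup comes back. the URL is the one quoted above; the 'h3.title a' selector is an assumption about today's markup, which is exactly the point - nothing promises it still matches tomorrow.<p><pre><code>
# a scraper pretending to be an API client. everything below the url
# encodes an implementation detail the site is free to change at any time.
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def search(term):
    url = "https://search.yahoo.com/search?" + urllib.parse.urlencode({"p": term})
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    soup = BeautifulSoup(html, "html.parser")
    # brittle: this selector describes today's markup, not any promised contract
    return [a.get("href") for a in soup.select("h3.title a")]
</code></pre>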
Just as a matter of record, you risk getting your IP blacklisted by using something like this without the web site's permission. Perhaps the poster child for web sites that go apeshit is Craigslist, but most sites respond in one way or another. One of my favorites is the Markov-generated search results that Bing sends back to robots.
FYI: Gargl vs Kimono - mentioned at the bottom of the original article.
<a href="http://jodoglevy.com/jobloglevy/?p=146" rel="nofollow">http://jodoglevy.com/jobloglevy/?p=146</a>
I think a better business model would be creating a service that identifies scrapers, and then blocks them. I think one might already exist, though I can't remember its name.
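For a sense of what the simplest version of such a service might do, here's a toy sketch of rate-based detection. The function name and thresholds are invented, and real products layer far more on top (fingerprinting, behaviour models, IP reputation); this is only meant to show the basic idea.<p><pre><code>
# toy sketch: flag any client that exceeds a request budget inside a
# sliding time window. names and thresholds are made up for illustration.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120  # assumed threshold, tuned per site

_history = defaultdict(deque)  # client ip -> recent request timestamps


def looks_like_scraper(client_ip, now=None):
    now = time.time() if now is None else now
    hits = _history[client_ip]
    hits.append(now)
    # drop timestamps that have fallen out of the window
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS_PER_WINDOW
</code></pre>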
So far I do this kind of stuff with wget, sed and awk, but it's nice to see some more thought-out alternatives.<p>What I like most about your competition, though, is that the JS interface gets used for one last good thing (before the page is properly scraped and stripped of ads and JavaScript): clicking on the content you want and deselecting content you don't want. Subtly, with your mouse, you lead a pattern-matching algorithm through the annoying work.<p>Honestly, the simplicity of this interface is even more breathtaking to me than Gargl :P
But it's even more limited: after two clicks it thinks it has already understood the pattern, even though that might not be the case.<p>I'd suggest integrating the idea, but making the learning process more clever: make it possible to select more things even when the engine thinks there can't be any more similar things. Give that AI more to learn from. We want richer identifiers than just counts and HTML elements: "2nd subelement of <h1>", for example.<p>There's good stuff you can do with statistics, too. Some data exists only once, some exists only 3 times, some always exists over 10 times. That's valuable info. Some data is many words of whitespace-separated text - oh, a paragraph!<p>tl;dr: We need something that automatically generates good semantics out of normal web sites, so that users can use a simple web UI overlaid on the target site to choose the right pattern.
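To sketch what richer identifiers could look like in code: describe each clicked element by its path of (tag, sibling-index) steps, wildcard the steps where two clicks differ, and match everything that fits the generalised path. This is purely illustrative; all names are invented, and a real engine would fold in the frequency statistics mentioned above.<p><pre><code>
# rough sketch of learning a pattern from clicked elements (bs4 Tag objects)
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def path_of(el):
    # (tag, index among same-tag siblings) steps from <html> down to el
    steps = []
    while el is not None and el.name not in (None, "[document]"):
        siblings = el.parent.find_all(el.name, recursive=False)
        idx = next(i for i, sib in enumerate(siblings) if sib is el)
        steps.append((el.name, idx))
        el = el.parent
    return list(reversed(steps))


def generalise(paths):
    # keep steps shared by all examples; wildcard indices that vary
    pattern = []
    for steps in zip(*paths):
        tags = {tag for tag, _ in steps}
        idxs = {idx for _, idx in steps}
        if len(tags) != 1:
            break  # structures diverge here, stop generalising
        pattern.append((tags.pop(), idxs.pop() if len(idxs) == 1 else None))
    return pattern


def matches(root, pattern):
    # all elements whose path fits the (possibly wildcarded) pattern
    nodes = [root]
    for tag, idx in pattern:
        nxt = []
        for node in nodes:
            children = node.find_all(tag, recursive=False)
            nxt.extend(children if idx is None else children[idx:idx + 1])
        nodes = nxt
    return nodes


if __name__ == "__main__":
    soup = BeautifulSoup(
        "<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>",
        "html.parser")
    first, second = soup.find_all("li")[:2]  # pretend the user clicked these
    pattern = generalise([path_of(first), path_of(second)])
    print([li.text for li in matches(soup, pattern)])  # -> ['a', 'b', 'c']
</code></pre>
Counting how many elements each candidate pattern matches is where the "exists once vs. exists over 10 times" statistics would plug in.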
IANAL et al., but unless I'm mistaken, <i>generating</i> an API by analyzing requests and responses would be fine (under the purview of "research purposes"), unless you then subsequently <i>use</i> the generated API to access the service.<p>Also, it seems like authenticated sites would be difficult to scrape with this, i.e. ones that require login and possibly some logic (like sending a hash of request parameters) with every request.
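On the second point, here's a hypothetical example of the kind of request-signing logic that pure record-and-replay would struggle with: each request carries an HMAC over the sorted parameters plus a per-session secret and a timestamp. The parameter names and the scheme are invented; the point is only that the signature can't be reproduced from a captured URL once the secret or timestamp changes.<p><pre><code>
# hypothetical request-signing scheme: sig = HMAC(secret, sorted params)
import hashlib
import hmac
import time
import urllib.parse

SESSION_SECRET = b"issued-at-login"  # assumption: obtained after authenticating


def sign_params(params):
    params = dict(params, ts=str(int(time.time())))
    canonical = urllib.parse.urlencode(sorted(params.items()))
    params["sig"] = hmac.new(SESSION_SECRET, canonical.encode(),
                             hashlib.sha256).hexdigest()
    return urllib.parse.urlencode(params)


# e.g. "https://example.com/api/items?" + sign_params({"q": "widgets", "page": "2"})
</code></pre>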
this is a clever idea - i had a similar idea many, many years ago in fact, but never followed through because i strongly disagree with making it easy to e.g. abuse google or yahoo by spamming their search engine. as much as i disagree with keeping proprietary secrets, i agree more that people should have the freedom of choice to do that...<p>in that regard it's nice to see the big warning at the top of the page about ease of misuse (and a refreshing slap in the face - i was thinking 'pfft, some hipster forgot common sense again' and expecting not to see anything of the kind)<p>there is something off here and i can't quite put my finger on it though... as a low-level programmer I cringe when I hear web people using API to describe some weird little subset of APIs anyway. Here I feel almost like what this does is take an existing 'API' (http - the internet) and refactor the interface in highly specific ways to make it easier to use...<p>At any rate, it's a clever idea and nice to see such a well-thought-through implementation - but it's also far too open to misuse imo. I wish the creator the best of luck... hopefully no takedown requests too soon.
Your description of the problem and solution is too verbose. I need bullet points describing 1) my problems, 2) how my problems are solved by this. I'm not going to read a full-on blog post to figure out if this is relevant to me.
so the legality hinges on how the data is used? if your users scrape a site and you host the results through accessible means, you can get sued, but not if you just provide a flat CSV file?<p>Armchair lawyers, please advise; we need more details.