The problem with these sorts of solutions is that they work perfectly for 'simple' sites like The Register, but fail hard with 'modern' sites like ASOS.com. I just tried ASOS and the web front end failed to request a product page correctly...<p>All the dynamic JS and whatnot just plays havoc with these projects. In my experience you have to run the page through WebDriver or something like PhantomJS, let the JS execute, and parse the resulting DOM...
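Something along these lines is what I mean (just a sketch; the URL and XPath are placeholders, not real ASOS markup):

    # Let a headless browser execute the page's JS, then parse the rendered DOM.
    from selenium import webdriver
    from lxml import html

    driver = webdriver.PhantomJS()  # or webdriver.Firefox() / webdriver.Chrome()
    driver.get("http://www.example.com/some-product-page")

    tree = html.fromstring(driver.page_source)  # the DOM *after* the JS has run
    print(tree.xpath("//h1/text()"))

    driver.quit()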
I expected an April Fool's joke and found something pleasantly awesome and useful instead.<p>Source is here: <a href="https://github.com/scrapinghub/portia" rel="nofollow">https://github.com/scrapinghub/portia</a>
I've used Scrapy and it's the easiest and most powerful scraping tool I've come across. This is so awesome. Since it is based on Scrapy, I guess it should be possible to do the basic stuff with this tool and then take care of the nastier details directly in the code. I'll try it for my next scraping project.
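By "taking care of the nastier details in the code" I mean falling back to a plain hand-written Scrapy spider for the awkward parts, something like this (spider name, domain and selectors are all made up):

    import scrapy

    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["http://www.example.com/products"]

        def parse(self, response):
            # follow each product link found on the listing page
            for href in response.css("a.product::attr(href)").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_product)

        def parse_product(self, response):
            yield {
                "title": response.css("h1::text").extract_first(),
                "price": response.css(".price::text").extract_first(),
            }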
I like that there are people working to make scraping easier and friendlier for everyone. Sadly (IMHO), the sites where these tools will probably fail are at the same time the ones that aren't really open to providing the data directly. Most scraper-unfriendly sites make you request another page first to capture a cookie, set cookies or a Referer entry on the request headers, or manually use regex magic to extract information from JavaScript code in the HTML. I guess it's just a matter of time before some tool provides such methods, though.<p>For my project I write all the scrapers manually (that is, in Python, using requests and the amazing lxml) because there's always one source that will make you build the whole architecture around it. Something I find is needed for public APIs is a domain-specific language that avoids building intermediate servers by explaining to the engine how to understand a data source:<p>An API producer wants to keep serving the data themselves (traffic, context and statistics), but a consumer wants a standard way of accessing more than one source (let's say, 140 different sources). Instead of building an intermediate service that provides this standardized version, one could provide templates that a client module would use to understand each data source under the same abstraction.<p>The data consumer would access the source server directly, and the producer would not need to ban over 9000 different scrapers. Of course, this would only make sense for public APIs; (real) scraping should never be done on the client: it is slow, crash-prone, and can breach security on the device.
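A rough sketch of what I mean by a template a client module could interpret (the URL, field names and XPaths are all made up, and a real template language would also need auth, pagination, JS handling, etc.):

    import requests
    from lxml import html

    # declarative description of one source
    TEMPLATE = {
        "url": "http://www.example.com/listings",
        "item_xpath": "//div[@class='listing']",
        "fields": {
            "title": ".//h2/text()",
            "price": ".//span[@class='price']/text()",
        },
    }

    # generic client module: fetches a source and yields items under one abstraction
    def fetch(template):
        tree = html.fromstring(requests.get(template["url"]).content)
        for node in tree.xpath(template["item_xpath"]):
            yield {name: node.xpath(xp) for name, xp in template["fields"].items()}

    for item in fetch(TEMPLATE):
        print(item)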
Cool tool for developers, but since this one is open source, I think it opens up even more interesting possibilities for these tools to be integrated as part of a consumer app. Curation is the next big trend, right? I think I'll give that a try.
I just took it for a test drive and it was an absolute pleasure. I tried to scrape all job listings at <a href="https://hasjob.co" rel="nofollow">https://hasjob.co</a> hoping to find trends.<p>There is one small pain: the output is printed to the console, and piping the output to a file isn't working. But it did fetch all the pages and printed nice JSON.<p>UPDATE: there is a logfile setting that dumps the output to a file
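For anyone else looking: since Portia runs on Scrapy, I'm assuming the standard Scrapy settings apply (I haven't checked whether Portia overrides them). These are the ones I'd reach for:

    # settings.py
    LOG_FILE = "crawl.log"     # send the log (which echoes scraped items) to a file
    FEED_FORMAT = "json"       # write scraped items as JSON...
    FEED_URI = "items.json"    # ...into this file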
I have a project that involves a huge list of websites which must be scraped heavily. My question is...
Are these kinds of tools suitable for 'heavy lifting', i.e. scraping hundreds of thousands of pages?
Can anyone give a real-life example where this visual tool would be useful? Not that I don't believe in scraping (we do it too: <a href="https://github.com/brandicted/scrapy-webdriver" rel="nofollow">https://github.com/brandicted/scrapy-webdriver</a>). I know Google has a similar tool called Data Highlighter (in Google Webmaster Tools), which non-technical webmasters use to tell Googlebot where to find the structured data in the page source of a website. It makes sense at Google's scale, but I fail to see in which other cases this would be useful considering the drawbacks: some pages may have a different structure, JavaScript isn't always properly loaded, etc., therefore requiring the intervention of a technical person...
This is great. However, I have one bone to pick (or rather, I'd like to know whether it's been taken care of).
Scrapy uses XPaths or equivalent representations to scrape. However, there are many alternative XPaths that can represent the same div.
For example, suppose data is to be extracted from the fifth div in a sequence of divs, so it would use that position as the XPath. But now say the div also has a meaningful class or id attribute. An XPath based on that attribute might be a better choice, because the content may not be in the fifth div across all the pages of the site I want to scrape (see the sketch below).
Is this taken care of by taking the common denominator from many sample pages?
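To make the concern concrete (toy HTML, not from any real page):

    from lxml import html

    page = html.fromstring("""
    <html><body>
      <div>ad</div><div>nav</div><div>sidebar</div><div>promo</div>
      <div class="article-body">the content I actually want</div>
    </body></html>
    """)

    by_position  = page.xpath("//div[5]/text()")                      # brittle: breaks if a div is added above
    by_attribute = page.xpath("//div[@class='article-body']/text()")  # robust across page variations

    print(by_position)   # ['the content I actually want']
    print(by_attribute)  # ['the content I actually want']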
Excellent. But the example presented in the video (scraping new articles) is actually a case better solved with other technologies.<p>I imagine this will be useful when scraping sites like IMDB in cases where they don't have an API or their API is not useful enough.
Although this is cool, the ultimate scraper would probably need to be embedded in a browser somehow and be able to access the JS engine and DOM, either as a plugin or as some other extension, depending on the browser.
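Not a browser plugin, but the closest approximation I know of from the outside: drive a real browser over WebDriver and poke at its live DOM and JS engine (the URL and the JS snippet below are just placeholders):

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://www.example.com/")

    # run arbitrary JS inside the page and pull structured data back out
    links = driver.execute_script(
        "return Array.prototype.map.call("
        "document.querySelectorAll('a'), function (a) { return a.href; });"
    )
    print(links)

    driver.quit()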
From the video, I noticed that the HTML tags were also scraped in the large article text. Is there some way to remove those automatically? Or perform further processing?
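Since it's Scrapy underneath, I'd guess (haven't verified what Portia itself offers) that you could strip the tags in an item pipeline; 'article' below is a made-up field name:

    from w3lib.html import remove_tags  # w3lib ships with Scrapy

    class StripHtmlPipeline(object):
        def process_item(self, item, spider):
            if item.get("article"):
                item["article"] = remove_tags(item["article"])
            return item

and then enable it through ITEM_PIPELINES in the settings.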
I really dig these scrapers, but most of them seem to only work well for simple sites, as someone has already noted.<p>Just want to point out a (commercial but reasonable) program that really works well for all of our odd edge-case customer site issues.<p><a href="http://www.visualwebripper.com" rel="nofollow">http://www.visualwebripper.com</a>
Here's an open-source web-scraping GUI I wrote a while back:
<a href="https://github.com/jjk3/scrape-it-screen-scraper" rel="nofollow">https://github.com/jjk3/scrape-it-screen-scraper</a><p>I'm still integrating the browser engine which I was able to procure for open source purposes.<p>The video is quite old.