The problem with these sorts of solutions is that they work perfectly for 'simple' sites like The Register, but fail hard with 'modern' sites like ASOS.com. I just tried ASOS and the web front end failed to request a product page correctly...<p>All the dynamic JS and whatnot just plays havoc with these projects. In my experience you have to run the page through WebDriver or something like PhantomJS, let the JS execute, and parse the resulting DOM...
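Something along these lines is what I mean (just a sketch; the URL and XPath are placeholders, not real ASOS markup):

    # Let a headless browser execute the page's JS, then parse the rendered DOM.
    from selenium import webdriver
    from lxml import html

    driver = webdriver.PhantomJS()  # or webdriver.Firefox() / webdriver.Chrome()
    driver.get("http://www.example.com/some-product-page")

    tree = html.fromstring(driver.page_source)  # the DOM *after* the JS has run
    print(tree.xpath("//h1/text()"))

    driver.quit()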
I expected an April Fool's joke and found something pleasantly awesome and useful instead.<p>Source is here: <a href="https://github.com/scrapinghub/portia" rel="nofollow">https://github.com/scrapinghub/portia</a>
I've used Scrapy and it's the easiest and most powerful scraping tool I've come across. This is so awesome. Since it is based on Scrapy, I guess it should be possible to do the basic stuff with this tool and then take care of the nastier details directly in the code. I'll try it for my next scraping project.
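By "taking care of the nastier details in the code" I mean falling back to a plain hand-written Scrapy spider for the awkward parts, something like this (spider name, domain and selectors are all made up):

    import scrapy

    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["http://www.example.com/products"]

        def parse(self, response):
            # follow each product link found on the listing page
            for href in response.css("a.product::attr(href)").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_product)

        def parse_product(self, response):
            yield {
                "title": response.css("h1::text").extract_first(),
                "price": response.css(".price::text").extract_first(),
            }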
I like that there are people working to make scraping easier and friendlier for everyone. Sadly (IMHO), the sites where these tools will probably fail are at the same time the ones that aren't really open to providing the data directly. Most scraper-unfriendly sites make you request another page first to capture a cookie, set cookies or a Referer entry on the request headers, or manually use regex magic to extract information from JavaScript code in the HTML. I guess it's just a matter of time before some tool provides such methods, though.<p>For my project I write all the scrapers manually (that is, in Python, using requests and the amazing lxml) because there's always one source that will make you build the whole architecture around it. Something I find is needed for public APIs is a domain-specific language that avoids building intermediate servers by explaining to the engine how to understand a data source:<p>An API producer wants to keep serving the data themselves (traffic, context and statistics), but a consumer wants a standard way of accessing more than one source (let's say, 140 different sources). Instead of building an intermediate service that provides this standardized version, one could provide templates that a client module would use to understand each data source under the same abstraction.<p>The data consumer would access the source server directly, and the producer would not need to ban over 9000 different scrapers. Of course, this would only make sense for public APIs; (real) scraping should never be done on the client: it is slow, crash-prone, and can breach security on the device.
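A rough sketch of what I mean by a template a client module could interpret (the URL, field names and XPaths are all made up, and a real template language would also need auth, pagination, JS handling, etc.):

    import requests
    from lxml import html

    # declarative description of one source
    TEMPLATE = {
        "url": "http://www.example.com/listings",
        "item_xpath": "//div[@class='listing']",
        "fields": {
            "title": ".//h2/text()",
            "price": ".//span[@class='price']/text()",
        },
    }

    # generic client module: fetches a source and yields items under one abstraction
    def fetch(template):
        tree = html.fromstring(requests.get(template["url"]).content)
        for node in tree.xpath(template["item_xpath"]):
            yield {name: node.xpath(xp) for name, xp in template["fields"].items()}

    for item in fetch(TEMPLATE):
        print(item)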
Cool tool for developers, but since this one is open source, I think it opens up even more interesting possibilities for these tools to be integrated as part of a consumer app. Curation is the next big trend, right? I think I'll give that a try.
I just took it for a test drive and it was an absolute pleasure. I tried to scrape all job listings at <a href="https://hasjob.co" rel="nofollow">https://hasjob.co</a> hoping to find trends.<p>There is one small pain: the output is printed to the console, and piping the output to a file isn't working. But it did fetch all the pages and printed nice JSON.<p>UPDATE: there is a logfile setting that dumps the output to a file
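For anyone else looking: since Portia runs on Scrapy, I'm assuming the standard Scrapy settings apply (I haven't checked whether Portia overrides them). These are the ones I'd reach for:

    # settings.py
    LOG_FILE = "crawl.log"     # send the log (which echoes scraped items) to a file
    FEED_FORMAT = "json"       # write scraped items as JSON...
    FEED_URI = "items.json"    # ...into this file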
I have a project that involves a huge list of websites which must be scraped heavily. My question is...
Are these kinds of tools suitable for 'heavy lifting', i.e. scraping hundreds of thousands of pages?
Can anyone give a real-life example where this visual tool would be useful? Not that I don't believe in scraping (we do it too: <a href="https://github.com/brandicted/scrapy-webdriver" rel="nofollow">https://github.com/brandicted/scrapy-webdriver</a>). I know Google has a similar tool called Data Highlighter (in Google Webmaster Tools), which non-technical webmasters use to tell Googlebot where to find the structured data in the page source of a website. It makes sense at Google's scale, but I fail to see in which other cases this would be useful considering the drawbacks: some pages may have a different structure, JavaScript isn't always properly loaded, etc., therefore requiring the intervention of a technical person...
This is great. However, I have one bone to pick (or rather, I'd like to know whether it's been taken care of).
Scrapy uses XPaths or equivalent representations to scrape. However, there are many alternative XPaths that can represent the same div.
For example, suppose data is to be extracted from the fifth div in a sequence of divs, so it would use that position as the XPath. But now say the div also has a meaningful class or id attribute. An XPath based on that attribute might be a better choice, because the content may not be in the fifth div across all the pages of the site I want to scrape (see the sketch below).
Is this taken care of by taking the common denominator from many sample pages?
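To make the concern concrete (toy HTML, not from any real page):

    from lxml import html

    page = html.fromstring("""
    <html><body>
      <div>ad</div><div>nav</div><div>sidebar</div><div>promo</div>
      <div class="article-body">the content I actually want</div>
    </body></html>
    """)

    by_position  = page.xpath("//div[5]/text()")                      # brittle: breaks if a div is added above
    by_attribute = page.xpath("//div[@class='article-body']/text()")  # robust across page variations

    print(by_position)   # ['the content I actually want']
    print(by_attribute)  # ['the content I actually want']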
Excellent. But the example presented in the video (scraping new articles) is actually a case better solved with other technologies.<p>I imagine this will be useful when scraping sites like IMDB in cases where they don't have an API or their API is not useful enough.
Although this is cool, the ultimate scraper would probably need to be embedded in a browser somehow and be able to access the JS engine and DOM, either as a plugin or as some other extension, depending on the browser.
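Not a browser plugin, but the closest approximation I know of from the outside: drive a real browser over WebDriver and poke at its live DOM and JS engine (the URL and the JS snippet below are just placeholders):

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://www.example.com/")

    # run arbitrary JS inside the page and pull structured data back out
    links = driver.execute_script(
        "return Array.prototype.map.call("
        "document.querySelectorAll('a'), function (a) { return a.href; });"
    )
    print(links)

    driver.quit()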
From the video, I noticed that the HTML tags were also scraped in the large article text. Is there some way to remove those automatically? Or perform further processing?
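Since it's Scrapy underneath, I'd guess (haven't verified what Portia itself offers) that you could strip the tags in an item pipeline; 'article' below is a made-up field name:

    from w3lib.html import remove_tags  # w3lib ships with Scrapy

    class StripHtmlPipeline(object):
        def process_item(self, item, spider):
            if item.get("article"):
                item["article"] = remove_tags(item["article"])
            return item

and then enable it through ITEM_PIPELINES in the settings.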
I really dig these scrapers, but most of them seem to only work well for simple sites, as someone has already noted.<p>Just want to point out a (commercial but reasonable) program that really works well for all of our odd edge-case customer site issues.<p><a href="http://www.visualwebripper.com" rel="nofollow">http://www.visualwebripper.com</a>
Here's an open-source web-scraping GUI I wrote a while back:
<a href="https://github.com/jjk3/scrape-it-screen-scraper" rel="nofollow">https://github.com/jjk3/scrape-it-screen-scraper</a><p>I'm still integrating the browser engine which I was able to procure for open source purposes.<p>The video is quite old.