Portia, an open-source visual web scraper

367 points · by pablohoffman · about 11 years ago

21 comments

dabeeeenster · about 11 years ago
The problem with these sorts of solutions is that they work perfectly for 'simple' sites like The Register, but fail hard on 'modern' sites like, e.g., ASOS.com. I just tried ASOS and the web front end failed to request a product page correctly...

All the dynamic JS and whatnot plays havoc with these projects. In my experience you have to run through WebDriver or something like PhantomJS and parse the JS...
bsilvereagle · about 11 years ago
I expected an April Fools' joke and found something pleasantly awesome and useful instead.

Source is here: https://github.com/scrapinghub/portia
climatewarrior2 · about 11 years ago
I've used Scrapy, and it is the easiest and most powerful scraping tool I've used. This is so awesome. Since this is based on Scrapy, I guess it should be possible to do the basic stuff with this tool and then take care of the nastier details directly in code. I'll try it for my next scraping project.
kh_hk · about 11 years ago
I like that there are people working to make scraping easier and friendlier for everyone. Sadly (IMHO), the cases where these tools will probably fail are the same sites that aren't really open to providing their data directly. Most scraper-unfriendly sites make you request another page first to capture a cookie, set cookies or a Referer entry on the request headers, or resort to regex magic to extract information from JavaScript code in the HTML. I guess it's just a matter of time before a tool provides such methods, though.

For my project I write all the scrapers manually (that is, in Python, with requests and the amazing lxml), because there's always one source that will make you build the whole architecture around it. Something I find is needed for public APIs is a domain-specific language that works around building intermediate servers by explaining to the engine how to understand a data source.

An API producer wants to keep serving the data themselves (traffic, context, and statistics), but someone wants a standard way of accessing more than one source (say, 140 different sources). If only, instead of building an intermediate service that provides this standardized version, one could provide templates that a client module would use to understand the data under the same abstraction.

The data consumer would access the source server directly, and the producer would not need to ban over 9000 different scrapers. Of course, this would only make sense for public APIs. (Real) scraping should never be done on the client: it is slow, crashes, and can breach security on the device.
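A rough sketch of the manual approach this commenter describes, using only Python's standard library in place of requests and lxml (the URL, headers, and inline markup here are hypothetical, for illustration only):

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

class LinkScraper(HTMLParser):
    """Collect (href, text) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def scrape_links(html):
    parser = LinkScraper()
    parser.feed(html)
    return parser.links

def fetch(url):
    # Some sites require a cookie or Referer captured from a prior request,
    # exactly as described above; those would be set on the headers here.
    req = Request(url, headers={"User-Agent": "my-scraper/0.1",
                                "Referer": "https://example.com/"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Usage on an inline snippet (no network needed):
page = '<div><a href="/jobs/1">Python dev</a><a href="/jobs/2">Go dev</a></div>'
print(scrape_links(page))  # [('/jobs/1', 'Python dev'), ('/jobs/2', 'Go dev')]
```

In practice lxml's XPath support makes the extraction half of this far more pleasant, which is presumably why the commenter reaches for it.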
compare · about 11 years ago
Cool tool for developers, but since this one is open source, I think it opens up even more interesting possibilities for these tools to be integrated into a consumer app. Curation is the next big trend, right? I think I'll give that a try.
anilshanbhag · about 11 years ago
I just took it for a test drive and it was an absolute pleasure. I tried to scrape all job listings at https://hasjob.co hoping to find trends.

There is one small pain: the output is printed to the console, and piping the output to a file isn't working. But it did fetch all the pages and printed nice JSON.

UPDATE: there is a logfile setting to dump output to a file.
emilsedgh · about 11 years ago
I have a project that includes a huge list of websites which must be scraped heavily. My question is: are these kinds of tools suitable for 'heavy lifting', i.e. scraping hundreds of thousands of pages?
jstoiko · about 11 years ago
Can anyone give a real-life example where this visual tool would be useful? Not that I don't believe in scraping (we do it too: https://github.com/brandicted/scrapy-webdriver). I know Google has a similar tool called Data Highlighter (in Google Webmaster Tools), which non-technical webmasters use to tell Googlebot where to find the structured data in a website's page source. It makes sense at Google's scale; however, I fail to see in which other cases this would be useful, considering the drawbacks: some pages may have a different structure, JavaScript is not always properly loaded, etc., therefore requiring the intervention of a technical person...
ashwing_2005 · about 11 years ago
This is great. However, I have one bone to pick (or rather, I'd like to know whether it's been taken care of). Scrapy uses XPaths or equivalent representations to scrape. However, there are many alternate XPaths that represent the same div. For example, suppose data is to be extracted from the fifth div in a sequence of divs, so it uses that as the XPath. But now say the div also has a meaningful class or id attribute. An XPath based on that attribute might be a better choice, because the content may not be in the fifth div across all the pages of the site I want to scrape. Is this taken care of by taking the common denominator from many sample pages?
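The brittleness this commenter raises can be shown with Python's standard library, whose xml.etree module supports a small XPath subset including positional and attribute predicates (the markup below is hypothetical):

```python
import xml.etree.ElementTree as ET

# Two pages with the same logical content, but the "price" div sits at a
# different position in each.
page_a = ET.fromstring(
    "<body><div/><div/><div/><div/><div id='price'>19.99</div></body>")
page_b = ET.fromstring(
    "<body><div/><div id='price'>24.99</div><div/></body>")

# Positional XPath: brittle, only matches when the target is literally
# the fifth div.
print(page_a.findtext("div[5]"))  # 19.99
print(page_b.findtext("div[5]"))  # None -- breaks on page_b

# Attribute-based XPath: survives layout changes across pages.
print(page_a.findtext("div[@id='price']"))  # 19.99
print(page_b.findtext("div[@id='price']"))  # 24.99
```

Whether a template-generating tool prefers the attribute form, as the commenter hopes, depends on the tool; generalizing from several annotated sample pages is one way to pick the selector that holds across all of them.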
esolyt · about 11 years ago
Excellent. But the example presented in the video (scraping news articles) is actually a case better solved with other technologies.

I imagine this will be useful when scraping sites like IMDb in case they don't have an API, or their API is not useful enough.
kelvin0 · about 11 years ago
Although this is cool, the ultimate scraper would probably need to be embedded in a browser somehow and be able to access the JS engine and DOM, as a plugin or some other extension depending on the browser.
oblio · about 11 years ago
Totally off topic, but what's the name of the song in the video? :)
rpedela · about 11 years ago
From the video, I noticed that the HTML tags were also scraped in the large article text. Is there some way to remove those automatically? Or perform further processing?
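As a sketch of the kind of post-processing being asked about, stripping residual tags from a scraped field can be done with Python's standard library alone (the sample string is hypothetical; Scrapy's companion w3lib library also ships a remove_tags helper for the same job):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Accumulate only the text content, dropping all markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    return "".join(stripper.chunks)

scraped = "<p>Portia is a <b>visual</b> scraping tool.</p>"
print(strip_tags(scraped))  # Portia is a visual scraping tool.
```

This could run as a pipeline step after extraction, so items are cleaned once rather than per selector.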
alttab · about 11 years ago
This is cool. Can I use it locally on internal sites too?
th0ma5 · about 11 years ago
Outside of this tool, or a tool that uses a scripted browser, another option could be Sikuli in a VM.
beernutz · about 11 years ago
I really dig these scrapers, but most of them seem to only work well for simple sites, as someone has already noted.

I just want to point out a (commercial but reasonably priced) program that really works well for all our odd edge-case customer site issues: http://www.visualwebripper.com
viana007 · about 11 years ago
This solution resembles PyQuery, but with a visual interface.
kclay · about 11 years ago
Love this. I've been using Scrapy for all my scraping needs.
rpedela · about 11 years ago
Is there a live demo available?
taskstrike · about 11 years ago
Import.io, Kimono Labs, and now this. The web scraper -> data area is heating up.
notastartup · about 11 years ago
Here's an open-source web scraping GUI I wrote a while back: https://github.com/jjk3/scrape-it-screen-scraper

I'm still integrating the browser engine, which I was able to procure for open-source purposes.

The video is quite old.