TechEcho

15 comments

paps, over 7 years ago

Where I work we prefer jQuery to the native DOM API for scraping. It really speeds up the process of extracting data.

For example, with Puppeteer you can do page.injectFile("jquery-3.2.1.min.js"). I think that would simplify your evaluate() calls.

It would also be easy to speed up the whole process by doing a single evaluate() call per page with all your scraping code in it.

BTW we just released an article with tips & tricks for Headless Chrome: https://blog.phantombuster.com/web-scraping-in-2017-headless-chrome-tips-tricks-4d6521d695e8 What do you think?
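A minimal sketch of the two suggestions in this comment: inject jQuery once, then collect everything in a single evaluate() call. It assumes a later Puppeteer release, where the injection method is page.addScriptTag rather than the page.injectFile named above. The helper name scrapeTitles, the default file path, and the a.storylink selector are hypothetical and only serve the illustration.

```javascript
// Hypothetical helper: `page` is a Puppeteer Page that has already navigated.
// `jqueryPath` is an assumed local path to a jQuery build.
async function scrapeTitles(page, jqueryPath = 'jquery-3.2.1.min.js') {
  // Inject jQuery into the page (assumption: addScriptTag, the later
  // replacement for the injectFile call mentioned in the comment).
  await page.addScriptTag({ path: jqueryPath });

  // One evaluate() call returns all the data, so there is a single
  // round trip between Node and the browser for this page.
  return page.evaluate(() => {
    // This function runs inside the browser, where `$` is now defined.
    // The selector is a placeholder for whatever the target page uses.
    return $('a.storylink')
      .map((i, el) => ({ title: $(el).text(), url: $(el).attr('href') }))
      .get();
  });
}
```

With a real browser this would be called as `const rows = await scrapeTitles(page);` after `await page.goto(...)`. The returned value has to be serializable, which is why the sketch returns plain objects rather than DOM elements.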

Giroflex, over 7 years ago

> Since the official announcement of Chrome Headless, many of the industry standard libraries for automated testing have been discontinued by their maintainers. The prominent of these are PhantomJS and Selenium IDE for Firefox.

Correct me if I'm wrong, but as far as I know Selenium IDE was discontinued due to a lack of maintainers, and that has little if any relation to Chrome Headless.

The IDE is just a more effective way of programming test behavior; the Selenium WebDriver is still up and working with straight code (as is the case in this tutorial).

ankit84, over 7 years ago

Great tutorial! Also, you look like a full-stack developer. How's the reception for the HospitalRun software you worked on? (https://github.com/HospitalRun/hospitalrun-frontend)

"Somewhat similar is the case with Internet that we traversed today in quest of data."

twsted, over 7 years ago

Two things:

1. Please do not test a web app with Chrome only; we don't want to go back to a world with a single browser.

2. > So, until puppeteer supports this, we will rely on jsdom, a package available via npm

JSDOM is not just a package on npm, it's a piece of engineering art.

veb, over 7 years ago

Ooooh, I read this fantastic introduction the other day and wrote this wee HN demo using Cheerio: https://gist.github.com/veb/c1beab69b5eb1b07123e5eaf55b80320

testcross, over 7 years ago

Do you know if it is possible to render a page without serving it from a web server? For example, I have the HTML of one page of my domain generated by a test. I would like to use Puppeteer to render it, but I don't want to set up an HTTP server for this. I would like to give page.goto a string with the HTML plus a URL and let it render the page as if it came from the real server.

I guess I can cheat by intercepting the request and responding with the HTML I already have. But I wonder if something like this already exists.
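The interception idea described here can be sketched roughly as follows, assuming a recent Puppeteer release in which page.setRequestInterception, request.respond and request.continue are available (they may not match the version discussed in this thread). The helper name renderFromString is hypothetical. Recent releases also offer page.setContent(html), but that only replaces the markup and does not change the page URL, so it does not make the page look like it came from the real server.

```javascript
// Hypothetical helper: serve `html` for `url` without running an HTTP server.
// `page` is assumed to be a Puppeteer Page from a recent release.
async function renderFromString(page, url, html) {
  // Normalize the URL the same way a browser would
  // (for example, "http://example.com" becomes "http://example.com/").
  const target = new URL(url).href;

  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (request.url() === target) {
      // Answer the main document request with the HTML we already have.
      request.respond({ status: 200, contentType: 'text/html', body: html });
    } else {
      // Let every other request (scripts, images, and so on) go through.
      request.continue();
    }
  });

  // Navigation now resolves against the intercepted response.
  await page.goto(url);
}
```

Because the document is delivered for the real URL, relative links and same-origin checks in the page behave as they would against the live server, which is the main reason to prefer interception over simply setting the content.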

garou, over 7 years ago

I am writing almost the same thing, but for PDF [1]. I am having trouble with scaling, though.

I was able to get it running inside Docker.

Right now the example in the repo just returns a blank PDF, but the problem is in the API Gateway.

[1] https://github.com/tecnospeed/pastor

MrBlue, over 7 years ago

Puppeteer is definitely cool, but on a recent project I had to go back to NightmareJS because I needed to download files.

gmac, over 7 years ago

A simple option for web scraping is just to use the developer console in a real web browser.

I have a repo outlining the basics here: https://github.com/jawj/web-scraping-for-researchers

jasan_s, over 7 years ago

Tried Puppeteer; it's pretty awesome. I'm a newbie in terms of scraping, but so far it's been a pleasant experience with this tool. Has anyone used artoo.js with Puppeteer successfully?

testcross, over 7 years ago

Is it possible to call const browser = await puppeteer.launch(); multiple times in the same Node.js process? I haven't found any information about that.
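As far as I know, each puppeteer.launch() call starts its own browser process, so several launches in one Node.js process should be fine. A rough sketch under that assumption follows; the helper name withBrowsers and its parameters are hypothetical, and `launcher` stands for the object returned by require('puppeteer').

```javascript
// Hypothetical helper: launch `count` browsers, run `work` on each one,
// and close them all afterwards, even if one of the tasks throws.
async function withBrowsers(launcher, count, work) {
  // Start all browsers concurrently; each launch() yields a separate instance.
  const browsers = await Promise.all(
    Array.from({ length: count }, () => launcher.launch())
  );
  try {
    // Run the caller's task once per browser and collect the results in order.
    return await Promise.all(browsers.map((browser, i) => work(browser, i)));
  } finally {
    // Always release the browser processes.
    await Promise.all(browsers.map((browser) => browser.close()));
  }
}
```

A call could look like `await withBrowsers(require('puppeteer'), 2, async (browser) => { const page = await browser.newPage(); /* ... */ });`. Each browser is a full process, so for many parallel pages a single browser with several pages is usually the lighter option.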

naveedahmada036, over 7 years ago

I can write a mini script to scrape emails and GitHub; what's all this hype about?

dchuk, over 7 years ago

Correction: it's "scraping"

desireco42, over 7 years ago

I tried it out when it was released; it works well and is decently fast.

kasbah, over 7 years ago

Seems like most of the parsing is done by JSDOM in this tutorial.

Show HN: Getting started with Puppeteer and Chrome Headless for Web Scraping
