Where I work we prefer jQuery to the native DOM API for scraping. It really speeds up the process of extracting data.<p>For example with Puppeteer you can do page.injectFile("jquery-3.2.1.min.js"). I think that would simplify your evaluate() calls.<p>It would also be easy to speed up the whole process by doing a single evaluate() call per page with all your scraping code in it.<p>BTW we just released an article with tips & tricks for Headless Chrome: <a href="https://blog.phantombuster.com/web-scraping-in-2017-headless-chrome-tips-tricks-4d6521d695e8" rel="nofollow">https://blog.phantombuster.com/web-scraping-in-2017-headless...</a> What do you think?
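A minimal sketch of that single-evaluate idea (selector names .item/.title/.price are made up for illustration; addScriptTag is the current Puppeteer API for injecting a script file, injectFile was the pre-1.0 name):

```javascript
// Extraction logic kept as one pure function. It runs inside the page via a
// single evaluate() call; the optional doc argument only exists so the same
// function can be exercised outside a browser.
function extractItems(doc) {
  doc = doc || document;
  return Array.from(doc.querySelectorAll('.item')).map(function (el) {
    return {
      title: (el.querySelector('.title') || {}).textContent || '',
      price: (el.querySelector('.price') || {}).textContent || ''
    };
  });
}

async function scrape(url) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  // Inject jQuery once if the extraction code wants it...
  await page.addScriptTag({ path: 'jquery-3.2.1.min.js' });
  // ...then do all the scraping in a single round-trip to the page.
  const items = await page.evaluate(extractItems);
  await browser.close();
  return items;
}
```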
> Since the official announcement of Chrome Headless, many of the industry standard libraries for automated testing have been discontinued by their maintainers. The prominent of these are PhantomJS and Selenium IDE for Firefox.<p>Correct me if I'm wrong, but if I'm not mistaken Selenium IDE was discontinued due to a lack of maintainers, and that has little if any relation to Chrome Headless.<p>The IDE is just a more effective way of programming test behavior; the Selenium WebDriver is still up and working with straight code (as is the case in this tutorial).
Great tutorial! Also, you look like a full-stack developer. How's the reception for the HospitalRun software you worked on? (<a href="https://github.com/HospitalRun/hospitalrun-frontend" rel="nofollow">https://github.com/HospitalRun/hospitalrun-frontend</a>)<p>"Somewhat similar is the case with Internet that we traversed today in quest of data."
Two things:<p>1. Please do not test a web app with Chrome only; we don't want to go back to a world with a single browser.<p>2. > So, until puppeteer supports this, we will rely on jsdom, a package available via npm<p>JSDOM is not just a package on npm, it's a piece of engineering art.
Ooooh I read this fantastic introduction the other day and wrote this wee HN demo using Cheerio. <a href="https://gist.github.com/veb/c1beab69b5eb1b07123e5eaf55b80320" rel="nofollow">https://gist.github.com/veb/c1beab69b5eb1b07123e5eaf55b80320</a>
Do you know if it is possible to render a page without serving it from a web server? For example, I have the HTML of one page of my domain generated by a test, and I would like to use Puppeteer to render it without setting up an HTTP server. Ideally I would give page.goto a URL plus a string with the HTML and let it render the page as if it came from the real server.<p>I guess I can cheat by intercepting the request and responding with the HTML I already have, but I wonder if something like this already exists.
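Assuming a reasonably recent Puppeteer, the interception cheat can be sketched like this (setRequestInterception and request.respond are the current API names; page.setContent(html) also exists, but it doesn't let you choose the URL):

```javascript
// Pure helper: decide how to answer an intercepted request, given a map of
// URL -> canned HTML. Returns null for URLs we don't want to fake.
function makeResponder(htmlByUrl) {
  return function (url) {
    if (Object.prototype.hasOwnProperty.call(htmlByUrl, url)) {
      return { status: 200, contentType: 'text/html', body: htmlByUrl[url] };
    }
    return null;
  };
}

async function renderLocal(url, html) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const respond = makeResponder({ [url]: html });
  await page.setRequestInterception(true);
  page.on('request', function (req) {
    const answer = respond(req.url());
    if (answer) req.respond(answer);
    else req.continue();
  });
  // The page now renders our string as if the real server had sent it.
  await page.goto(url);
  await browser.close();
}
```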
I am writing almost the same thing but for PDF [1].
But I am having trouble with scaling.<p>I managed to make it run inside a Docker container.<p>At the moment the example in the repo just returns a blank PDF, but the problem is in the API gateway.<p>[1] <a href="https://github.com/tecnospeed/pastor" rel="nofollow">https://github.com/tecnospeed/pastor</a>
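For what it's worth, blank PDFs in Docker are often a sandbox or timing issue; a sketch of the usual workarounds (the two flags are real Chromium flags commonly needed when Chromium runs as root in a container; the rest is illustrative, with option names per current Puppeteer):

```javascript
// Launch flags typically required for Chromium inside a container.
function dockerLaunchArgs() {
  return ['--no-sandbox', '--disable-setuid-sandbox'];
}

async function htmlToPdf(url, outPath) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ args: dockerLaunchArgs() });
  const page = await browser.newPage();
  // Waiting for network idle avoids snapshotting a half-loaded (blank) page.
  await page.goto(url, { waitUntil: 'networkidle2' });
  await page.pdf({ path: outPath, format: 'A4' });
  await browser.close();
}
```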
A simple option for web scraping is just to use the developer console in a real web browser.<p>I have a repo outlining the basics here: <a href="https://github.com/jawj/web-scraping-for-researchers" rel="nofollow">https://github.com/jawj/web-scraping-for-researchers</a>
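The console approach can be as small as this — a sketch meant to be pasted into the DevTools console on the target page (copy() is a DevTools utility, not standard JavaScript):

```javascript
// Collect every link's text and destination from a document. Written as a
// function of doc so the same logic can also be tested outside a browser.
function collectLinks(doc) {
  return Array.from(doc.querySelectorAll('a[href]')).map(function (a) {
    return { text: (a.textContent || '').trim(), href: a.href };
  });
}

// In the DevTools console on the target page:
//   copy(JSON.stringify(collectLinks(document), null, 2));
// ...and the scraped data is on your clipboard, ready to paste into a file.
```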
Tried Puppeteer, it's pretty awesome. I'm a newbie in terms of scraping, but thus far it's been a pleasant experience with this tool. Anyone used artoo.js with Puppeteer successfully?
Is it possible to call const browser = await puppeteer.launch(); multiple times in the same Node.js process? I haven't found any information about that.
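Yes — each puppeteer.launch() call just spawns another Chromium process, so several can coexist in one Node.js process (though multiple pages from browser.newPage() in a single browser are usually cheaper). A sketch, with a small helper to spread URLs across browsers:

```javascript
// Pure helper: deal urls out to n workers round-robin.
function partition(urls, n) {
  const buckets = Array.from({ length: n }, function () { return []; });
  urls.forEach(function (url, i) { buckets[i % n].push(url); });
  return buckets;
}

async function scrapeInParallel(urls, n) {
  const puppeteer = require('puppeteer');
  // Several independent Chromium instances in one Node.js process.
  const browsers = await Promise.all(
    Array.from({ length: n }, function () { return puppeteer.launch(); })
  );
  const buckets = partition(urls, n);
  // ...have each browser work through its own bucket of URLs here...
  await Promise.all(browsers.map(function (b) { return b.close(); }));
}
```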