Show HN: Getting started with Puppeteer and Chrome Headless for Web Scraping

142 points by emadehsan over 7 years ago

15 comments

paps over 7 years ago
Where I work we prefer jQuery to the native DOM API for scraping. It really speeds up the process of extracting data.

For example, with Puppeteer you can do page.injectFile("jquery-3.2.1.min.js"). I think that would simplify your evaluate() calls.

It would also be easy to speed up the whole process by doing a single evaluate() call per page with all your scraping code in it.

BTW we just released an article with tips & tricks for Headless Chrome: https://blog.phantombuster.com/web-scraping-in-2017-headless-chrome-tips-tricks-4d6521d695e8 What do you think?
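The single evaluate() call suggested above could look roughly like the sketch below. It assumes `page` is a Puppeteer Page that has already navigated to the target; the function name `scrapeStories` and the selector `a.storylink` are hypothetical and would need to match the real markup. It uses the native DOM API rather than jQuery to keep the example dependency-free.

```javascript
// Sketch: collect every field in one page.evaluate() round trip
// instead of issuing one evaluate() call per field.
// Assumptions: `page` is a Puppeteer Page; "a.storylink" is a hypothetical selector.
async function scrapeStories(page) {
  return page.evaluate(() => {
    // This callback runs inside the browser, so only DOM APIs are available here.
    const links = Array.from(document.querySelectorAll("a.storylink"));
    // The return value must be serializable so it can cross back into Node.js.
    return links.map((link) => ({
      title: link.textContent.trim(),
      url: link.href,
    }));
  });
}
```

Because the whole extraction happens in one call, there is a single round trip between Node.js and the browser per page, which is where the speed-up would come from.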
Giroflex over 7 years ago
> Since the official announcement of Chrome Headless, many of the industry standard libraries for automated testing have been discontinued by their maintainers. The prominent of these are PhantomJS and Selenium IDE for Firefox.

Correct me if I'm wrong, but if I'm not mistaken, Selenium IDE was discontinued due to a lack of maintainers, and that has little if any relation to Chrome Headless.

The IDE is just a more effective way of programming test behavior; the Selenium WebDriver is still up and working with straight code (as is the case in this tutorial).
ankit84 over 7 years ago
Great tutorial! Also, you look like a full-stack developer. How's the reception for the HospitalRun software you worked on? (https://github.com/HospitalRun/hospitalrun-frontend)

"Somewhat similar is the case with Internet that we traversed today in quest of data."
twsted over 7 years ago
Two things:

1. Please do not test a web app with Chrome only; we don't want to go back to a world with a single browser.

2. > So, until puppeteer supports this, we will rely on jsdom, a package available via npm

JSDOM is not just a package on npm; it's a piece of engineering art.
veb over 7 years ago
Ooooh I read this fantastic introduction the other day and wrote this wee HN demo using Cheerio. https://gist.github.com/veb/c1beab69b5eb1b07123e5eaf55b80320
testcross over 7 years ago
Do you know if it is possible to render a page without serving it from a web server? For example, I have the HTML of one page of my domain generated by a test. I would like to use Puppeteer to render it, but I don't want to set up an HTTP server for this. I would like to give a string with the HTML plus a URL to page.goto and let it render the page as if it came from the real server.

I guess I can cheat by intercepting the request and responding with the HTML I already have. But I wonder if something like this already exists.
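The interception idea described above might be sketched as follows. This assumes a recent Puppeteer release in which `page.setRequestInterception(true)`, `request.respond()` and `request.continue()` are available (the API was different in the early versions current when this thread was written); the helper name `renderHtmlAt` is hypothetical. If the URL does not matter at all, `page.setContent(html)` is the simpler route.

```javascript
// Sketch: make page.goto(url) render an in-memory HTML string,
// as if the document had been served from `url`.
// Assumptions: `page` is a Puppeteer Page from a recent release;
// `renderHtmlAt` is a hypothetical helper name.
async function renderHtmlAt(page, url, html) {
  // Normalize so "https://example.com" matches the browser's "https://example.com/".
  const target = new URL(url).href;

  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (request.url() === target) {
      // Answer the main document request from memory; no server involved.
      request.respond({ status: 200, contentType: "text/html", body: html });
    } else {
      // Let every other request (scripts, images, ...) go to the network.
      request.continue();
    }
  });

  await page.goto(target);
}
```

Relative links and same-origin checks in the page then resolve against `url`, which is the main reason to prefer interception over loading the string directly.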
garou over 7 years ago
I am writing almost the same thing but for PDF [1]. But I am having trouble with scaling.

I was able to make it run inside a Docker container.

At the moment the example in the repo just returns a blank PDF, but the problem is at the API Gateway.

[1] https://github.com/tecnospeed/pastor
MrBlue over 7 years ago
Puppeteer is definitely cool, but on a recent project I had to go back to using NightmareJS as I needed to download files.
gmac over 7 years ago
A simple option for web scraping is just to use the developer console in a real web browser.

I have a repo outlining the basics here: https://github.com/jawj/web-scraping-for-researchers
jasan_s over 7 years ago
Tried Puppeteer, it's pretty awesome. I'm a newbie in terms of scraping, but thus far it's been a pleasant experience with this tool. Has anyone used artoo.js with Puppeteer successfully?
testcross over 7 years ago
Is it possible to call const browser = await puppeteer.launch(); multiple times in the same Node.js process? I haven't found any information about that.
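As far as I know, each puppeteer.launch() call starts its own independent browser process, so calling it several times from one Node.js process works. A minimal sketch of that pattern follows; `withBrowsers` is a hypothetical helper, and the `puppeteer` module is passed in as a parameter so the function itself has no hard dependency.

```javascript
// Sketch: launch several independent browsers from one Node.js process,
// run a task against each, and always close them afterwards.
// Assumptions: `puppeteer` is the imported puppeteer module;
// `withBrowsers` and `task` are hypothetical names.
async function withBrowsers(puppeteer, count, task) {
  // Each launch() call is expected to start a separate browser process.
  const browsers = await Promise.all(
    Array.from({ length: count }, () => puppeteer.launch())
  );
  try {
    // Run the task once per browser, in parallel, and collect the results.
    return await Promise.all(browsers.map((browser, index) => task(browser, index)));
  } finally {
    // Close every browser even if a task throws, to avoid orphaned processes.
    await Promise.all(browsers.map((browser) => browser.close()));
  }
}
```

Each browser is a full Chrome process, so memory use grows with `count`; opening several pages in a single browser is usually the lighter option when isolation between them is not required.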
naveedahmada036 over 7 years ago
I can write a mini script to scrape emails and GitHub; what's all the hype about?
dchuk over 7 years ago
Correction: it's "scraping"
desireco42 over 7 years ago
I tried it out when it was released; it works well and is decently fast.
kasbah over 7 years ago
Seems like most of the parsing is done by JSDOM in this tutorial.