Web Scraping with Electron

57 points by tazeg95 about 6 years ago

5 comments

Dunedan about 6 years ago
> Is there a better way to surf the web, retrieve the source code of the pages and extract data from them?

Yes, of course! To get the source code of a web site you don't need a browser and all its complexity. It makes me so sad how far we have come in terms of unnecessary complexity for simple tasks.

If you want to extract data from web pages without requiring hundreds of megabytes for something like Electron, there are lots of scraping libraries out there. There are, for example, at least two good Python implementations: Scrapy[1] and BeautifulSoup[2].

[1]: https://scrapy.org/

[2]: https://www.crummy.com/software/BeautifulSoup/
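
As a rough illustration of the browser-free approach described above, here is a minimal sketch that pulls links out of an HTML string using only Python's standard library. `html.parser` stands in for Scrapy or BeautifulSoup here (a swapped-in technique, not either library's API), and `LinkExtractor`, `extract_links`, and the sample markup are hypothetical names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Parse an HTML string and return the absolute URLs of its links."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page. In practice the HTML would come from an HTTP
# client, for example urllib.request.urlopen(url).read().decode().
SAMPLE = '<p><a href="/docs">Docs</a> and <a href="https://scrapy.org/">Scrapy</a></p>'
print(extract_links(SAMPLE, "https://example.com/"))
```

This only sees markup present in the downloaded source; anything a page adds later through JavaScript is outside its reach.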

TicklishTiger about 6 years ago
I wish there was an easy way to send commands to the console of a browser.

That would be all I need to satisfy all my browser automation tasks.

Without installing and learning any frameworks.

Say there was a linux command 'SendToChromium' that would do that for Chromium. Then to navigate to some page one could simply do this:

    SendToChromium location.href="/somepage.html"

SendToChromium should return the output of the command. So to get the html of the current page, one would simply do:

    SendToChromium document.body.innerHTML > theBody.html

Ideally the browser would listen for this type of command on a local port. So instead of needing a binary 'SendToChromium' one could simply start Chromium in listening mode:

    chromium --listen 12345

And then talk to it via http:

    curl 127.0.0.1:12345/execute?command=location.href="/somepage.html"
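
Chromium already offers something close to this wish, though not the `--listen` flag or `/execute` endpoint imagined above: started with `--remote-debugging-port=9222`, it accepts Chrome DevTools Protocol messages over a WebSocket, and the `Runtime.evaluate` method runs a JavaScript expression in the page. The sketch below only builds and decodes the JSON messages; actually sending them needs a WebSocket client, which is left out. The helper names `build_evaluate_command` and `read_result`, and the canned reply, are assumptions for illustration.

```python
import json


def build_evaluate_command(message_id, expression):
    """Serialize a DevTools Protocol Runtime.evaluate request."""
    return json.dumps({
        "id": message_id,
        "method": "Runtime.evaluate",
        "params": {"expression": expression, "returnByValue": True},
    })


def read_result(raw_reply):
    """Return the evaluated value from a Runtime.evaluate reply, or raise."""
    reply = json.loads(raw_reply)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "protocol error"))
    return reply["result"]["result"].get("value")


# The two commands from the comment, expressed as protocol messages.
navigate = build_evaluate_command(1, 'location.href="/somepage.html"')
get_body = build_evaluate_command(2, "document.body.innerHTML")

# Canned reply for illustration only; a real one arrives over the WebSocket.
canned = '{"id": 2, "result": {"result": {"type": "string", "value": "<p>hi</p>"}}}'
print(read_result(canned))
```

The WebSocket address for each open tab is listed at `http://127.0.0.1:9222/json` once the browser is running with that flag.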

aboutruby about 6 years ago
Interesting but seems less powerful than my current setup:

- I have mitmproxy to capture the traffic / manipulate the traffic

- I have Chrome opened with Selenium/Capybara/chromedriver and using mitmproxy

- I then browse to the target pages, it records the selected requests and the selected responses

- It then replays the requests until they fail (with a delay)

I highly recommend mitmproxy, it's extremely powerful: capture traffic, send responses without hitting the server, block/hang requests, modify responses, modify requests/responses headers.

Then higher level interfaces can be built on top, Selenium allows you to load Chrome extensions and execute Javascript on any page for instance. You can also manage many tabs at the same time.

I could make a blog post/demo if people are interested
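
The last step of that workflow, replaying a recorded request with a delay until it fails, can be sketched roughly as follows. This is not mitmproxy's API or the commenter's code: `replay_until_failure`, the `send` callback, and the fake request are hypothetical, and a real setup would pass in a function that re-issues the captured request over the network.

```python
import time


def replay_until_failure(send, request, delay_seconds=1.0,
                         max_attempts=100, sleep=time.sleep):
    """Re-send `request` until `send` reports failure.

    `send` takes the recorded request and returns True on success.
    Returns how many replays succeeded before the first failure
    (or before `max_attempts` was reached).
    """
    successes = 0
    for _ in range(max_attempts):
        if not send(request):
            break
        successes += 1
        sleep(delay_seconds)  # pause between replays
    return successes


# Hypothetical stand-in for the network: succeeds three times, then fails.
outcomes = iter([True, True, True, False])


def fake_send(request):
    return next(outcomes)


recorded = {"method": "GET", "url": "https://example.com/api/items"}
count = replay_until_failure(fake_send, recorded,
                             delay_seconds=0, sleep=lambda s: None)
print(count)  # 3
```

Passing `sleep` in as a parameter keeps the loop testable without real waiting; the `max_attempts` cap is an added safeguard, not something the comment mentions.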

CGamesPlay about 6 years ago
I'm going to plug my app that does scraping with Electron: https://github.com/CGamesPlay/chronicler

To the commenters who don't understand why this is necessary:

- It reliably loads linked resources in a WYSIWYG fashion, including embedded media and other things that have to be handled in an ad-hoc fashion when using something like BeautifulSoup.

- It handles resources loaded through JavaScript, including HTML5 History API changes.
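
To make the two points above concrete, here is a small sketch of what the ad-hoc handling looks like with a static parser, and what it misses. It uses Python's standard `html.parser` as a stand-in for BeautifulSoup; `RESOURCE_ATTRS`, `ResourceFinder`, `find_resources`, and the sample page are all hypothetical, and the tag list is deliberately incomplete.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag/attribute pairs that point at sub-resources. Every new kind of
# embedded media needs another hand-written entry here.
RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href", "video": "src"}


class ResourceFinder(HTMLParser):
    """Collects sub-resource URLs declared directly in the markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append(urljoin(self.base_url, value))


def find_resources(html, base_url):
    finder = ResourceFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.resources


PAGE = """
<link rel="stylesheet" href="style.css">
<img src="/logo.png">
<script>
  var img = document.createElement("img");
  img.src = "/late.png";  // only exists after the script runs
  document.body.appendChild(img);
</script>
"""
found = find_resources(PAGE, "https://example.com/app/")
print(found)
```

The stylesheet and logo are found, but `/late.png` is not, because it only becomes part of the page once a JavaScript engine executes the script.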

SSchick about 6 years ago
Are there any other advantages over things like webdriver or puppeteer?