
Web Scraping with Electron

57 points by tazeg95, about 6 years ago

5 comments

Dunedan, about 6 years ago
> Is there a better way to surf the web, retrieve the source code of the pages and extract data from them?

Yes, of course! To get the source code of a website you don't need a browser and all its complexity. It makes me so sad how far we have come in terms of unnecessary complexity for simple tasks.

If you want to extract data from web pages without requiring hundreds of megabytes for something like Electron, there are plenty of scraping libraries out there. For example, there are at least two good Python options: Scrapy[1] and BeautifulSoup[2].

[1]: https://scrapy.org/

[2]: https://www.crummy.com/software/BeautifulSoup/
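A minimal sketch of that browserless approach with requests and BeautifulSoup; the URL and the link extraction are placeholders for illustration, not anything taken from the comment:

    # Fetch a page without a browser and parse it with BeautifulSoup.
    # pip install requests beautifulsoup4
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com", timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Example extraction: the target and text of every link on the page.
    for link in soup.find_all("a"):
        print(link.get("href"), link.get_text(strip=True))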
TicklishTiger, about 6 years ago
I wish there was an easy way to send commands to the console of a browser.

That would be all I need to satisfy all my browser automation tasks.

Without installing and learning any frameworks.

Say there was a Linux command 'SendToChromium' that would do that for Chromium. Then to navigate to some page one could simply do this:

    SendToChromium location.href="/somepage.html"

SendToChromium should return the output of the command. So to get the HTML of the current page, one would simply do:

    SendToChromium document.body.innerHTML > theBody.html

Ideally the browser would listen for this type of command on a local port. So instead of needing a binary 'SendToChromium' one could simply start Chromium in listening mode:

    chromium --listen 12345

And then talk to it via HTTP:

    curl 127.0.0.1:12345/execute?command=location.href="/somepage.html"
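Something very close to this already exists in Chromium's DevTools Protocol: start the browser with a remote-debugging port and it accepts JSON commands over HTTP and WebSocket (Puppeteer is essentially a convenience layer over the same protocol). A rough sketch, assuming Chromium was launched with --headless --remote-debugging-port=9222 and that the requests and websocket-client packages are installed; the URL is a placeholder:

    # Drive an already-running Chromium over the DevTools Protocol, no framework.
    # pip install requests websocket-client
    import json
    import time

    import requests
    import websocket

    # The /json endpoint lists open targets; each carries a webSocketDebuggerUrl.
    targets = requests.get("http://127.0.0.1:9222/json").json()
    ws = websocket.create_connection(targets[0]["webSocketDebuggerUrl"])

    # Navigate the tab, then evaluate an expression in its console.
    ws.send(json.dumps({"id": 1, "method": "Page.navigate",
                        "params": {"url": "https://example.com"}}))
    ws.recv()      # response to the navigate command
    time.sleep(2)  # crude wait; a real script would listen for Page.loadEventFired

    ws.send(json.dumps({"id": 2, "method": "Runtime.evaluate",
                        "params": {"expression": "document.body.innerHTML"}}))
    reply = json.loads(ws.recv())
    print(reply["result"]["result"]["value"])
    ws.close()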
aboutruby, about 6 years ago
Interesting, but it seems less powerful than my current setup:

- I have mitmproxy to capture and manipulate the traffic

- I have Chrome opened with Selenium/Capybara/chromedriver and using mitmproxy

- I then browse to the target pages; it records the selected requests and the selected responses

- It then replays the requests until they fail (with a delay)

I highly recommend mitmproxy, it's extremely powerful: capture traffic, send responses without hitting the server, block/hang requests, modify responses, modify request/response headers.

Higher-level interfaces can then be built on top; Selenium lets you load Chrome extensions and execute JavaScript on any page, for instance. You can also manage many tabs at the same time.

I could make a blog post/demo if people are interested.
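A rough sketch of just the proxy wiring in a setup like this, assuming mitmproxy is already listening on 127.0.0.1:8080 and that Selenium plus a matching chromedriver are installed; the URL is a placeholder:

    # Route a Selenium-driven Chrome through a local mitmproxy instance.
    # pip install selenium    (start the proxy separately: mitmproxy -p 8080)
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--proxy-server=http://127.0.0.1:8080")
    # Quick-and-dirty way around mitmproxy's self-signed certificates;
    # the cleaner fix is to trust mitmproxy's CA certificate.
    options.add_argument("--ignore-certificate-errors")

    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")

    # Arbitrary JavaScript can be executed on the page, as mentioned above.
    html = driver.execute_script("return document.body.innerHTML")
    print(len(html))

    driver.quit()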
CGamesPlay, about 6 years ago
I'm going to plug my app that does scraping with Electron: https://github.com/CGamesPlay/chronicler

To the commenters who don't understand why this is necessary:

- It reliably loads linked resources in a WYSIWYG fashion, including embedded media and other things that have to be handled in an ad-hoc fashion when using something like BeautifulSoup.

- It handles resources loaded through JavaScript, including HTML5 History API changes.
SSchick, about 6 years ago
Are there any other advantages over things like webdriver or puppeteer?