> Is there a better way to surf the web, retrieve the source code of the pages and extract data from them?

Yes, of course! To get the source code of a web site you don't need a browser and all its complexity. It makes me sad how much unnecessary complexity we have accumulated for simple tasks.

If you want to extract data from web pages without needing hundreds of megabytes for something like Electron, there are plenty of scraping libraries out there. Python alone has at least two good ones: Scrapy[1] and BeautifulSoup[2].

[1]: https://scrapy.org/

[2]: https://www.crummy.com/software/BeautifulSoup/
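For a static page, a few lines of requests + BeautifulSoup go a long way. A minimal sketch (the URL is just a placeholder; pip install requests beautifulsoup4 first):

    import requests
    from bs4 import BeautifulSoup

    # Fetch the raw HTML -- no browser involved.
    html = requests.get("https://example.com").text
    soup = BeautifulSoup(html, "html.parser")

    # Extract all link targets from the page.
    for a in soup.find_all("a", href=True):
        print(a["href"])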
I wish there were an easy way to send commands to the console of a browser.

That would cover all my browser automation needs, without installing and learning any frameworks.

Say there were a Linux command 'SendToChromium' that did that for Chromium. Then to navigate to some page, one could simply do:

    SendToChromium location.href="/somepage.html"

SendToChromium should return the output of the command. So to get the HTML of the current page, one would simply do:

    SendToChromium document.body.innerHTML > theBody.html

Ideally the browser would listen for this type of command on a local port. So instead of needing a binary 'SendToChromium', one could simply start Chromium in listening mode:

    chromium --listen 12345

And then talk to it via HTTP:

    curl 127.0.0.1:12345/execute?command=location.href="/somepage.html"
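Something quite close to this already exists: the Chrome DevTools Protocol. Start the browser with --remote-debugging-port=9222 and it serves an HTTP endpoint listing its tabs, each with a WebSocket that accepts Runtime.evaluate commands. A rough Python sketch of the hypothetical SendToChromium on top of that (assuming no other debugger is attached to the tab; pip install requests websockets):

    import asyncio
    import json

    import requests
    import websockets

    async def send_to_chromium(expression, port=9222):
        # Ask the browser for its open targets and pick the first page tab.
        tabs = requests.get(f"http://127.0.0.1:{port}/json").json()
        ws_url = next(t["webSocketDebuggerUrl"] for t in tabs if t["type"] == "page")
        async with websockets.connect(ws_url) as ws:
            # Runtime.evaluate runs a JS expression in the page context
            # and, with returnByValue, sends the result back as JSON.
            await ws.send(json.dumps({
                "id": 1,
                "method": "Runtime.evaluate",
                "params": {"expression": expression, "returnByValue": True},
            }))
            reply = json.loads(await ws.recv())
            return reply["result"]["result"].get("value")

    # Equivalent of: SendToChromium document.body.innerHTML > theBody.html
    print(asyncio.run(send_to_chromium("document.body.innerHTML")))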
Interesting, but it seems less powerful than my current setup:

- I have mitmproxy to capture and manipulate the traffic

- I have Chrome opened with Selenium/Capybara/chromedriver, routed through mitmproxy

- I then browse to the target pages, and it records the selected requests and responses

- It then replays the requests (with a delay) until they fail

I highly recommend mitmproxy, it's extremely powerful: capture traffic, send responses without hitting the server, block/hang requests, modify requests and responses and their headers (see the sketch below).

Then higher-level interfaces can be built on top; Selenium, for instance, lets you load Chrome extensions and execute JavaScript on any page. You can also manage many tabs at the same time.

I could write a blog post/demo if people are interested.
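To illustrate the kinds of manipulation described above, here is a small mitmproxy addon sketch (assuming a recent mitmproxy; the URL match is just an example; run with mitmproxy -s addon.py):

    # addon.py
    from mitmproxy import http

    def request(flow: http.HTTPFlow) -> None:
        # Answer without hitting the server: serve a canned response.
        if flow.request.pretty_url.endswith("/tracker.js"):
            flow.response = http.Response.make(
                200, b"", {"Content-Type": "application/javascript"}
            )

    def response(flow: http.HTTPFlow) -> None:
        # Modify a response header on its way back to the browser.
        flow.response.headers["x-intercepted"] = "true"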
I'm going to plug my app that does scraping with Electron: https://github.com/CGamesPlay/chronicler

To the commenters who don't understand why this is necessary:

- It reliably loads linked resources in a WYSIWYG fashion, including embedded media and other things that have to be handled in an ad hoc fashion when using something like BeautifulSoup.

- It handles resources loaded through JavaScript, including HTML5 History API changes.