> Is there a better way to surf the web, retrieve the source code of the pages and extract data from them?

Yes, of course! To get the source code of a web site you don't need a browser and all its complexity. It makes me sad how much unnecessary complexity we have accumulated for simple tasks.

If you want to extract data from web pages without needing hundreds of megabytes for something like Electron, there are plenty of scraping libraries out there. Python alone has at least two good ones: Scrapy[1] and BeautifulSoup[2].

[1]: https://scrapy.org/

[2]: https://www.crummy.com/software/BeautifulSoup/
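For a static page, a few lines of requests + BeautifulSoup go a long way. A minimal sketch (the URL is just a placeholder; pip install requests beautifulsoup4 first):

    import requests
    from bs4 import BeautifulSoup

    # Fetch the raw HTML -- no browser involved.
    html = requests.get("https://example.com").text
    soup = BeautifulSoup(html, "html.parser")

    # Extract all link targets from the page.
    for a in soup.find_all("a", href=True):
        print(a["href"])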
I wish there were an easy way to send commands to the console of a browser.

That would cover all my browser automation needs, without installing and learning any frameworks.

Say there were a Linux command 'SendToChromium' that did that for Chromium. Then to navigate to some page, one could simply do:

    SendToChromium location.href="/somepage.html"

SendToChromium should return the output of the command. So to get the HTML of the current page, one would simply do:

    SendToChromium document.body.innerHTML > theBody.html

Ideally the browser would listen for this type of command on a local port. So instead of needing a binary 'SendToChromium', one could simply start Chromium in listening mode:

    chromium --listen 12345

And then talk to it via HTTP:

    curl 127.0.0.1:12345/execute?command=location.href="/somepage.html"
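Something quite close to this already exists: the Chrome DevTools Protocol. Start the browser with --remote-debugging-port=9222 and it serves an HTTP endpoint listing its tabs, each with a WebSocket that accepts Runtime.evaluate commands. A rough Python sketch of the hypothetical SendToChromium on top of that (assuming no other debugger is attached to the tab; pip install requests websockets):

    import asyncio
    import json

    import requests
    import websockets

    async def send_to_chromium(expression, port=9222):
        # Ask the browser for its open targets and pick the first page tab.
        tabs = requests.get(f"http://127.0.0.1:{port}/json").json()
        ws_url = next(t["webSocketDebuggerUrl"] for t in tabs if t["type"] == "page")
        async with websockets.connect(ws_url) as ws:
            # Runtime.evaluate runs a JS expression in the page context
            # and, with returnByValue, sends the result back as JSON.
            await ws.send(json.dumps({
                "id": 1,
                "method": "Runtime.evaluate",
                "params": {"expression": expression, "returnByValue": True},
            }))
            reply = json.loads(await ws.recv())
            return reply["result"]["result"].get("value")

    # Equivalent of: SendToChromium document.body.innerHTML > theBody.html
    print(asyncio.run(send_to_chromium("document.body.innerHTML")))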
Interesting, but it seems less powerful than my current setup:

- I have mitmproxy to capture and manipulate the traffic

- I have Chrome opened with Selenium/Capybara/chromedriver, routed through mitmproxy

- I then browse to the target pages, and it records the selected requests and responses

- It then replays the requests (with a delay) until they fail

I highly recommend mitmproxy, it's extremely powerful: capture traffic, send responses without hitting the server, block/hang requests, modify requests and responses and their headers (see the sketch below).

Then higher-level interfaces can be built on top; Selenium, for instance, lets you load Chrome extensions and execute JavaScript on any page. You can also manage many tabs at the same time.

I could write a blog post/demo if people are interested.
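To illustrate the kinds of manipulation described above, here is a small mitmproxy addon sketch (assuming a recent mitmproxy; the URL match is just an example; run with mitmproxy -s addon.py):

    # addon.py
    from mitmproxy import http

    def request(flow: http.HTTPFlow) -> None:
        # Answer without hitting the server: serve a canned response.
        if flow.request.pretty_url.endswith("/tracker.js"):
            flow.response = http.Response.make(
                200, b"", {"Content-Type": "application/javascript"}
            )

    def response(flow: http.HTTPFlow) -> None:
        # Modify a response header on its way back to the browser.
        flow.response.headers["x-intercepted"] = "true"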
I'm going to plug my app that does scraping with Electron: https://github.com/CGamesPlay/chronicler

To the commenters who don't understand why this is necessary:

- It reliably loads linked resources in a WYSIWYG fashion, including embedded media and other things that have to be handled in an ad hoc fashion when using something like BeautifulSoup.

- It handles resources loaded through JavaScript, including HTML5 History API changes.