This is how the very first version of Selenium worked. The application under test was in an iframe, and the test controller was in the parent page. The Selenium "Remote Control" protocol was added later, where the controller would phone home to a listening web server for commands to relay to the iframe (basically, AJAX before it had a name). It mostly worked for the most common test cases, but we abandoned this approach for reasons similar to those mentioned in the article -- the edge-case limitations became more and more frustrating over time. Ultimately, we merged with the WebDriver project, which was implemented in a more native way, avoiding all the limitations of automation-via-iframe.
Yeesh, everyone is so critical here. It's just a blog post about how somebody does occasional one-off scraping across multiple pages using browser devtools.

Yes, of course injecting an iframe into a third-party site with devtools isn't going to replace Selenium. But it's a clever little hack in a pinch. No need to get upset.
Wow, I just posted this, went to take a nap, got back and it has 84 points?!

The site is working now; it was retrieving some localhost scripts.

I was just trying to get some feedback and to check whether the document was interesting, because I was spending a lot of time on it.

From what I see, it seems Selenium does exactly the same thing, but I would choose this iframe solution anyway (for small-to-medium projects).

It's a super small tool that does the job.

Please let me know if you can fully use airovic.com.
Won't the same-origin policy kick in the moment I try to read the content of an iframe that isn't on the same origin as my website?

Or is this meant to be run in the dev console on the target website? In that case, the iframe and the Airovic website don't make sense (the Electron app mentioned would, sure, but it doesn't exist).
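For reference, this is roughly how that block shows up in practice (a minimal sketch; the URLs are placeholders):

    // Run from a page on https://example.com.
    const frame = document.createElement('iframe');
    frame.src = 'https://other-site.example'; // cross-origin placeholder
    document.body.appendChild(frame);

    frame.addEventListener('load', () => {
      console.log(frame.contentDocument); // null for a cross-origin frame
      try {
        // Throws a "SecurityError" DOMException: the parent page may not
        // touch the DOM of a cross-origin window.
        console.log(frame.contentWindow.document.title);
      } catch (e) {
        console.log('Same-origin policy:', e.name);
      }
    });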
Okay, the way I see this: using headless tools like Puppeteer or Selenium is tedious; just trying to... er, scrape my HN account's favorites (AFAIK there's no API) becomes a chore when you have to automate the login.

Just typing it in and pressing the button is much easier than automating the task, so that's why the iframe is useful: you can interact with the content (without code).
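As an illustration of that workflow, here's a minimal console sketch for the favorites example: log in by hand, then run this on news.ycombinator.com (the username is a placeholder, and the selectors are assumptions about HN's markup that may change):

    // Run in the devtools console on https://news.ycombinator.com while logged in.
    const res = await fetch('/favorites?id=YOUR_USERNAME'); // placeholder username
    const doc = new DOMParser().parseFromString(await res.text(), 'text/html');
    const favs = [...doc.querySelectorAll('.athing .titleline > a')].map(a => ({
      title: a.textContent,
      url: a.getAttribute('href'),
    }));
    console.table(favs);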
This sounds hilariously n00by because it's VB and Internet Explorer, but creating an Internet Explorer instance through VB in, say, Excel and then dumping data into Excel was great, because I had full control over my IE instance.

Okay, I'll stop speaking now and revealing the fact that I started my career as a data guy at a giant corporation instead of as a software engineer.
I can do everything listed under benefits with Puppeteer, while I can't even make sense of what the iframe is supposed to achieve here, or how it's even going to load (anyone with a shred of sense would set X-Frame-Options to SAMEORIGIN, subject to exceptions). The airovic.com site doesn't work and hilariously attempts to load two seemingly important scripts from localhost...

I'm very confused about this submission, and even more confused about how it managed to almost top the front page.

Edit: Having read the code samples, it seems the snippets are supposed to be run from the same origin in the dev console. A quick and dirty way to interactively scrape without navigation, I guess? Still not sure what "all together: Airovic.com" is supposed to mean, and it's definitely more limited than Puppeteer.

Edit 2: To be fair to the author, they did say

> You cannot bypass their protections without using a HTTP Tunneling component.

which I didn't see until just now. This is a pretty big caveat though, and should probably be more upfront...
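For reference, a minimal sketch of that kind of framing protection -- a toy Node server (hypothetical) refusing to be embedded by other origins:

    // node server.js -- refuses to render inside a cross-origin <iframe>.
    const http = require('http');

    http.createServer((req, res) => {
      // Browsers will refuse to display this response framed by any other
      // origin (the modern equivalent is the Content-Security-Policy
      // "frame-ancestors" directive).
      res.setHeader('X-Frame-Options', 'SAMEORIGIN');
      res.setHeader('Content-Type', 'text/html');
      res.end('<h1>Cannot be framed cross-origin</h1>');
    }).listen(8080);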
For this kind of task I usually create a private Firefox extension, which gives me access to extended browser capabilities and the ability to lift some security-related restrictions. I run it in a sandboxed browser, much like I would with something like Selenium or Puppeteer, but I have many more options to hand-tune the automation.
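A minimal sketch of that setup (file names, match patterns, and selectors are just examples): a content script the private extension injects into matching pages, running with privileges the plain console doesn't have:

    // content.js -- injected by the extension into matching pages.
    // Paired with a manifest.json roughly like:
    //   { "manifest_version": 2, "name": "my-scraper", "version": "0.1",
    //     "permissions": ["<all_urls>"],
    //     "content_scripts": [{ "matches": ["<all_urls>"], "js": ["content.js"] }] }
    // Load it via about:debugging -> "Load Temporary Add-on".

    const rows = document.querySelectorAll('.result-row'); // example selector
    const data = [...rows].map(row => ({
      title: row.querySelector('.result-title')?.textContent,
      price: row.querySelector('.result-price')?.textContent,
    }));
    console.log(JSON.stringify(data, null, 2));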
Depending on the nature of the content being scraped, you can use the `sandbox` attribute on the iframe to prevent scripts from running.

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe#attr-sandbox

This was useful for a brief period when I ran a news aggregator that used iframes to display content from other news websites. Adding the sandbox attribute prevented scripts, ads, modals, etc.

For the purpose of scraping, unless you're always on the same domain (or running a proxy to add CORS headers), I don't see how an iframe is better than either a web extension or a backend script using Puppeteer.
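A minimal sketch of that attribute in the scraping context (assuming a same-origin page; the URL and selector are placeholders). Omitting allow-scripts keeps the framed page's own JavaScript from running while you read its DOM:

    const frame = document.createElement('iframe');
    // allow-same-origin keeps the DOM readable from the parent;
    // leaving out allow-scripts means the framed page's JS never runs.
    frame.setAttribute('sandbox', 'allow-same-origin');
    frame.src = '/some-listing-page'; // same-origin page, as an example

    frame.addEventListener('load', () => {
      const titles = frame.contentDocument.querySelectorAll('.result-title');
      console.log([...titles].map(el => el.textContent));
    });

    document.body.appendChild(frame);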
Is the only reason for the iframe that it makes it possible to keep state in the top frame while loading different pages?

Because otherwise - since you use the dev tools to inject the iframe - you don't really need the iframe. You can just run it as a "snippet" in Chromium or from the multi-line code editor in Firefox.

Both have the problem that it all has to be a single file. It would be much nicer if one could import modules.
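That does appear to be the main trick. Roughly, as a plain-JS sketch (placeholder URLs and selectors, same-origin pages assumed):

    // State lives in the top frame and survives the iframe's navigations.
    const data = [];
    const pages = ['/search?page=1', '/search?page=2']; // placeholders
    let i = 0;

    const frame = document.createElement('iframe');
    frame.addEventListener('load', () => {
      // Scrape the page that just loaded (same-origin only).
      frame.contentDocument.querySelectorAll('.result-row').forEach(row => {
        data.push(row.textContent.trim());
      });
      if (++i < pages.length) frame.src = pages[i]; // navigate to next page
      else console.log(data);                       // done: state intact
    });

    frame.src = pages[0];
    document.body.appendChild(frame);

The iframe buys you exactly that persistence; a snippet alone loses `data` on every navigation.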
    $iframe.contents().find('.result-row').each(function(){
      data.push({
        title: $(this).find('.result-title').text(),
        img: $(this).find('img').attr("src"),
        price: $(this).find('.result-price:first').text()
      });
    });

    // And everything starts running when you set the first iframe target URL
    $iframe.prop("src", "https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa");
Looks like he wants output something like

    title:
    img:
    price:
I tried reproducing this example without using Javascript, instead using curl and sed. The output is

    image:
    title:
    price:
I did not try to move "title:" above "image:", though I bet this could be done using the hold space. Nor did I format this as JSON, though that would be easy to do.

    n=0;while true;do test $n -le 3000||break;
    curl https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa?s=$n|sed -n '
    /result-title hdrlnk/{s/.*\">/title: /;s/<.*//;/^title: /p;};
    /./{/result-meta/,/\/span/{/result-price/s/.*\">/price: /;s/<.*//;/price/p;};};
    /data-ids=\"/{s|1:[^,\">]*|https://images.craigslist.org/&_600x450.jpg|g;s/,/, /g;
    s/1://g;s/>//;s/.*data-ids=/image: /;/^image: /p;}'
    n=$((n+120));done
I've done something similar in Firefox with Scratchpad. The main reason is simply convenience: I don't need to switch to a different workflow, I merely bring up Scratchpad (I often already have a window open with some utility functions) and can start hacking away immediately.

Sadly, Scratchpad is going away soon. Fortunately the console now has a multiline mode; unfortunately it's not as convenient for this use.
You can even inject a browser extension into Chrome with Selenium, or even back Selenium with an upstream proxy. So why the iframe, what's the edge?
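For instance, with the selenium-webdriver package for Node, something like this (the extension path and proxy address are placeholders):

    // npm install selenium-webdriver (and have chromedriver on PATH)
    const { Builder } = require('selenium-webdriver');
    const chrome = require('selenium-webdriver/chrome');
    const proxy = require('selenium-webdriver/proxy');

    (async () => {
      const options = new chrome.Options()
        .addExtensions('/path/to/extension.crx'); // packed extension, placeholder

      const driver = await new Builder()
        .forBrowser('chrome')
        .setChromeOptions(options)
        .setProxy(proxy.manual({ http: 'localhost:8888' })) // upstream proxy, placeholder
        .build();

      await driver.get('https://example.com');
      console.log(await driver.getTitle());
      await driver.quit();
    })();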
Maybe I'm missing something obvious, but can anyone explain to me how this is better than using a tool like Selenium for scraping? I guess it might be easier to set up quickly and play around with for one-off scraping?
I describe something very similar here: https://github.com/jawj/web-scraping-for-researchers
Working for a big tech company, stuff like this infuriates me.

It's exactly why we're currently pushing for the ability to disable developer tools; we want it added to Chrome and other browsers. I should be able to, as a web site owner, not allow any kind of developer tool usage.

Users do not own our product and have no right to go poking around like this!