I only use an iFrame to crawl and scrape content

287 points by natzar over 5 years ago

18 comments

hugs over 5 years ago

This is how the very first version of Selenium worked. The application under test was in an iframe, and the test controller was in the parent page. The Selenium "Remote Control" protocol was later added, where the controller would phone home to a listening web server for commands to relay to the iframe (basically, AJAX before it had a name). It all mostly worked for the most common test cases, but we abandoned this approach for reasons similar to those mentioned in the article: the edge-case limitations became more and more frustrating over time. Ultimately, we merged with the WebDriver project, which was implemented in a more native way, avoiding all the limitations of automation-via-iframe.
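For readers who haven't seen that design, here is a rough sketch of the relay idea described above: a controller in the parent page pulls commands from a server and applies them to the document inside the iframe. The command shape (`action`, `selector`, `value`) and the `fetchCommand`/`postResult` callbacks are made up for illustration and are not Selenium's actual Remote Control protocol.

```javascript
// Hypothetical command shape: { action: "click" | "type" | "read", selector, value? }.
// This is an illustration, not Selenium's real wire format.
function applyCommand(frameDoc, command) {
  const el = frameDoc.querySelector(command.selector);
  if (!el) {
    return { ok: false, error: "no element matches " + command.selector };
  }
  switch (command.action) {
    case "click":
      el.click();
      return { ok: true };
    case "type":
      el.value = command.value;
      return { ok: true };
    case "read":
      return { ok: true, value: el.textContent };
    default:
      return { ok: false, error: "unknown action " + command.action };
  }
}

// Controller loop for the parent page. fetchCommand and postResult are
// injected (assumed) callbacks that talk to the listening server;
// fetchCommand resolves to null when there is nothing left to run.
async function relayLoop(frameDoc, fetchCommand, postResult) {
  for (;;) {
    const command = await fetchCommand();
    if (command === null) break;
    await postResult(applyCommand(frameDoc, command));
  }
}
```

In a browser, `frameDoc` would be `iframe.contentDocument`, which is only readable when the framed page shares the parent's origin.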


fenwick67 over 5 years ago

Yeesh, everyone is so critical here. It's just a blog post about how somebody does occasional one-off scraping across multiple pages using browser devtools.

Yes, of course injecting an iframe into a third-party site with devtools isn't going to replace Selenium. But it's a clever little hack in a pinch. No need to get upset.


natzar over 5 years ago

Wow, I just posted this, went to take a nap, got back and 84 points ¿?!

The site is working now; it was retrieving some localhost scripts.

I was just trying to get some feedback and to check whether the document was interesting, because I was spending a lot of time on it.

From what I see, it seems Selenium does exactly the same, but I would choose this iframe solution (for small to medium projects) anyway.

It's a super small tool that does the job.

Please let me know if you can fully use airovic.com

captn3m0 over 5 years ago

Won't the same-origin policy kick in the moment I try to read the content of an iframe that isn't on the same origin as my website?

Or is this meant to run in the dev console on the target website? In which case, the iFrame and the Airovic website don't make sense (the Electron app mentioned does, sure, but it doesn't exist).
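To make the same-origin question concrete: two URLs share an origin only when scheme, host, and port all match. The sketch below checks that with the standard `URL` class; the helper names `isSameOrigin` and `readableDocument` are invented for this example, and the second one only makes sense inside a browser.

```javascript
// Two URLs are same-origin when scheme, host, and port all match.
// URL#origin already normalizes default ports (e.g. :443 for https).
function isSameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

// Browser-only helper (hypothetical name): returns the framed document,
// or null when the frame is cross-origin. contentDocument is null in
// that case, so the parent page cannot read the DOM.
function readableDocument(iframe) {
  return iframe.contentDocument || null;
}
```

So a snippet pasted into the dev console on a craigslist.org page can read an iframe pointing at another page of the same craigslist.org host, while a page served from a different site cannot.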


pcr910303 over 5 years ago

Okay, the way I see this is that using headless tools like Puppeteer or Selenium is tedious; just trying to... er, scrape my HN account's favorites (AFAIK no API) becomes a task when you have to automate login.

Just typing in and pressing the button is much easier than automating the task, so that's why the iframe is useful. You can interact with the content (without code).


asdfman123 over 5 years ago

This sounds hilariously n00by because it's VB and Internet Explorer, but creating an Internet Explorer instance through VB in, say, Excel and then dumping data into Excel was great because I had full control over my IE instance.

Okay, I'll stop speaking now and revealing the fact that I started my career as a data guy at a giant corporation instead of a software engineer.


oefrha over 5 years ago

I can do everything listed in benefits with puppeteer, while I can’t even make sense of what the iframe is supposed to achieve here, or how it’s even gonna load (anyone with a shred of sense would set X-Frame-Options to SAMEORIGIN, subject to exceptions). The airovic.com site doesn’t work and hilariously attempts to load two seemingly important scripts from localhost...

I’m very confused about this submission, and even more confused about how it managed to almost top the front page.

Edit: Having read the code samples, it seems the code snippets are supposed to be run from the same origin in the dev console. A quick and dirty way to interactively scrape without navigation, I guess? Still not sure what the “all together: Airovic.com” is supposed to mean, and it's definitely more limited than puppeteer.

Edit2: To be fair to the author, they did say

> You cannot bypass their protections without using a HTTP Tunneling component.

Which I didn’t see until just now. This is a pretty big caveat though, should probably be more upfront...
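As a rough illustration of the X-Frame-Options point, the sketch below models how a browser decides whether a page may be framed. It is a simplification under stated assumptions: `allowsFraming` is a made-up helper, it looks only at the direct parent, it ignores the CSP `frame-ancestors` directive, and it treats unrecognized header values as no restriction.

```javascript
// Simplified model of the X-Frame-Options check (assumptions: only the
// direct parent is considered, and CSP frame-ancestors is not modeled).
function allowsFraming(xFrameOptions, parentUrl, targetUrl) {
  if (xFrameOptions == null) return true; // header not sent
  const value = xFrameOptions.trim().toUpperCase();
  if (value === "DENY") return false;
  if (value === "SAMEORIGIN") {
    return new URL(parentUrl).origin === new URL(targetUrl).origin;
  }
  return true; // unrecognized value: treated here as no restriction
}
```

With `SAMEORIGIN` set, the frame loads when the snippet runs from the dev console of the target site itself, and is refused when the parent page lives on another domain.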


elierotenberg over 5 years ago

For this kind of task I usually create a private Firefox extension, which gives me access to extended browser capabilities and the ability to lift some security-related restrictions. I run it in a sandboxed browser, much like I would with something like Selenium or Puppeteer, but I have many more options to hand-tune the automation.

ravenstine over 5 years ago

Depending on the nature of the content being scraped, you can use the `sandbox` attribute on the iFrame to prevent scripts from running.

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe#attr-sandbox

This was useful for a brief period when I ran a news aggregator that used iFrames to display content from other news websites. Adding the sandbox attribute prevented scripts, ads, modals, etc.

For the purpose of scraping, unless you're always on the same domain (or running a proxy to add CORS), I don't see how an iFrame is better than either a web extension or a backend script using Puppeteer.
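A small sketch of the `sandbox` idea, with made-up helper names (`escapeAttr`, `sandboxedIframe`): an empty `sandbox` value applies every restriction, including blocking scripts. For scraping, passing `allow-same-origin` without `allow-scripts` keeps scripts off while letting a same-origin parent still read the framed DOM.

```javascript
// Escape a value for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build iframe markup with a sandbox attribute. An empty token list means
// sandbox="", i.e. all restrictions on (scripts, forms, popups blocked).
function sandboxedIframe(url, tokens = []) {
  return `<iframe src="${escapeAttr(url)}" sandbox="${escapeAttr(tokens.join(" "))}"></iframe>`;
}

// Scripts stay disabled, but the frame keeps its real origin so a
// same-origin parent can still read its contentDocument.
const markup = sandboxedIframe("https://example.com/news", ["allow-same-origin"]);
```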

heyplanet over 5 years ago

Is the only reason for the iframe so that it is possible to keep state in the top frame while loading different pages?

Because otherwise, since you use the dev tools to inject the iframe, you don't really need the iframe. You can just run it as a "snippet" in Chromium or from the multi-line code editor in Firefox.

Both have the problem that it all has to be a single file. It would be much nicer if one could import modules.
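The "keep state in the top frame" idea can be sketched roughly like this: the results array lives in the parent page and survives while the iframe navigates from page to page. `scrapePages` and `makeIframeLoader` are hypothetical names, and the loader only works in a browser against same-origin pages.

```javascript
// State (the results array) lives in the top frame; only the iframe navigates.
// loadPage(url) resolves to a document-like object, extract(doc) returns an array.
async function scrapePages(urls, loadPage, extract) {
  const results = [];
  for (const url of urls) {
    const doc = await loadPage(url);
    results.push(...extract(doc));
  }
  return results;
}

// Browser-only loader (assumption: the framed pages are same-origin, so
// contentDocument is readable once the load event fires).
function makeIframeLoader(iframe) {
  return (url) =>
    new Promise((resolve) => {
      iframe.onload = () => resolve(iframe.contentDocument);
      iframe.src = url;
    });
}
```

In the dev console this might be called as `scrapePages(urls, makeIframeLoader(iframe), doc => [...doc.querySelectorAll('.result-title')].map(el => el.textContent))`, where the selector is taken from the Craigslist example in the article.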


3xblah over 5 years ago

    $iframe.contents().find('.result-row').each(function(){
      data.push({
        title: $(this).find('.result-title').text(),
        img: $(this).find('img').attr("src"),
        price: $(this).find('.result-price:first').text()
      });
    });

    // And everything starts running when you set first iframe's target url
    $iframe.prop("src", "https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa");

Looks like he wants output something like

    title:
    img:
    price:

I tried reproducing this example without using JavaScript, instead using curl and sed. The output is

    image:
    title:
    price:

I did not try to move "title:" above "image:", though I bet this could be done using the hold space. Nor did I format this as JSON, though that would be easy to do.

    n=0;while true;do test $n -le 3000||break;
    curl "https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa?s=$n"|sed -n '
    /result-title hdrlnk/{s/.*\">/title: /;s/<.*//;/^title: /p;};
    /./{/result-meta/,/\/span/{/result-price/s/.*\">/price: /;s/<.*//;/price/p;};};
    /data-ids=\"/{s|1:[^,\">]*|https://images.craigslist.org/&_600x450.jpg|g;s/,/, /g;
    s/1://g;s/>//;s/.*data-ids=/image: /;/^image: /p;}'
    n=$((n+120));done
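For the JSON step that was left out above, a minimal sketch might look like the following. It assumes the `image:` / `title:` / `price:` line layout described in the comment, that `image:` (when present) comes first for a listing, and that a repeated key marks the start of the next listing. `linesToRecords` is a made-up name.

```javascript
// Group "image: ...", "title: ...", "price: ..." lines into one object per listing.
// Assumption: a new listing starts at an "image:" line or when a key repeats.
function linesToRecords(text) {
  const records = [];
  let current = null;
  for (const line of text.split("\n")) {
    const match = /^(image|title|price): ?(.*)$/.exec(line.trim());
    if (!match) continue; // skip blank or unrelated lines
    const [, key, value] = match;
    if (current === null || key === "image" || key in current) {
      current = {};
      records.push(current);
    }
    current[key] = value;
  }
  return records;
}

// Example: pretty-print the records as JSON.
const sample = "image: a.jpg\ntitle: Sunny 1BR\nprice: $2000\ntitle: Studio\nprice: $1500\n";
console.log(JSON.stringify(linesToRecords(sample), null, 2));
```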

ChrisSD over 5 years ago

I've done something similar in Firefox with Scratchpad. The main reason is simply convenience. I don't need to switch to a different workflow; I merely bring up Scratchpad (I often already have a window open with some utility functions) and can start hacking away immediately.

Sadly, Scratchpad is going away soon. Fortunately the console now has a multiline mode; unfortunately it's not as convenient for this use.

kseo3l over 5 years ago

Why not use something like proxycrawl? Controlling an iframe is slow and painful.

GiantSully over 5 years ago

You can even inject a browser extension into Chrome with Selenium, or even back Selenium with an upstream proxy. So why an iframe? What's the edge?

ausjke over 5 years ago

I don't understand why the iframe is a must here. Why can't I just scrape the whole page directly? I'm still learning web scraping with Scrapy.


thenewnewguy over 5 years ago

Maybe I'm missing something obvious, but can anyone explain to me how this is better than using a tool like Selenium for scraping? I guess this might be easier to quickly set up and play around with for one-off scraping?

gmac over 5 years ago

I describe something very similar here: https://github.com/jawj/web-scraping-for-researchers

iamleppert over 5 years ago

Working for a big tech company, stuff like this infuriates me.

It’s exactly why we’re currently pushing for the ability to disable developer tools; we want it added to Chrome and other browsers. I should be able to, as a web site owner, not allow any kind of developer tool usage.

Users do not own our product and have no right to go poking around like this!
