I've found JXA useful for simple web automation.<p>There is a site where I want to go to a few pages and extract some data. Their terms of service allow this as long as you do not access the site faster than a human browsing would.<p>I used to do it with curl, with suitable delays between each page fetch.<p>Then they apparently got hit with some DoS attacks and started using Cloudflare protection, and curl no longer worked.<p>I then switched to Selenium. That got stuck at hCaptcha, which refused to let me through no matter how many times I correctly identified all the buses or trains or whatever.<p>Adding some experimental options in Selenium got around the hCaptcha loop:<p><pre><code> options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--disable-blink-features=AutomationControlled")
</code></pre>
A while later, though, the site started occasionally bringing up that Cloudflare "Checking your browser before accessing <site_name>" automatic check. That just sits there, occasionally reloading.<p>The same thing happens if I use undetected-chromedriver [1], which is supposed to be less likely to trigger anti-bot services.<p>Next up was using a bookmarklet. It was easy to make a bookmarklet that used an XMLHttpRequest to post the content of document.documentElement.innerHTML to a server of mine for processing. That bookmarklet could then advance to the next page of interest. It wouldn't be as nice as the prior solutions because I'd have to click the bookmarklet for every page, but that wouldn't be too bad.<p>The bigger annoyance there is that if my server isn't HTTPS, the XMLHttpRequest is blocked because of mixed content. In Chrome you can tell it to ignore that on a per-site basis, but the setting isn't persistent across browser launches. There are also issues if the server is on my LAN, which runs into the "don't let JavaScript on a page from the internet access the LAN because it might be trying to hack your IoT" stuff.<p>I was just starting to look into browser extensions to see if that would be a viable approach. I don't know (1) whether an extension can save to the local filesystem (which I'd prefer instead of having to do that XMLHttpRequest runaround), or (2) whether an extension can carry out actions on multiple pages without user intervention (if I would still have to do the click per page, I can just stick with the bookmarklet approach).<p>I was even contemplating trying to make a fork of Chromium or Firefox with built-in support for this stuff.<p>Then I remembered AppleScript, and that things that support AppleScript can also be controlled with JavaScript, which is much much much less weird than AppleScript.<p>Here's some JXA code that saves the current tab's source from Safari:<p><pre><code> var cur = Application.currentApplication()
cur.includeStandardAdditions = true;
var app = Application('Safari');
app.includeStandardAdditions = true;
function save_page(file_name)
{
    let source = app.windows[0].currentTab.source()
    let path = Path(file_name)
    let f = cur.openForAccess(path, {writePermission: true})  // open for writing
    cur.setEof(f, {to: 0})  // truncate any existing content
    cur.write(source, {to: f, startingAt: 0})
    cur.closeAccess(f)
}
save_page("/tmp/page.html")
</code></pre>
Here are the changes to do that for Chrome (sort of...it doesn't quite get the full source):<p><pre><code> 3c3
< var app = Application('Safari');
---
> var app = Application('Google Chrome');
8c8,9
< let source = app.windows[0].currentTab.source()
---
> let ct = app.windows[0].activeTab
> let source = app.execute(ct, {javascript: "document.documentElement.innerHTML"})
</code></pre>
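One possible reason the Chrome version falls short is that innerHTML on the root element leaves out that element's own tag and the doctype. Here is a rough sketch, assuming the same app.execute call as in the diff above, that asks the page for a bit more. The names PAGE_JS and chromeSource are made up for illustration, and the result is still the serialized live DOM, not the bytes the server sent:

```javascript
// Sketch only: PAGE_JS and chromeSource are illustrative names.

// JavaScript evaluated inside the page. It prepends a doctype line (when the
// document has one) to the root element's outerHTML.
const PAGE_JS =
    "(document.doctype ? '<!DOCTYPE ' + document.doctype.name + '>\\n' : '')" +
    " + document.documentElement.outerHTML";

// `app` is expected to be Application('Google Chrome') when run under JXA.
function chromeSource(app) {
    const tab = app.windows[0].activeTab;
    return app.execute(tab, {javascript: PAGE_JS});
}
```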
Changing pages is also easy. For Safari,<p><pre><code> app.windows[0].currentTab.url = "..."
</code></pre>
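Putting that together with the source() call from earlier, here is a minimal sketch of stepping through several pages in Safari at a human-ish pace. The names visitAll, pauseSeconds and sleep are made up for illustration, the 10 second pause is an arbitrary assumption, and a fixed pause is only a stand-in for a real "has the page finished loading" check:

```javascript
// Sketch only: visitAll, pauseSeconds and sleep are illustrative names.

// Load each URL in the front Safari window's current tab, pause, then read
// the tab's source. The pause gives the page time to load and keeps the
// request rate closer to that of a person browsing.
function visitAll(app, urls, pauseSeconds, sleep) {
    const pages = [];
    for (const url of urls) {
        app.windows[0].currentTab.url = url;   // same assignment as above
        sleep(pauseSeconds);                   // fixed pause, not a load check
        pages.push({url: url, source: app.windows[0].currentTab.source()});
    }
    return pages;
}

// osascript calls run() with the command-line arguments. Nothing above
// touches Safari until this function is invoked.
function run(argv) {
    const safari = Application('Safari');
    // delay() is JXA's built-in sleep, taking seconds.
    const pages = visitAll(safari, argv, 10, delay);
    // Report each URL with the length of the source that was captured.
    return pages.map(p => p.url + " " + p.source.length).join("\n");
}
```

Saved as, say, visit.js (a hypothetical file name), this would be run with osascript -l JavaScript visit.js followed by the URLs to visit.<p>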
I don't remember what the equivalent is for Chrome, but it is similarly easy. And on both there is also the option of injecting JavaScript onto the page to do the navigation, if for some reason just setting the URL isn't good enough.<p>(No Firefox examples because Firefox has absolutely abysmal AppleScript support.)<p>With JXA, then, I can write a handful of short command line utilities that can get the current URL, get the current page content, and navigate to a given URL, and I've got everything I need.<p>(There might be a way to sort of make it work with Firefox. I believe (but am not certain) that with JXA I can tell the system to send mouse-click events to anywhere I want in an application window. Firefox does provide enough AppleScript support to get the title of the current page. So maybe the bookmarklet approach, with JXA watching for the page changes and clicking the bookmarklet, would work.)<p>[1] <a href="https://github.com/ultrafunkamsterdam/undetected-chromedriver" rel="nofollow">https://github.com/ultrafunkamsterdam/undetected-chromedrive...</a>