I found a few leads googling around the Palo Alto Networks docs website:<p>- "Advanced URL Filtering" seems to have a feature where web content can either be evaluated "inline" or "web payload data is also submitted to Advanced URL Filtering in the cloud" [1].<p>- If a URL is considered too risky to load on the user's endpoint, it can instead be loaded via "Remote Browser Isolation" in a remote-desktop-like session, on demand, for that single page only [2].<p>I think either (or both) could explain the signals you're detecting.<p>[1]: <a href="https://docs.paloaltonetworks.com/advanced-url-filtering/administration/url-filtering-basics/how-url-filtering-works#:~:text=If%20the%20URL%20displays%20risky%20or%20malicious%20characteristics%2C%20the%20web%20payload%20data%20is%20also%20submitted%20to%20Advanced%20URL%20Filtering%20in%20the%20cloud%20for%20real%2Dtime%20analysis%20and%20generates%20additional%20analysis%20data" rel="nofollow">https://docs.paloaltonetworks.com/advanced-url-filtering/adm...</a>.<p>[2]: <a href="https://docs.paloaltonetworks.com/advanced-url-filtering/administration/url-filtering-features/integrate-with-a-remote-browser-isolation-rbi-provider" rel="nofollow">https://docs.paloaltonetworks.com/advanced-url-filtering/adm...</a>
Ex-PANW here. It's almost certainly the firewall's URL Filtering feature (aka PAN-DB).<p>When someone makes an HTTP request, the firewall takes the host and path from the request and looks them up first in a local cache on the data plane, then in the cloud. (As you can imagine, bypassing the entire feature is therefore trivial for malware. You just open a connection to an arbitrary IP address and put, say, google.com in the host header. As far as the firewall can tell, you are in fact talking to google.com.)<p>When the URL isn't already known to the cloud, or hasn't been visited more recently than its TTL, it goes into a queue to be refreshed by the crawler, which will make its way there shortly thereafter to classify the page.<p>Palo Alto has other URL scanners, but none that would reliably visit the page <i>after</i> the user. URLs carved out of SMTP traffic, for example, would mostly be visited before the real user, not after.
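To make that bypass concrete, here is a minimal sketch (illustrative only: the IP address is a documentation placeholder, and real malware would be less polite). You open a TCP connection to an arbitrary address and claim an allowed hostname in the Host header; filtering that keys off host and path sees traffic to google.com.<p><pre><code>import http.client

# Sketch of the Host-header bypass described above: connect to an
# arbitrary IP (203.0.113.10 is a placeholder) but present an allowed
# domain in the Host header. Classification based on host/path alone
# concludes this is traffic to google.com.
conn = http.client.HTTPConnection("203.0.113.10", 80, timeout=10)
conn.request("GET", "/", headers={"Host": "google.com"})
resp = conn.getresponse()
print(resp.status, resp.reason)
conn.close()
</code></pre>For HTTPS the same trick would also need the SNI value spoofed, but the idea is the same.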
Might as well be a browser extension.<p>I remember setting up a Confluence server that was only used by me but was publicly reachable (still password-protected).<p>When checking the logs, I noticed an external IP trying to access pages I had accessed previously, and getting redirected to the log-in page. The paths were very specific, some of which I had bookmarked, so it was clear that some extension was logging my browsing and a server or person then tried to access my pages.
Could it be a MitM "enterprise browser" like Talon or Island, and/or related browser extensions?<p><a href="https://www.paloaltonetworks.com/company/press/2023/palo-alto-networks--closes-talon-cyber-security-acquisition-and-will-offer-complimentary-enterprise-browser-to-qualified-sase-ai-customers" rel="nofollow">https://www.paloaltonetworks.com/company/press/2023/palo-alt...</a><p><i>> Dec. 28, 2023 Palo Alto Networks .. announced that it has completed the acquisition of Talon Cyber Security, a pioneer of enterprise browser technology ... Talon's Enterprise Browser will provide additional layers of protection against phishing attacks, web-based attacks and malicious browser extensions. Talon also offers extensive controls to help ensure that sensitive data does not escape the confines of the browser.</i><p><a href="https://www.island.io/product" rel="nofollow">https://www.island.io/product</a><p><pre><code> Set hyper-granular policies ... boundaries across all users, devices, apps, networks, locations, & assets
Log any and all browser behavior, review screenshots of critical actions, & trace incidents down to the click
Critical security tools embedded into the browser: like native browser isolation, automatic phishing protection, & web filtering</code></pre>
> Palo Alto Networks ... after reading product page after product page, we couldn’t work out exactly what product it was<p>Well that definitely tracks.
I remember I worked somewhere that had something like this. Most people had Windows machines, but I had a Mac that I had set up myself.<p>My machine wanted me to accept a Palo Alto Networks certificate.<p>I did not, and kept refusing.<p>I think they had some sort of intrusive MITM proxy that filtered everything everyone was doing/browsing.
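If you want to check for that kind of TLS inspection yourself, here's a rough sketch (it assumes the proxy's CA ended up in your trust store, as it would on a managed machine; if you keep refusing it like I did, the handshake simply fails with a verify error, which is its own tell):<p><pre><code>import socket
import ssl

# Look at who issued the certificate a well-known site presents through
# this network. Behind a TLS-inspecting proxy the issuer is the corporate
# or firewall CA, not the public CA that actually signed the site's cert.
ctx = ssl.create_default_context()
with socket.create_connection(("example.com", 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
        issuer = dict(pair[0] for pair in tls.getpeercert()["issuer"])
        print(issuer.get("organizationName"), issuer.get("commonName"))
</code></pre>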
It could be a chat preview generator. Users DM links to some internal project pages in a chat tool, and the tool fetches the page in the background in an attempt to render a preview.
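For reference, a typical unfurler is just a server-side fetch plus Open Graph parsing, along these lines (the bot name and the regex-based parsing are illustrative only):<p><pre><code>import re
import urllib.request

# Sketch of a link-preview ("unfurl") fetch: the chat backend requests
# the shared URL itself, without the user's cookies, and pulls Open
# Graph tags to build a preview card.
def unfurl(url: str) -> dict:
    req = urllib.request.Request(url, headers={"User-Agent": "example-chat-unfurler/1.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Assumes the common property-before-content attribute order.
    return dict(re.findall(r'&lt;meta property="og:(\w+)" content="([^"]*)"', html))

print(unfurl("https://example.com/"))
</code></pre>Some unfurlers render the page in a headless browser instead of parsing tags, which would also explain page scripts re-executing from a data-center IP.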
Same thing happened with my work computer on the office network, which has a MITM HTTPS firewall. The IP address jumps between the coasts at random, confusing the Windows weather widget. Images fail to load on a lot of websites because the IP address change triggers something in their CDN. Everything works fine when I'm WFH, so it has to be the office network.<p>Oh, and this can also happen when a mobile user jumps off their home wifi network onto an internationally roaming data card. Why would they do that? Because data is cheaper that way, or they are actually tourists. So please do not block users just because they are doing this teleportation dance.
Here's my wild guess:<p>Some other code running in the browser window (probably a browser extension, but possibly another script tag in the page, inserted by an intermediate firewall/proxy) is doing this. It could be corporate spyware (i.e. forced on users by the IT department), or an extension that only tends to be used by large institutions (because it relates to some expensive enterprise product). Alternatively, it could be a much more popular browser extension, but it only executes this capture when it determines that the user is within a target list of large institutions.<p>I'm making the same guess as the author about the execution process: that the code is shipping a huge amount of page content to a cloud server, e.g. the full DOM, and then rendering that DOM in this older Chrome version. It's <i>not</i> fetching the same page from the origin server, which is how it's able to do this without auth cookies.<p>As part of rendering, the page's script tags all get executed again, which is why Upollo is seeing this. (Note that I don't know if this re-execution of script tags is deliberate. There's a good chance that it's an unintended side-effect of loading the DOM into Chrome, but it doesn't seem to break anything so nobody's bothered to disable it.)<p>It's only sampling a small percentage of executions, which is why it's not continually happening for every interaction by these users.<p>It's waiting ten seconds so that the page's network interactions are likely to have finished by then. Waiting longer would increase the odds of the user navigating to another page before the code has had a chance to run.<p>The article doesn't say if there are particular kinds of pages being grabbed, but looking for commonality between them would help.<p>The main thing that stumps me – assuming I've understood it correctly – is why the second render is happening across such a diverse set of cloud networks.
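A minimal sketch of that hypothesized pipeline, assuming the server side is something Playwright-like (the capture side would be a line of extension JS shipping document.documentElement.outerHTML to a server; none of this is confirmed, it's just what the observed behaviour could look like):<p><pre><code>from playwright.sync_api import sync_playwright

# Hypothetical server side: load a DOM snapshot captured in the user's
# browser into headless Chrome. set_content() parses the HTML and runs
# any &lt;script&gt; tags it contains, so the page's analytics fire a second
# time, now from a data-center IP and whatever Chromium build the tool
# bundles.
def rerender(dom_snapshot: str) -> None:
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.set_content(dom_snapshot, wait_until="networkidle")
        page.screenshot(path="capture.png")  # or hand the rendered page to a scanner
        browser.close()
</code></pre>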
"Palo Alto Networks" is something that shows up clearer than anything else in my lighttpd logs, as they include the "we're palo alto networks doing research, contact us here(email) for us not to scan" in http request headers. They appear to do full ipv4 range scan many times a day IIRC.<p>Funnily enough I got motivated to try to make my crawler show up the same way in my own server logs by just raw scan breadth, IE by hitting so many servers I'd see my own crawler in the logs without any kind of targeting. As a kind of "planetary level experiment" source of curiosity.<p>Had to tweak masscan settings till my crappy router could keep up with the routing load. Ended up with something like 500 addresses / sec, which pales in comparison to the best hardware used for this which when combined with masscan, scans the ipv4 space in 6 minutes.
Managed to scan 1% of the IPv4 space while I slept before I started to get seriously throttled and got a quite angry email from my ISP. Just told them "Oh, thanks for noticing, I've now fixed the offending device" (pressed Ctrl+C) and never ran the scan again, lol.<p>Ran the scan with masscan with no blacklist. Don't recommend it, at least not more than once, unless you have a good blacklist to follow.
Aren't there systems where a server does the browsing and/or page rendering but it's controlled by terminals using other protocols?<p>Just speculatively, if someone was managing the setup of a room full of NSA analysts browsing for OSINT, how would they cover their tracks? What would that traffic look like?
Sounds like it could easily be the Cisco Umbrella junk I've seen a few gov/university deployments use. They install MITM CAs [0] on managed hardware, so they can definitely see page content.<p>[0] <a href="https://docs.umbrella.com/deployment-umbrella/docs/install-cisco-umbrella-root-certificate" rel="nofollow">https://docs.umbrella.com/deployment-umbrella/docs/install-c...</a><p>Edited to add link to docs.
I spoke with a Palo Alto vendor rep a few months ago. We were talking about the features of the firewall appliance one of my clients was using.<p>They have a feature that effectively "tests" what the user is about to load in a virtual environment and checks whether that content behaves abnormally. I forget what they called it. It sounds like this could be it.
A lot of larger orgs (universities etc.) use Palo Alto Networks' GlobalProtect VPN for accessing the org's intranet.<p>Maybe it's related somehow to that?
> obviously some kind of security system<p>I don't know where the "security" bit comes from, but this is, to me, obviously web scraping.
Could it be a "read it later" type of article reader/storage service? I know of at least one that fits the bill in that it uploads locally-viewed HTML to a server which then renders that page in a headless Chrome instance for archival:<p>I've recently been wondering how Omnivore, unlike e.g. Pocket, is able to store paywalled content (for which I have a subscription) on iOS when saving it via the Omnivore app target in the share sheet, but not when directly pasting the target URL in the webapp or iOS app.<p>Turns out that sharing to an iOS app actually enables [1] the app to run JavaScript <i>in the Safari web context of the displayed page</i>, including cookies and everything!<p>If I'm skimming the client and server source code correctly, it does just that: It seems to serialize and upload the HTML of the page [2] and then invokes Puppeteer on the server [3]. Puppeteer is a scriptable/headless Chrome – that would fit the bill of "an outdated Chrome running in a data center"!<p>Omnivore can also be self-hosted since both client and server are open-source; that would explain you seeing multiple data center IPs.<p>[1] <a href="https://developer.apple.com/library/archive/documentation/General/Conceptual/ExtensibilityPG/ExtensionScenarios.html" rel="nofollow">https://developer.apple.com/library/archive/documentation/Ge...</a><p>[2] <a href="https://github.com/omnivore-app/omnivore/blob/main/apple/Sources/ShareExtension/ShareExtension.js#L108">https://github.com/omnivore-app/omnivore/blob/main/apple/Sou...</a><p>[3] <a href="https://github.com/omnivore-app/omnivore/blob/57aca545388904c77716b808cae03cacc56302e6/packages/api/src/utils/parser.ts#L112">https://github.com/omnivore-app/omnivore/blob/57aca545388904...</a>
I wonder if this could be iCloud Private Relay? It appears that it's effectively a VPN with some redirection layers that change often, though I don't know the exact details.
I'm missing something.<p>> strange devices show up for some of our customers' users<p>> how did it load these pages which were often behind an authwall without ever logging in or having auth cookies?<p>Either:<p>- The customer has screwed up user auth big time and someone out there knows it... let's go with no<p>- OP's data is wrong or they are reading it wrong<p>- They are explaining it badly.