
Web scraping via JavaScript runtime heap snapshots

354 points by adriancooney, about 3 years ago

25 comments

anyfactor, about 3 years ago

Very interesting. Can't wait to give it a shot.

I personally use a combination of XPath, basic math, and regex, so this class/id-based defense isn't a major deterrent. A couple of times I did find it a hassle to scrape data embedded in iframes, and I can see that heap snapshots treat iframes differently.

Also, if a website takes extra steps to block web scrapers, identification of elements is never the main problem. It is always IP bans and other security measures.

After all that, I do look forward to using something like this and making a switch to a Node.js-based solution soon. But if you are trying web scraping at scale, reverse engineering should always be your first choice. Not only does it give you a faster solution, it is more ethical (IMO) because you minimize the impact on the site's resources. Rendering a full website's resources is always my last choice.
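The regex-and-math style of extraction the commenter describes might look like this minimal sketch (the HTML snippet and field names are made up for illustration):

```javascript
// Minimal sketch of regex-based extraction. The HTML and the
// "Price:"/"Rating:" labels are hypothetical placeholders.
const html = `
  <div><span>Price:</span> $1,299.00</div>
  <div><span>Rating:</span> 4.7 / 5</div>
`;

// Pull out the price, strip the thousands separator, and parse it.
const priceMatch = html.match(/Price:<\/span>\s*\$([\d,.]+)/);
const price = priceMatch ? Number(priceMatch[1].replace(/,/g, "")) : null;

const ratingMatch = html.match(/Rating:<\/span>\s*([\d.]+)/);
const rating = ratingMatch ? Number(ratingMatch[1]) : null;

console.log(price, rating); // 1299 4.7
```

This is exactly the kind of scraper that breaks when markup changes, which is the fragility the heap-snapshot approach sidesteps.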
scriptsmith, about 3 years ago

In a similar vein, I have found success using request interception [1] on some websites where the HTML and API authentication scheme are unstable, but the API responses themselves are stable.

If you can drive the browser using simple operations like keyboard commands, you can get the underlying data reliably by listening for matching 'response' events and handling the data as it comes in.

[1] https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagesetrequestinterceptionvalue
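The 'response'-event pattern might look roughly like this with Puppeteer. The `/api/items` URL pattern is a hypothetical placeholder, and `scrape` is a sketch that needs a running browser, so only the pure matcher is exercised here:

```javascript
// Decide whether a response URL looks like the JSON API we care about.
// The "/api/items" pattern is a made-up example.
function isTargetResponse(url) {
  return url.includes("/api/items");
}

// Sketch of wiring the matcher into Puppeteer's 'response' event;
// defined but not invoked here because it needs a live browser.
async function scrape(browser, startUrl) {
  const page = await browser.newPage();
  const payloads = [];
  page.on("response", async (response) => {
    if (isTargetResponse(response.url())) {
      payloads.push(await response.json()); // parse each matching API body
    }
  });
  await page.goto(startUrl, { waitUntil: "networkidle0" });
  return payloads;
}

console.log(isTargetResponse("https://example.com/api/items?page=2")); // true
```

Note that no `page.setRequestInterception(true)` is needed just to *read* responses; interception is only required to modify or abort requests.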
superasn, about 3 years ago

Awesome. I wonder if it is possible to create a Chrome extension that works like the Vue devtools and shows the heap and its changes in real time, maybe even allowing edits. That would be amazing for learning / debugging.

> We use the --no-headless argument to boot a windowed Chrome instance (i.e. not headless) because Google can detect and thwart headless Chrome - but that's a story for another time.

Use `puppeteer-extra-plugin-stealth` (1) for such sites. It defeats a lot of bot identification, including reCAPTCHA v3.

(1) https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
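The stealth-plugin setup suggested above follows the pattern from the package's README; this sketch assumes the `puppeteer-extra` and `puppeteer-extra-plugin-stealth` packages are installed, so the launcher is defined but not invoked here:

```javascript
// Sketch of launching a stealth-patched browser; requires the
// puppeteer-extra and puppeteer-extra-plugin-stealth packages,
// so the function is defined but never called in this snippet.
async function launchStealthy(targetUrl) {
  const puppeteer = (await import("puppeteer-extra")).default;
  const StealthPlugin = (await import("puppeteer-extra-plugin-stealth")).default;

  puppeteer.use(StealthPlugin()); // patches common headless fingerprints

  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(targetUrl);
  return { browser, page };
}

console.log(typeof launchStealthy); // "function"
```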
mdaniel, about 3 years ago

That's an exceedingly clever idea, thanks for sharing it!

Please consider adding an actual license text file to your repo, since (a) I don't think GitHub's licensee looks inside package.json, and (b) I bet *most* of the "license" properties in package.json files are "yeah, yeah, whatever" rather than an intentional choice: https://github.com/adriancooney/puppeteer-heap-snapshot/blob/master/package.json#L8 I'm not saying that applies to you, but an explicit license file in the repo would make your wishes clearer.
trinovantes, about 3 years ago

If this catches on, web developers may start employing memory obfuscation techniques the way game developers do:

https://technology.riotgames.com/news/riots-approach-anti-cheat
elbajo, about 3 years ago

Love this approach, thanks for sharing!

I am trying this on a website that Puppeteer has trouble loading, so I took a heap snapshot directly in Chrome. I tried searching for the relevant objects in the Chrome heap viewer, but I don't think its search looks inside objects.

I think your tool would work: "puppeteer-heap-snapshot query -f /tmp/file.heapsnapshot -p property1" - or really any JSON parser, but that requires extra steps. Would you say this is the easiest way to view/debug a heap snapshot?
marmada, about 3 years ago

Wow, this is brilliant. I've sometimes tried to reverse engineer APIs in the past, but this is definitely the next level.

I used to think ML models could be good for scraping too, but this seems better.

I think this plus a network request interception tool (to get data that is embedded into the HTML) could be the future.
kvathupo, about 3 years ago

The article brings up two interesting points for web preservation:

1. The reliance on externally hosted APIs

2. Source code obfuscation

For 1, in order to fully preserve a webpage you'd have to go down the rabbit hole of externally hosted APIs and preserve those as well. For example, sometimes a webpage won't render LaTeX notation because a MathJax endpoint can't be reached. To save such a page, we would need a copy of the MathJax JS too.

For 2, I think WASM makes things more interesting. With WebAssembly, I'd imagine it's much easier to obfuscate source code: a preservationist would need a WASM decompiler for whatever source language was used.
BbzzbB, about 3 years ago

This is great, thanks a lot.

It's my understanding that Playwright is the "new Puppeteer" (even with core devs migrating). I presume this sort of technique would be feasible with Playwright too? Do you think there's any advantage or disadvantage to using one over the other for this use case, or is it basically the same (or am I off base and they're not so interchangeable)?

I'm basing my personal "scraping toolbox" on Scrapy, which I think has decent Playwright integration, hence the question in case I try to reproduce this strategy in Playwright.
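Playwright (Chromium only) exposes the same Chrome DevTools Protocol that Puppeteer uses, so capturing a heap snapshot should be feasible in roughly the same way. This is an untested sketch: the method and event names (`HeapProfiler.takeHeapSnapshot`, `HeapProfiler.addHeapSnapshotChunk`) come from the DevTools Protocol, while `assembleChunks` is just a helper exercised below:

```javascript
// The snapshot arrives as a stream of string chunks over CDP; this
// helper glues them back into one JSON document.
function assembleChunks(chunks) {
  return chunks.join("");
}

// Sketch of taking a heap snapshot from Playwright via a raw CDP
// session; not invoked here because it needs a Chromium instance.
async function takeHeapSnapshot(page) {
  const client = await page.context().newCDPSession(page);
  const chunks = [];
  client.on("HeapProfiler.addHeapSnapshotChunk", (e) => chunks.push(e.chunk));
  await client.send("HeapProfiler.takeHeapSnapshot");
  return assembleChunks(chunks);
}

console.log(assembleChunks(['{"snapshot":', "{}}"])); // {"snapshot":{}}
```

The main trade-off is that raw CDP sessions tie you to Chromium, so the cross-browser portability that is one of Playwright's selling points is lost for this particular technique.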
chrismeller, about 3 years ago

A neat idea for sure. I just wanted to point out that this is why I prefer XPath over CSS selectors.

We all know the display of a page and the structure of a page should be kept separate, so why would you base your selectors on display? Particularly if you're looking for something on a semantically designed page, why would I look for .article, a class that may disappear with the next redesign, when they're unlikely to stop using the article HTML tag?
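In the browser, the tag-based selection the commenter describes can be expressed with `document.evaluate`. This helper is meant to run in a page context (e.g. inside Puppeteer's `page.evaluate`), so only its definition is checked here:

```javascript
// Select the first <h1> inside any <article>, keyed off semantic tags
// rather than presentational class names. Meant to run in a browser
// page context (e.g. via page.evaluate), so it is not invoked here.
function firstArticleHeadline() {
  const result = document.evaluate(
    "//article//h1", // structural XPath; survives a class-name redesign
    document,
    null,
    XPathResult.FIRST_ORDERED_NODE_TYPE,
    null
  );
  return result.singleNodeValue ? result.singleNodeValue.textContent : null;
}

console.log(typeof firstArticleHeadline); // "function"
```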
mwcampbell, about 3 years ago

> Developers no longer need to label their data with class-names or ids - it's only a courtesy to screen readers now.

In general, screen readers don't use class names or IDs. In principle they can, to enable site-specific workarounds for accessibility problems, but of course that's as fragile as scraping. Perhaps you were thinking of semantic HTML tag names and ARIA roles.
invalidname, about 3 years ago

Scraping is inherently fragile due to all the small changes that can happen to the data model as a website evolves. The important thing is to fix these breakages quickly. This article discusses a related approach of debugging such failures directly on the server: https://talktotheduck.dev/debugging-jsoup-java-code-in-production-using-lightrun

It's in Java (using JSoup), but the approach will work for Node, Python, Kotlin, etc. The core idea is to discover the cause of a regression instantly on the server and deploy a fix fast. There are also user-specific regressions in scraping that are again very hard to debug.
leloctai, about 3 years ago

This isn't future-proof at all. Game devs have been using automatic memory obfuscation forever. If this becomes popular, it will take no more than adding a webpack plugin to defeat, with no data structure changes required.
kccqzy, about 3 years ago

Very interesting! I have a feeling this will break if people use the advanced mode of the Closure Compiler, which is able to optimize away object attribute names. Is that not something commonly done anymore?
rvnx, about 3 years ago

Nice, then this won't work anymore.
flockonus, about 3 years ago

Awesome experimentation! I'd be curious how you navigate the heap dump on some real website examples.
lemax, about 3 years ago

I've used a similar technique on some web pages that come back from the server with an intact Redux state object just sitting in a <script> tag. Instead of parsing the HTML, I just pull out the state object. Super.
marwis, about 3 years ago

Sadly, this does not help if the JS code is minified/obfuscated and the data is exchanged over some binary or binary-like protocol such as gRPC. Unfortunately, this is increasingly common.

The only long-term way is to parse the visible text.
BenGosub, about 3 years ago

Is he scraping the heap because the data wasn't present in the HTML, or because the API response present in the heap changes less often than the HTML?
pabs3, about 3 years ago

Seems easy to defeat by deleting objects after generating the HTML or DOM nodes? Although I suppose taking heap snapshots before the deletions would avoid that.
radicality, about 3 years ago

Depending on how exactly the page loads its data, it might be easier to use something like mitmproxy to observe the data flow and intercept it there.
dymk, about 3 years ago

Would this method work if the website obfuscated its HTML with the usual techniques, but also rendered everything server-side?
EastSmith, about 3 years ago

Does anyone know whether a Chrome browser extension has access to heap snapshots?
1vuio0pswjnm7, about 3 years ago

Why doesn't the example chosen, YouTube, use something like Cloudflare "anti-bot" protection or Google reCAPTCHA?

When I request a video page, I can see the JSON in the page without needing to examine a heap snapshot.
Jiger104, about 3 years ago

Really cool approach, great work.