Well this is fun... from the README here I learned I can do this on macOS:<p><pre><code> /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless --incognito --dump-dom https://github.com > /tmp/github.html
</code></pre>
And get an HTML file for a page after the JavaScript has been executed.<p>Wrote up a TIL about this with more details: <a href="https://til.simonwillison.net/chrome/headless" rel="nofollow">https://til.simonwillison.net/chrome/headless</a><p>My own <a href="https://shot-scraper.datasette.io/" rel="nofollow">https://shot-scraper.datasette.io/</a> tool (which uses headless Playwright Chromium under the hood) has a command for this too:<p><pre><code> shot-scraper html https://github.com/ > /tmp/github.html
</code></pre>
But it's neat that you can do it with just Google Chrome installed and nothing else.
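If I'm reading monolith's options right (the -b and -o flags below are recalled from its README and may have changed), you can chain the two: let Chrome execute the JavaScript, then have monolith inline the assets, pointing it at the original URL so relative paths still resolve:<p><pre><code> /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
   --headless --incognito --dump-dom https://github.com > /tmp/github.html
 monolith /tmp/github.html -b https://github.com -o /tmp/github-bundled.html
</code></pre>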
If anyone is interested, I wrote a long blog post where I analyzed all the various ways of saving HTML pages into a single file, starting back in the 90s. It'll answer a lot of questions asked in this thread (MHTML, SingleFile, web archive, etc.)<p><a href="https://www.russellbeattie.com/notes/posts/the-decades-long-html-bundle-quagmire.html" rel="nofollow">https://www.russellbeattie.com/notes/posts/the-decades-long-...</a>
I always ship single-file pages whenever possible. My original reasoning for this was that you should be able to press view source and see everything. (It follows that pages should be reasonably small and readable.)<p>An unexpected side effect is that they are self-contained. You can download pages, drag them onto a browser to use them offline, or reupload them.<p>I used to author the whole HTML file at once, but lately I am fond of TypeScript, and made a simple build system to let me write games in TS and have them built to one HTML file. (The sprites are base64 encoded.)<p>On that note, there is a proposal that would eventually let browsers accept TypeScript syntax, at which point I won't need a compiler / build step anymore. (Sadly they won't do type checking, but hey... baby steps!)
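The build step doesn't have to be anything fancy. A minimal sketch of the idea, assuming esbuild and a single sprite sheet (illustrative names, not my exact setup), looks something like:<p><pre><code> # bundle the TypeScript, base64-encode the sprite sheet, and emit one self-contained HTML file
 esbuild game.ts --bundle --minify --outfile=game.js
 SPRITES=$(base64 < sprites.png | tr -d '\n')
 {
   echo '<!doctype html><meta charset="utf-8"><canvas id="screen"></canvas>'
   echo "<img id=\"sprites\" hidden src=\"data:image/png;base64,${SPRITES}\">"
   echo '<script>'
   cat game.js
   echo '</script>'
 } > game.html
</code></pre>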
How does this compare to SingleFile?<p><a href="https://www.npmjs.com/package/single-file-cli" rel="nofollow">https://www.npmjs.com/package/single-file-cli</a>
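For concreteness, I mean the difference between roughly these two invocations -- as I understand it, SingleFile drives a headless browser so scripts run before capture, while monolith fetches and inlines assets itself (both invocations from memory of the READMEs, so possibly not exact):<p><pre><code> # both from memory -- check each tool's README for the exact flags
 npx single-file-cli https://github.com /tmp/github-singlefile.html
 monolith https://github.com -o /tmp/github-monolith.html
</code></pre>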
Hm, very interesting, especially for bookmarking/archiving.<p>I'm curious, why not use the MHTML standard for this?<p>- AFAIK data URIs have practical length limits that vary per browser. MHTML would enable bundling larger files such as video.<p>- MHTML would avoid transforming meaningful relative URLs into opaque data URIs in the HTML attributes.<p>- MHTML is supported by most major browsers in some way (either natively in Chrome or with an extension in Safari, etc).<p>- MIME defines a standard for putting pure binary data into document parts, so it could avoid the 33% size inflation from base64 encoding. That said, I do not know if the `binary` Content-Transfer-Encoding is widely supported.
I am really loving these 'new' pure Rust tools that are super fast and efficient, with lovely APIs and docs. Ah, it feels like the 90s again... minus 50% of the bugs, probably.
I use read-it-later type services a lot, and save more than I read. On many occasions I've gone back to finally read things and find that the pages no longer exist. I'm thinking moving to some kind of offline archival version would be a better option.
Does anyone know how an entire website can be restored from the Wayback Machine? A beloved website of mine had its database deleted. Everything's on the Internet Archive, but I think I'd have to<p>(1) scrape it manually (they don't seem to let you download an entire site?),<p>(2) write some Python magic to fix the CSS URLs etc. so the site can be reuploaded (and maybe add .html to the URLs? Or just make everything a folder with index.html...)<p>It seems like a fairly common use case, but I could barely find functional scrapers, let alone anything designed to restore the original content in a useful form.
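A hedged sketch of the closest things I've read about but not verified, in case it helps anyone answer:<p><pre><code> # wayback_machine_downloader is a Ruby gem; invocation recalled from its README
 gem install wayback_machine_downloader
 wayback_machine_downloader https://beloved-site.example/
 # the id_ modifier serves the original bytes without the Wayback toolbar injected
 # (the timestamp below is a placeholder)
 wget "https://web.archive.org/web/20200101000000id_/https://beloved-site.example/some-page"
</code></pre>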
It would be awesome to see support for following links to a specified depth, similar to [Httrack](<a href="https://www.httrack.com/" rel="nofollow">https://www.httrack.com/</a>)
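For reference, the HTTrack incantation I have in mind is roughly this (flags from memory, so check httrack --help):<p><pre><code> # -O sets the output directory, -r3 limits link depth to 3, the +pattern keeps the crawl on-site
 httrack "https://example.com/" -O ./mirror -r3 "+*.example.com/*"
</code></pre>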
I wrote something very similar a few years ago – <a href="https://github.com/arp242/singlepage">https://github.com/arp242/singlepage</a><p>I mostly use it for a few Go programs where I generate HTML; I can "just" use links to external stylesheets and JavaScript because that's more convenient to work with, and then process it to produce a single HTML file.
Does anyone remember the Firefox extension Scrapbook, from "back in the day"? I used to use it a lot.<p>Look "back" 5-10 years, or more, and it's striking how many web resources are no longer available. A local copy is your only insurance. And even then, having it in an open, standards-compliant format is important (e.g. a file you can load into a browser -- either a current browser or a containerized/emulated one from the era of the archived resource).<p>Something that concerns me is JavaScript-heavy resources and the like: potentially unlimited complexity that makes local copies more challenging, and perhaps untenable.
Related:<p><i>Show HN: CLI tool for saving web pages as a single file</i> - <a href="https://news.ycombinator.com/item?id=20774322">https://news.ycombinator.com/item?id=20774322</a> - August 2019 (209 comments)
How would I archive an on-prem hosted Redmine instance (<a href="https://www.redmine.org/" rel="nofollow">https://www.redmine.org/</a>)? It is many, many years old, and I want to abandon it for good but save everything and archive it first. Is that possible with monolith?
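If monolith turns out to be page-at-a-time only, I assume the fallback is a wget mirror along these lines (the hostname and cookie file below are placeholders, and Redmine will want a logged-in session for anything private):<p><pre><code> wget --mirror --page-requisites --convert-links --adjust-extension --no-parent \
      --load-cookies cookies.txt https://redmine.internal.example/
</code></pre>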
Ironically, I decided to try it on the repo's own GitHub page, and when I open the resulting HTML file in Chrome, the console is full of errors and I don't see the `README` or anything.
A cool tool, to be sure.<p>However, I feel this tool is a crutch for the stupid way browsers save web pages, and it shouldn't be necessary in a sane world.<p>Instead of the bullshit browsers do, where they save a page as a "blah.html" file + a "blah_files" folder, they should wrap both in a single folder that can later be moved/copied as one unit while still benefiting from its subcomponents being easily accessed / picked apart as desired.
So what happens if the page is behind a paywall and the embedded JavaScript contains some authentication or phone-home code? Does that end up getting invoked in the monolith copy of the HTML?<p>I'm wondering how this would work if I wanted to use it to, say, save a quiz from Udemy for offline review.
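If monolith still has the option I remember for stripping JavaScript from the output, that would at least sidestep the phone-home part, at the cost of anything the scripts render (flag from memory; verify with monolith --help):<p><pre><code> # -j removes JavaScript from the saved copy (flag recalled from the README; URL is a placeholder)
 monolith https://example.com/some-course-page -j -o offline-copy.html
</code></pre>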
Or perhaps wget[0] as described here[1] and documented here[2] could do the trick.<p>0 - <a href="https://www.gnu.org/software/wget/" rel="nofollow">https://www.gnu.org/software/wget/</a><p>1 - <a href="https://tinkerlog.dev/journal/downloading-a-webpage-and-all-of-its-assets-with-wget" rel="nofollow">https://tinkerlog.dev/journal/downloading-a-webpage-and-all-...</a><p>2 - <a href="https://www.gnu.org/software/wget/manual/wget.html" rel="nofollow">https://www.gnu.org/software/wget/manual/wget.html</a>
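The recipe those pages describe boils down to something like this (the exact flag set in the article may differ):<p><pre><code> # -E rename to .html, -H allow assets on other hosts, -k rewrite links for local viewing,
 # -p fetch page requisites (CSS, images, scripts)
 wget -E -H -k -p https://example.com/
</code></pre>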