Well this is fun... from the README here I learned I can do this on macOS:<p><pre><code> /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless --incognito --dump-dom https://github.com > /tmp/github.html
</code></pre>
And get an HTML file for a page after the JavaScript has been executed.<p>Wrote up a TIL about this with more details: <a href="https://til.simonwillison.net/chrome/headless" rel="nofollow">https://til.simonwillison.net/chrome/headless</a><p>My own <a href="https://shot-scraper.datasette.io/" rel="nofollow">https://shot-scraper.datasette.io/</a> tool (which uses headless Playwright Chromium under the hood) has a command for this too:<p><pre><code> shot-scraper html https://github.com/ > /tmp/github.html
</code></pre>
But it's neat that you can do it with just Google Chrome installed and nothing else.
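If I'm reading monolith's options right (the -b and -o flags below are recalled from its README and may have changed), you can chain the two: let Chrome execute the JavaScript, then have monolith inline the assets, pointing it at the original URL so relative paths still resolve:<p><pre><code> /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
   --headless --incognito --dump-dom https://github.com > /tmp/github.html
 monolith /tmp/github.html -b https://github.com -o /tmp/github-bundled.html
</code></pre>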
If anyone is interested, I wrote a long blog post where I analyzed all the various ways of saving HTML pages into a single file, starting back in the 90s. It'll answer a lot of questions asked in this thread (MHTML, SingleFile, web archive, etc.)<p><a href="https://www.russellbeattie.com/notes/posts/the-decades-long-html-bundle-quagmire.html" rel="nofollow">https://www.russellbeattie.com/notes/posts/the-decades-long-...</a>
I always ship single-file pages whenever possible. My original reasoning for this was that you should be able to press view source and see everything. (It follows that pages should be reasonably small and readable.)<p>An unexpected side effect is that they are self-contained. You can download pages, drag them onto a browser to use them offline, or reupload them.<p>I used to author the whole HTML file at once, but lately I am fond of TypeScript, and made a simple build system to let me write games in TS and have them built to one HTML file. (The sprites are base64 encoded.)<p>On that note, there is a proposal that would eventually let browsers accept TypeScript syntax, at which point I won't need a compiler / build step anymore. (Sadly they won't do type checking, but hey... baby steps!)
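The build step doesn't have to be anything fancy. A minimal sketch of the idea, assuming esbuild and a single sprite sheet (illustrative names, not my exact setup), looks something like:<p><pre><code> # bundle the TypeScript, base64-encode the sprite sheet, and emit one self-contained HTML file
 esbuild game.ts --bundle --minify --outfile=game.js
 SPRITES=$(base64 < sprites.png | tr -d '\n')
 {
   echo '<!doctype html><meta charset="utf-8"><canvas id="screen"></canvas>'
   echo "<img id=\"sprites\" hidden src=\"data:image/png;base64,${SPRITES}\">"
   echo '<script>'
   cat game.js
   echo '</script>'
 } > game.html
</code></pre>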
How does this compare to SingleFile?<p><a href="https://www.npmjs.com/package/single-file-cli" rel="nofollow">https://www.npmjs.com/package/single-file-cli</a>
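For concreteness, I mean the difference between roughly these two invocations -- as I understand it, SingleFile drives a headless browser so scripts run before capture, while monolith fetches and inlines assets itself (both invocations from memory of the READMEs, so possibly not exact):<p><pre><code> # both from memory -- check each tool's README for the exact flags
 npx single-file-cli https://github.com /tmp/github-singlefile.html
 monolith https://github.com -o /tmp/github-monolith.html
</code></pre>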
Hm, very interesting, especially for bookmarking/archiving.<p>I'm curious, why not use the MHTML standard for this?<p>- AFAIK data URIs have practical length limits that vary per browser. MHTML would enable bundling larger files such as video.<p>- MHTML would avoid transforming meaningful relative URLs into opaque data URIs in the HTML attributes.<p>- MHTML is supported by most major browsers in some way (either natively in Chrome or with an extension in Safari, etc).<p>- MIME defines a standard for putting pure binary data into document parts, so it could avoid the 33% size inflation from base64 encoding. That said, I do not know if the `binary` Content-Transfer-Encoding is widely supported.
I am really loving these 'new' pure Rust tools that are super fast and efficient, with lovely APIs and docs. Ah, it feels like the 90s again... minus 50% of the bugs, probably.
I use read-it-later type services a lot, and save more than I read. On many occasions I've gone back to finally read things and find that the pages no longer exist. I'm thinking moving to some kind of offline archival version would be a better option.
Does anyone know how an entire website can be restored from the Wayback Machine? A beloved website of mine had its database deleted. Everything's on the Internet Archive, but I think I'd have to<p>(1) scrape it manually (they don't seem to let you download an entire site?),<p>(2) write some Python magic to fix the CSS URLs etc. so the site can be reuploaded (and maybe add .html to the URLs? Or just make everything a folder with index.html...)<p>It seems like a fairly common use case, but I could barely find functional scrapers, let alone anything designed to restore the original content in a useful form.
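A hedged sketch of the closest things I've read about but not verified, in case it helps anyone answer:<p><pre><code> # wayback_machine_downloader is a Ruby gem; invocation recalled from its README
 gem install wayback_machine_downloader
 wayback_machine_downloader https://beloved-site.example/
 # the id_ modifier serves the original bytes without the Wayback toolbar injected
 # (the timestamp below is a placeholder)
 wget "https://web.archive.org/web/20200101000000id_/https://beloved-site.example/some-page"
</code></pre>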
It would be awesome to see support for following links to a specified depth, similar to [Httrack](<a href="https://www.httrack.com/" rel="nofollow">https://www.httrack.com/</a>)
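For reference, the HTTrack incantation I have in mind is roughly this (flags from memory, so check httrack --help):<p><pre><code> # -O sets the output directory, -r3 limits link depth to 3, the +pattern keeps the crawl on-site
 httrack "https://example.com/" -O ./mirror -r3 "+*.example.com/*"
</code></pre>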
I wrote something very similar a few years ago – <a href="https://github.com/arp242/singlepage">https://github.com/arp242/singlepage</a><p>I mostly use it for a few Go programs where I generate HTML; I can "just" use links to external stylesheets and JavaScript because that's more convenient to work with, and then process it to produce a single HTML file.
Does anyone remember the Firefox extension Scrapbook, from "back in the day"? I used to use it a lot.<p>Look "back" 5-10 years, or more, and it's striking how many web resources are no longer available. A local copy is your only insurance. And even then, having it in an open, standards-compliant format is important (e.g. a file you can load into a browser -- either a current browser or a containerized/emulated one from the era of the archived resource).<p>Something that concerns me is JavaScript-heavy resources and the like: potentially unlimited complexity that makes local copies more challenging, and perhaps untenable.
Related:<p><i>Show HN: CLI tool for saving web pages as a single file</i> - <a href="https://news.ycombinator.com/item?id=20774322">https://news.ycombinator.com/item?id=20774322</a> - August 2019 (209 comments)
How would I archive an on-prem hosted Redmine instance (<a href="https://www.redmine.org/" rel="nofollow">https://www.redmine.org/</a>)? It is many, many years old, and I want to abandon it for good but save everything and archive it first. Is that possible with monolith?
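If monolith turns out to be page-at-a-time only, I assume the fallback is a wget mirror along these lines (the hostname and cookie file below are placeholders, and Redmine will want a logged-in session for anything private):<p><pre><code> wget --mirror --page-requisites --convert-links --adjust-extension --no-parent \
      --load-cookies cookies.txt https://redmine.internal.example/
</code></pre>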
Ironically, I decided to try it on the repo's own GitHub page, and when I open the resulting HTML file in Chrome, the console is full of errors and I don't see the `README` or anything.
A cool tool, to be sure.<p>However, I feel this tool is a crutch for the stupid way browsers save web pages, and it shouldn't be necessary in a sane world.<p>Instead of the bullshit browsers do, where they save a page as a "blah.html" file + a "blah_files" folder, they should wrap both in a single folder that can later be moved/copied as one unit while still benefiting from its subcomponents being easily accessed / picked apart as desired.
So what happens if the page is behind a paywall and the embedded JavaScript contains some authentication or phone-home code? Does that end up getting invoked in the monolith copy of the HTML?<p>I'm wondering how this would work if I wanted to use it to, say, save a quiz from Udemy for offline review.
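If monolith still has the option I remember for stripping JavaScript from the output, that would at least sidestep the phone-home part, at the cost of anything the scripts render (flag from memory; verify with monolith --help):<p><pre><code> # -j removes JavaScript from the saved copy (flag recalled from the README; URL is a placeholder)
 monolith https://example.com/some-course-page -j -o offline-copy.html
</code></pre>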
Or perhaps wget[0] as described here[1] and documented here[2] could do the trick.<p>0 - <a href="https://www.gnu.org/software/wget/" rel="nofollow">https://www.gnu.org/software/wget/</a><p>1 - <a href="https://tinkerlog.dev/journal/downloading-a-webpage-and-all-of-its-assets-with-wget" rel="nofollow">https://tinkerlog.dev/journal/downloading-a-webpage-and-all-...</a><p>2 - <a href="https://www.gnu.org/software/wget/manual/wget.html" rel="nofollow">https://www.gnu.org/software/wget/manual/wget.html</a>
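The recipe those pages describe boils down to something like this (the exact flag set in the article may differ):<p><pre><code> # -E rename to .html, -H allow assets on other hosts, -k rewrite links for local viewing,
 # -p fetch page requisites (CSS, images, scripts)
 wget -E -H -k -p https://example.com/
</code></pre>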