Nice work!<p>Obligatory mention of redbean, the server that you can package along with all your assets (incl. db, scripting and TLS support) into a single multi-platform binary.<p><a href="https://redbean.dev/" rel="nofollow">https://redbean.dev/</a>
Microsoft Internet Explorer (no, I'm not using it personally) had a file format called *.mht that could save an HTML page together with all the files referenced from it, like inline images. I believe you could not store more than one page in one *.mht file, though, so your work could be seen as an extension.<p>Although UNIX philosophy posits that it's good to have many small files, I like your idea for its contribution to reducing clutter (imagine running 'tree' in both scenarios) and also avoiding running out of inodes in some file systems (maybe less of a problem nowadays in general, not sure, as I haven't generated millions of tiny files recently).
Wow! I never knew things like this existed! I always used wget (full command below), but nowadays seemingly all sites are behind Cloudflare, so I need to pass a cookie too.<p>Glad to see easier methods!<p><pre><code> wget \
--header "Cookie: <cf or other>" \
--user-agent="<UA>" \
--recursive \
--level 5 \
--no-clobber \
--page-requisites \
--adjust-extension \
--span-hosts \
--convert-links \
--domains <example.com> \
--no-parent \
<example.com/sub></code></pre>
I'm a big fan of modern JavaScript frameworks, but I don't fancy SSR, so I have been experimenting with doing the crawling myself and uploading the output to hosts without having to do SSR. This is the result.
My 5 cents:<p>- status codes 200-299 are all OK<p>- status codes 300-399 are redirects, and can also end up OK once followed<p>- 403 comes up quite often in my experience; it is frequently not a real error but a hint that your user agent is not welcome<p>- robots.txt should be checked to see whether any resource is off-limits and whether there are crawl-rate requirements. It is always better to be _nice_. My own project is missing this too, and I plan to add it (rough sketch below)<p>- It would be interesting to generate a hash of the content and update a page only if the hash is different?
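To make the robots.txt and hashing points concrete, here is a minimal Python sketch using the stdlib urllib.robotparser and hashlib; the domain, URL and user agent string are placeholders, and the "stored digest" comparison is left as a comment.<p><pre><code>import hashlib
import time
import urllib.request
import urllib.robotparser

AGENT = "my-archiver/0.1"   # placeholder user agent string
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/page"
if rp.can_fetch(AGENT, url):
    time.sleep(rp.crawl_delay(AGENT) or 1)          # honor Crawl-delay if present
    req = urllib.request.Request(url, headers={"User-Agent": AGENT})
    with urllib.request.urlopen(req) as resp:       # urlopen follows 3xx redirects itself
        body = resp.read()
    digest = hashlib.sha256(body).hexdigest()
    # re-archive the page only when the digest differs from the stored one</code></pre>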
How is it different from HTTrack? And what about media extensions: which ones are supported and which aren’t? Sometimes when I download sites with HTTrack, some files just get ignored because by default it only looks for the default types, and you have to add the others manually.
The libwebsockets server (<a href="https://libwebsockets.org" rel="nofollow">https://libwebsockets.org</a>) supports serving directly from zip archives. Furthermore, if a URL is mapped to a compressed archive member, and the browser accepts gzip-compressed responses (as most do), the compressed data is copied from the archive over HTTP to the browser without the server decompressing or converting it. The server does a little bit of header fiddling but otherwise sends the raw bytes to the browser, which decompresses them automatically.
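If anyone wants to play with the same trick outside libwebsockets, here is a rough Python sketch of the idea: pull the raw deflate bytes of a zip member and wrap them in a gzip header and trailer, so they can be sent as-is with Content-Encoding: gzip. The archive path and member name are placeholders.<p><pre><code>import struct, zipfile

def gzip_member_bytes(zip_path, member):
    """Wrap a deflate-compressed zip member in a gzip container without
    recompressing, so it can be served with Content-Encoding: gzip."""
    with open(zip_path, "rb") as f, zipfile.ZipFile(f) as zf:
        info = zf.getinfo(member)
        if info.compress_type != zipfile.ZIP_DEFLATED:
            raise ValueError("member is not deflate-compressed")
        # Skip the 30-byte local file header plus name/extra fields
        # to reach the raw deflate stream.
        f.seek(info.header_offset)
        header = f.read(30)
        name_len, extra_len = struct.unpack("<HH", header[26:30])
        f.seek(info.header_offset + 30 + name_len + extra_len)
        raw = f.read(info.compress_size)
        # gzip = 10-byte header + raw deflate data + CRC32 + uncompressed size
        gz_header = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"
        gz_trailer = struct.pack("<II", info.CRC, info.file_size & 0xFFFFFFFF)
        return gz_header + raw + gz_trailer

# e.g. serve gzip_member_bytes("site.zip", "index.html")
# with headers Content-Encoding: gzip and Content-Type: text/html</code></pre>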
I used to use MAFF (Mozilla Archive Format)[1] a lot back in the day. I was very upset when they ended support for it[2].<p>I never dug deeper into whether I could unzip and decode the packaging, but saving as a simple ZIP does somewhat guarantee future-proofing.<p><i>[1] <a href="https://en.wikipedia.org/wiki/Mozilla_Archive_Format" rel="nofollow">https://en.wikipedia.org/wiki/Mozilla_Archive_Format</a></i><p><i>[2] <a href="https://support.mozilla.org/en-US/questions/1180271" rel="nofollow">https://support.mozilla.org/en-US/questions/1180271</a></i>
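For what it's worth, a .maff file is just a plain ZIP, so something along these lines (the filename is a placeholder) should be enough to get the saved pages back out, assuming the usual one-folder-per-tab layout with an index.rdf metadata file:<p><pre><code>import zipfile

with zipfile.ZipFile("saved_page.maff") as zf:
    zf.printdir()                 # typically one folder per tab, with the page files and index.rdf
    zf.extractall("saved_page")   # the unpacked HTML opens directly in a browser</code></pre>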
I'm curious about this vs a .har file.<p>In Chrome DevTools, Network tab, the last icon that looks like an arrow pointing into a dish (Export HAR file).<p>I guess a .har file has a ton more data, though. I used it to extract data from sites that either intentionally or unintentionally make it hard to get at. For example, when signing up for an apartment, the apartment management site used pdf.js and provided no way to save the PDF. So I saved the .har file and extracted the PDF from it.
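The extraction is basically walking the HAR JSON. A hedged Python sketch, assuming the standard log.entries[].response.content layout (binary bodies are base64-encoded); the file names are placeholders:<p><pre><code>import base64, json

with open("session.har") as f:
    har = json.load(f)

for entry in har["log"]["entries"]:
    content = entry["response"]["content"]
    if content.get("mimeType") == "application/pdf":
        text = content.get("text", "")
        data = (base64.b64decode(text)
                if content.get("encoding") == "base64"
                else text.encode())
        with open("extracted.pdf", "wb") as out:
            out.write(data)
        break</code></pre>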
I like the approach here! Saving to a simple zip file is elegant. I worked on a similar idea years ago [0], but made the mistake of building it as a frontend. In retrospect, I would make this crawl using a headless browser and serve it via a web application, like you're doing.<p>I would love to see better support for SPAs, where we can't just start from a sitemap. If you're interested, you can check out some of the code from my old app for inspiration on how to crawl pages (it's Electron, so it will share a lot of interfaces with Puppeteer) [1].<p>[0] <a href="https://github.com/CGamesPlay/chronicler/tree/master">https://github.com/CGamesPlay/chronicler/tree/master</a>
[1] <a href="https://github.com/CGamesPlay/chronicler/blob/master/src/main/ScrapeRunner.js#L123">https://github.com/CGamesPlay/chronicler/blob/master/src/mai...</a>
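For the SPA case, a minimal sketch of the crawl loop, with Playwright for Python standing in for Puppeteer/Electron; the start URL and the archive-writing step are placeholders:<p><pre><code>from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

START = "https://example.com/"          # placeholder starting page
seen, queue = set(), [START]

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        page.goto(url, wait_until="networkidle")   # let the SPA finish rendering
        html = page.content()                      # snapshot of the rendered DOM
        # ... write html into the zip archive here ...
        links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
        for href in links:
            if urlparse(href).netloc == urlparse(START).netloc:
                queue.append(href.split("#")[0])   # drop fragments, stay on-site
    browser.close()</code></pre>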