Nice work!<p>Obligatory mention of redbean, the server that you can package along with all your assets (incl. db, scripting and TLS support) into a single multi-platform binary.<p><a href="https://redbean.dev/" rel="nofollow">https://redbean.dev/</a>
Microsoft Internet Explorer (no, I'm not using it personally) had a file format called *.mht that could save an HTML page together with all the files referenced from it, like inline images. I believe you could not store more than one page in one *.mht file, though, so your work could be seen as an extension.<p>Although UNIX philosophy posits that it's good to have many small files, I like your idea for its contribution to reducing clutter (imagine running 'tree' in both scenarios) and also avoiding running out of inodes in some file systems (maybe less of a problem nowadays in general, not sure, as I haven't generated millions of tiny files recently).
Wow! I never knew things like this existed! I always used wget (full command below), but nowadays seemingly all sites are behind Cloudflare, so I need to pass a cookie too.<p>Glad to see easier methods!<p><pre><code> wget \
--header "Cookie: <cf or other>" \
--user-agent="<UA>" \
--recursive \
--level 5 \
--no-clobber \
--page-requisites \
--adjust-extension \
--span-hosts \
--convert-links \
--domains <example.com> \
--no-parent \
<example.com/sub></code></pre>
I'm a big fan of modern JavaScript frameworks, but I don't fancy SSR, so I have been experimenting with doing the crawling myself and uploading the output to hosts without having to do SSR. This is the result.
My 5 cents:<p>- status codes 200-299 are all OK<p>- status codes 300-399 are redirects, and can also end up OK once followed<p>- 403 comes up quite often in my experience; it is frequently not a real error but a hint that your user agent is not welcome<p>- robots.txt should be checked to see whether any resource is off-limits and whether there are crawl-rate requirements. It is always better to be _nice_. My own project is missing this too, and I plan to add it (rough sketch below)<p>- It would be interesting to generate a hash of the content and update a page only if the hash is different?
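To make the robots.txt and hashing points concrete, here is a minimal Python sketch using the stdlib urllib.robotparser and hashlib; the domain, URL and user agent string are placeholders, and the "stored digest" comparison is left as a comment.<p><pre><code>import hashlib
import time
import urllib.request
import urllib.robotparser

AGENT = "my-archiver/0.1"   # placeholder user agent string
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/page"
if rp.can_fetch(AGENT, url):
    time.sleep(rp.crawl_delay(AGENT) or 1)          # honor Crawl-delay if present
    req = urllib.request.Request(url, headers={"User-Agent": AGENT})
    with urllib.request.urlopen(req) as resp:       # urlopen follows 3xx redirects itself
        body = resp.read()
    digest = hashlib.sha256(body).hexdigest()
    # re-archive the page only when the digest differs from the stored one</code></pre>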
How is it different from HTTrack? And what about media extensions: which ones are supported and which aren’t? Sometimes when I download sites with HTTrack, some files just get ignored because by default it only looks for the default types, and you have to add the others manually.
The libwebsockets server (<a href="https://libwebsockets.org" rel="nofollow">https://libwebsockets.org</a>) supports serving directly from zip archives. Furthermore, if a URL is mapped to a compressed archive member, and the browser accepts gzip-compressed responses (as most do), the compressed data is copied from the archive over HTTP to the browser without the server decompressing or converting it. The server does a little bit of header fiddling but otherwise sends the raw bytes to the browser, which decompresses them automatically.
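If anyone wants to play with the same trick outside libwebsockets, here is a rough Python sketch of the idea: pull the raw deflate bytes of a zip member and wrap them in a gzip header and trailer, so they can be sent as-is with Content-Encoding: gzip. The archive path and member name are placeholders.<p><pre><code>import struct, zipfile

def gzip_member_bytes(zip_path, member):
    """Wrap a deflate-compressed zip member in a gzip container without
    recompressing, so it can be served with Content-Encoding: gzip."""
    with open(zip_path, "rb") as f, zipfile.ZipFile(f) as zf:
        info = zf.getinfo(member)
        if info.compress_type != zipfile.ZIP_DEFLATED:
            raise ValueError("member is not deflate-compressed")
        # Skip the 30-byte local file header plus name/extra fields
        # to reach the raw deflate stream.
        f.seek(info.header_offset)
        header = f.read(30)
        name_len, extra_len = struct.unpack("<HH", header[26:30])
        f.seek(info.header_offset + 30 + name_len + extra_len)
        raw = f.read(info.compress_size)
        # gzip = 10-byte header + raw deflate data + CRC32 + uncompressed size
        gz_header = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"
        gz_trailer = struct.pack("<II", info.CRC, info.file_size & 0xFFFFFFFF)
        return gz_header + raw + gz_trailer

# e.g. serve gzip_member_bytes("site.zip", "index.html")
# with headers Content-Encoding: gzip and Content-Type: text/html</code></pre>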
I used to use MAFF (Mozilla Archive Format)[1] a lot back in the day. I was very upset when they ended support for it[2].<p>I never dug deeper into whether I could unzip and decode the packaging, but saving as a simple ZIP does somewhat guarantee future-proofing.<p><i>[1] <a href="https://en.wikipedia.org/wiki/Mozilla_Archive_Format" rel="nofollow">https://en.wikipedia.org/wiki/Mozilla_Archive_Format</a></i><p><i>[2] <a href="https://support.mozilla.org/en-US/questions/1180271" rel="nofollow">https://support.mozilla.org/en-US/questions/1180271</a></i>
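For what it's worth, a .maff file is just a plain ZIP, so something along these lines (the filename is a placeholder) should be enough to get the saved pages back out, assuming the usual one-folder-per-tab layout with an index.rdf metadata file:<p><pre><code>import zipfile

with zipfile.ZipFile("saved_page.maff") as zf:
    zf.printdir()                 # typically one folder per tab, with the page files and index.rdf
    zf.extractall("saved_page")   # the unpacked HTML opens directly in a browser</code></pre>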
I'm curious about this vs a .har file.<p>In Chrome DevTools, Network tab, the last icon that looks like an arrow pointing into a dish (Export HAR file).<p>I guess a .har file has a ton more data, though. I used it to extract data from sites that either intentionally or unintentionally make it hard to get at. For example, when signing up for an apartment, the apartment management site used pdf.js and provided no way to save the PDF. So I saved the .har file and extracted the PDF from it.
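The extraction is basically walking the HAR JSON. A hedged Python sketch, assuming the standard log.entries[].response.content layout (binary bodies are base64-encoded); the file names are placeholders:<p><pre><code>import base64, json

with open("session.har") as f:
    har = json.load(f)

for entry in har["log"]["entries"]:
    content = entry["response"]["content"]
    if content.get("mimeType") == "application/pdf":
        text = content.get("text", "")
        data = (base64.b64decode(text)
                if content.get("encoding") == "base64"
                else text.encode())
        with open("extracted.pdf", "wb") as out:
            out.write(data)
        break</code></pre>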
I like the approach here! Saving to a simple zip file is elegant. I worked on a similar idea years ago [0], but made the mistake of building it as a frontend. In retrospect, I would make this crawl using a headless browser and serve it via a web application, like you're doing.<p>I would love to see better support for SPAs, where we can't just start from a sitemap. If you're interested, you can check out some of the code from my old app for inspiration on how to crawl pages (it's Electron, so it will share a lot of interfaces with Puppeteer) [1].<p>[0] <a href="https://github.com/CGamesPlay/chronicler/tree/master">https://github.com/CGamesPlay/chronicler/tree/master</a>
[1] <a href="https://github.com/CGamesPlay/chronicler/blob/master/src/main/ScrapeRunner.js#L123">https://github.com/CGamesPlay/chronicler/blob/master/src/mai...</a>
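For the SPA case, a minimal sketch of the crawl loop, with Playwright for Python standing in for Puppeteer/Electron; the start URL and the archive-writing step are placeholders:<p><pre><code>from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

START = "https://example.com/"          # placeholder starting page
seen, queue = set(), [START]

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        page.goto(url, wait_until="networkidle")   # let the SPA finish rendering
        html = page.content()                      # snapshot of the rendered DOM
        # ... write html into the zip archive here ...
        links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
        for href in links:
            if urlparse(href).netloc == urlparse(START).netloc:
                queue.append(href.split("#")[0])   # drop fragments, stay on-site
    browser.close()</code></pre>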