Whee, a topic close to my heart!<p><a href="https://github.com/fake-name/ReadableWebProxy" rel="nofollow">https://github.com/fake-name/ReadableWebProxy</a> is a project of mine that started out as a simple rewriting proxy, but at this point is basically a self-contained archival system for entire websites, complete with preserved historical versions of scraped content. It has a distributed fetching frontend[1], uses chromium[2] to optionally deal with internet-breaking bullshit (Helloooo cloudflare! Fuuuuucccckkkkk yyyyooouuuuuuu), supports multiple archival modes (raw, i.e. not rewritten and not destyled, and a rewritten format which makes reading internet text content actually nice), and a bunch of other stuff. The links in fetched content are rewritten to point within the archiver, and if content is not already retrieved, it's fetched on-the-fly as you browse.<p>It also has plugin-based content rewriting features, allowing the complete reformatting of content on-the-fly, and it functions as a backend to a bunch of other projects (I run a translated light-novel/web-novel tracker site, and it also does the RSS parsing for that).<p>I've been occasionally meaning to add WARC forwarding to the frontend, and feed that into the Internet Archive, but the fetching frontend is old, creaky and brittle (it's some old code), and does a lot of esoteric stuff that would be hard to replicate.<p>[1]: <a href="https://github.com/fake-name/AutoTriever" rel="nofollow">https://github.com/fake-name/AutoTriever</a>
[2]: <a href="https://github.com/fake-name/ChromeController" rel="nofollow">https://github.com/fake-name/ChromeController</a>
Things that attempt to rewrite links and inline css and javascript are doomed to fail. Many sites do weird javascript shenanigans, and without a million special cases, you'll never make it work reliably. Just try archiving your facebook news feed and let me know how it goes.<p>Instead, archivists should try to record the exact data sent between the server and a real browser, and then save that in a cache. Then, when viewing the archive, use the same browser and replay the same data, and you should see the exact same thing! With small tweaks to make everything deterministic (disallow true randomness in javascript, set the date and time back to the archiving date so SSL certs are still valid), this method can never 'bit rot'.<p>When technology moves on and you can no longer run the browser and proxy, you wrap it all up in a virtual machine and run it like that. Emulation has preserved game console software nearly perfectly for ~40 years now, which is far better than pretty much any other approach.
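To make the record/replay idea concrete, here's a minimal sketch using an off-the-shelf recording proxy like mitmproxy (assuming the browser is pointed at the proxy and trusts its CA cert for HTTPS; the determinism tweaks are a separate problem on top of this):<p><pre><code> # record: save every request/response the browser makes through the proxy
 mitmdump -w site-archive.flows

 # later, replay: answer the browser's requests from the recording
 # instead of touching the live internet
 mitmdump --server-replay site-archive.flows
</code></pre>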
In the realm of scraping and page archiving, I'd like to note a library I found useful recently, called `freeze-dry` [0][1]. It packages a page into a SINGLE HTML file, inlining relevant styles. The objective is to replicate the exact look and structure of the page rather than all of its interactive behaviour. Very useful for building a training dataset for any algorithms that read web pages.<p>[0]: <a href="https://www.npmjs.com/package/freeze-dry" rel="nofollow">https://www.npmjs.com/package/freeze-dry</a><p>[1]: <a href="https://github.com/WebMemex/freeze-dry" rel="nofollow">https://github.com/WebMemex/freeze-dry</a>
I’ve been using<p><pre><code> wget -E -H -k -K -nd -N -p -P pageslug URL
</code></pre>
for some time now and never had any issues with it.
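For reference, here's what each of those flags does (long forms from the wget man page):<p><pre><code> # -E   --adjust-extension   save HTML/CSS with matching file extensions
 # -H   --span-hosts         also fetch requisites hosted on other domains
 # -k   --convert-links      rewrite links so the local copy works offline
 # -K   --backup-converted   keep the original file next to the converted one
 # -nd  --no-directories     don't recreate the remote directory tree
 # -N   --timestamping       only re-download files newer than the local copy
 # -p   --page-requisites    grab the images, CSS and JS needed to render the page
 # -P   --directory-prefix   directory to save everything into (here: pageslug)
 wget -E -H -k -K -nd -N -p -P pageslug URL
</code></pre>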
I created a .bash_aliases entry so that now I only have to type<p><pre><code> war pageslug URL
</code></pre>
to archive some website.<p>I haven’t archived too many websites (I focus more on media files like videos, ebooks and such), which is probably why I haven’t run into any issues yet, but I’d be interested if somebody has a link that doesn’t work with this method, just so I can see what the result would be like.<p>Here’s an explanation of the method I use for anyone interested:
<a href="https://gist.github.com/dannguyen/03a10e850656577cfb57" rel="nofollow">https://gist.github.com/dannguyen/03a10e850656577cfb57</a>
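As for the .bash_aliases entry itself, it's just a one-liner along these lines (since -P takes the following argument, the alias can simply stop right before it):<p><pre><code> alias war='wget -E -H -k -K -nd -N -p -P'
</code></pre>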
Archiving one's web browsing trail seems to be a common use case. Here are some promising related projects that have been on HN:<p>* <a href="https://github.com/pirate/bookmark-archiver" rel="nofollow">https://github.com/pirate/bookmark-archiver</a><p>* <a href="https://getpolarized.io/" rel="nofollow">https://getpolarized.io/</a>
I've got a set of small repos of government sites that I've snapshotted with a combination of `wget`, `curl`, and other shell commands, mostly so I can have a reliable mirror when teaching web scraping: <a href="https://github.com/wgetsnaps" rel="nofollow">https://github.com/wgetsnaps</a><p>But as the submitted article points out, archiving the Web is much trickier these days, and wget is no longer sufficient for anything relatively modern. I've been impressed with what the Internet Archive has seemingly been able to do, and I've been interested in whether it's the result of improved techniques on their side, or of certain sites following a standard that happens to make them more archivable.<p>For example, 538's 2018 election trackers are very JS-dependent, yet IA has managed to capture them in a way that not only preserves the look and content, but keeps their widgets and visualizations almost fully functional:<p><a href="https://web.archive.org/web/20181102125134/https://projects.fivethirtyeight.com/2018-midterm-election-forecast/house/" rel="nofollow">https://web.archive.org/web/20181102125134/https://projects....</a><p>However, even this excellent archive of 538's site shows a huge weakness in IA's efforts: IA (quite understandably) aggressively caches a site's dependencies, such as external JS and JSON data files. If you scroll down the 538 example posted above, you'll see that despite being a snapshot from Nov. 2, 2018, many of its widgets only contain data from the last time IA fetched its external dependencies, which appears to be August 16, 2018.
This post doesn't mention the Webrecorder Player[1], which is a GUI app that displays WARC files. It's probably the easiest way to view Web ARChives.<p>For those willing to set up a docker container, check out Warcworker[2].<p>[1] <a href="https://github.com/webrecorder/webrecorder-player" rel="nofollow">https://github.com/webrecorder/webrecorder-player</a><p>[2] <a href="https://github.com/peterk/warcworker" rel="nofollow">https://github.com/peterk/warcworker</a>
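If you just want a quick look inside a WARC from the command line (no GUI or docker), the warcio Python package also ships a small CLI; if I recall right, something like this prints a one-line JSON summary per record:<p><pre><code> pip install warcio
 warcio index archive.warc.gz
</code></pre>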
Disclaimer: My company works with Teyit and I've built the archiving product. Also: this is a shameless plug.<p>Teyit.org[0], the biggest fact-checking organization in Turkey, has their own archiving site called teyit.link[1].<p>It's a non-profit organization and they automatically archive any link that's sent to them via their site, Twitter, Facebook, etc. It's also usable by the public.<p>It's open source on GitHub[2], and we've actually been developing a new version[3], with a plan to add `youtube-dl` alongside WARC (a rough sketch of the youtube-dl side is below).<p>[0] <a href="https://teyit.org" rel="nofollow">https://teyit.org</a><p>[1] <a href="https://www.teyit.link" rel="nofollow">https://www.teyit.link</a><p>[2] <a href="https://github.com/teyit/teyitlink-web" rel="nofollow">https://github.com/teyit/teyitlink-web</a><p>[3] <a href="https://github.com/noddigital/teyit.link" rel="nofollow">https://github.com/noddigital/teyit.link</a>
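The youtube-dl side would look roughly like this (just a sketch, not the final integration; the output template is whatever we end up settling on):<p><pre><code> # grab the video plus its metadata, to sit next to the WARC of the page itself
 youtube-dl --write-info-json --write-thumbnail -o '%(title)s-%(id)s.%(ext)s' "$URL"
</code></pre>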
gwern has a very involved post on archiving as well, <a href="https://www.gwern.net/Archiving-URLs" rel="nofollow">https://www.gwern.net/Archiving-URLs</a><p>Somewhere on my to-do list is archiving everything I visit on the internet. It's frustrating to know that I've seen something, but be unable to find it again.
I think you guys might like this personal web archival tool I launched about a month ago:<p><a href="https://getpolarized.io/" rel="nofollow">https://getpolarized.io/</a><p>It's basically an offline browser where you can capture full HTML pages locally, including iframes, and tag and annotate the content.<p>I should have cloud sync support in the next release (1-2 weeks), which will allow you to keep your data in the cloud and sync it between machines. Initially it will just support Firebase, but I have plans to support other cloud providers via plugins.<p>I'd also like to support end-to-end encryption so that you don't have to worry about people reading your data.<p>There's a huge Hacker News thread about Polar here:<p><a href="https://news.ycombinator.com/item?id=18219960" rel="nofollow">https://news.ycombinator.com/item?id=18219960</a><p>A semi-requested feature is full recursive archival of content, but I don't think we're going to go in that direction. Instead I think we're going to support pasting or importing a list of URLs.<p>Many documentation sites have an index or table of contents, and this way I can just fetch and store all those URLs without over-fetching.<p>My background is search, and I built a petabyte-scale search service named Datastreamer (<a href="http://www.datastreamer.io/" rel="nofollow">http://www.datastreamer.io/</a>). I'm also one of the inventors of RSS, so I have a lot of ideas on the roadmap here.<p>It also supports PDFs, text and area highlights, comments, flashcards and sync with Anki.<p>The initial response after our release has been amazing. The user base is really engaged, with thousands of monthly active users and contributors.<p>Anyway. Take it for a spin. It's free and Open Source.
I recently had to do this and, after a lot of frustration with wget, httrack and some commercial tools too, I ended up settling on the results of this free product, WebCopy.<p><a href="https://www.cyotek.com/cyotek-webcopy" rel="nofollow">https://www.cyotek.com/cyotek-webcopy</a><p>Background: we couldn't keep the existing platform running, so we had to transition to static html files.<p>I used the WebCopy scan log to create the apache rewrite rules to preserve the existing link structure (sketch of that step below).<p>Where WebCopy was better was this simple log, but also the file structure it produced, which was much cleaner, with fewer junk pages and duplicates (the site was an absolute, inconsistent mess to begin with).
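For anyone doing something similar: if you can boil the scan log down to one original URL path per line (a big assumption; adjust to whatever your log actually looks like), generating the rewrite rules is a one-liner along these lines:<p><pre><code> # /news/some-article  ->  RewriteRule ^news/some-article/?$ /news/some-article.html [L]
 sed 's|^/\(.*\)$|RewriteRule ^\1/?$ /\1.html [L]|' scanlog.txt > rewrites.conf
</code></pre>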
I am of the opinion that anything you might need to read two or three times over a longer period of time should be copied locally, or to some service, for later retrieval.<p>I use Pocket a lot, and "share" into it from different devices.<p>But sometimes I just do "wget --mirror -np -k --limit-rate=10k <a href="https://interesting.stuff.here.org/" rel="nofollow">https://interesting.stuff.here.org/</a>" on a PC.<p>I'd actually like a self-hosted variant of Pocket, but have not really researched whether those even exist. Anyone with a suggestion?
Thanks for the mention of bookmark-archiver! WARC support has been high on my list for a long time, but unfortunately I have a day job that keeps me super busy.<p>Also, the author, Antoine Beaupré, is an engineer living in Montreal who works on mesh networking stuff. Are we the same person?! I just sent him an email to make sure it doesn't land in my own inbox...
author here. AMA.<p>since I wrote this, I started experimenting with grab-site:<p><a href="https://github.com/ludios/grab-site" rel="nofollow">https://github.com/ludios/grab-site</a><p>it's a wrapper around (and a fork of) wpull, but the main advantage over wpull is that it can do on-the-fly reconfiguration of delay, concurrency, ignore patterns and so on. it also provides a nice web interface. if you're only crawling one site every once in a while, wpull and crawl are fine, but for larger projects, grab-site is a must.
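basic usage is about as simple as it gets; from memory, roughly:<p><pre><code> # one-off crawl; output lands in a new directory containing the WARC
 grab-site 'https://example.com/'

 # the web dashboard, where delay/concurrency/ignores can be changed mid-crawl
 gs-server    # then browse to http://127.0.0.1:29000/
</code></pre>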
I was working on an archiving tool a little while back, though I haven't touched it recently.<p>It would recursively convert a page into a single URI. Chrome seems to have a length limit for URLs, but Firefox doesn't, so far as I can tell.<p>Copy the contents of [0] into your URL bar, and you'll see not just the page, but also the Python script, which is embedded in it too. (It's a bit long to dump onto a forum page.)<p>[0] <a href="https://shakna.keybase.pub/offlineweb" rel="nofollow">https://shakna.keybase.pub/offlineweb</a>
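The final packaging step, minus the recursive inlining of sub-resources (which is where all the real work is), is essentially just stuffing the page into a data: URI; a rough shell approximation (GNU coreutils base64) would be:<p><pre><code> # turn a saved page into a single data: URI you can paste into the URL bar
 printf 'data:text/html;base64,%s\n' "$(base64 -w0 page.html)" > page.uri
</code></pre>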
Those desiring something simple can check out <a href="https://www.pagedash.com/" rel="nofollow">https://www.pagedash.com/</a>. Note: requires login and saves pages to the cloud.<p>(Disclaimer: I am the maker of PageDash)
I use a wget invocation similar to the one listed for ps-2.kev009.com. I've recently used HTTrack in a few places where that had issues, and was impressed with it as well.
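For anyone who hasn't tried it, the HTTrack equivalent of a basic wget mirror is roughly this (example.com is a placeholder, obviously):<p><pre><code> # mirror a site into ./example-mirror, staying on the same domain
 httrack "https://example.com/" -O ./example-mirror "+*.example.com/*"
</code></pre>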
Love wget! The Wayback Machine is a great tool, but I wish there were a more robust/complete service out there. Maybe the government is archiving, or will archive, the top million sites or something like that.