Thanks for posting @mieubrisse! I haven't posted it on HN myself in a long time but I just released ArchiveBox v0.7.2 a couple days ago, so it's great timing.<p>I encourage people to also check out the list of ArchiveBox alternatives we maintain if ArchiveBox doesn't quite fit your needs.<p><a href="https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#other-archivebox-alternatives">https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...</a>
Love this. That said, I tried a bunch of these and landed on Shiori. My take was that ArchiveBox is great if you definitely want options and comprehensive coverage, but if you're mostly just going for the articles and text and want something simpler, Shiori is it. (I teach at a college and don't want to lose good articles, and it also gives me some nice uniform formatting.)<p><a href="https://github.com/go-shiori/shiori">https://github.com/go-shiori/shiori</a>
I spun up my own ArchiveBox after archive.org wouldn't let me archive some news stories and I heard about them removing other content. Instead of calling the Internet Archive the Wayback Machine, I now call it the maybe back machine. IA is a centralized service and subject to the government and other pressures that any popular centralized service faces. If you want to archive something that people in power might want erased, now or in the future, you should decentralize it to somewhere like an ArchiveBox instance. This is especially useful if you are writing a book with many citations.
This is uncanny, I just discovered ArchiveBox earlier today and set up a self-hosted instance on some home hardware for a collection of bookmarks of useful guides, tutorials, and references I've collected over the years.<p>Setting it up on K8s with sonic [1] as the search backend and importing a few hundred URLs only took ~an hour or so, and the cached pages look great for the most part.<p>[1] <a href="https://github.com/valeriansaliou/sonic">https://github.com/valeriansaliou/sonic</a>
I looked at ArchiveBox and several similar projects a while ago, but realised I didn't want anything so complex. I just wanted bookmarks, with free-text content search so I could find something again based on more than just a title.<p>So I wrote my own: <a href="https://github.com/tardisx/linkwallet">https://github.com/tardisx/linkwallet</a><p>Emphasis on tiny system requirements and dependencies (single binary, no service dependencies). As a consequence the text indexing is very basic (basic HTML scrape). But it's working for me :-)
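For anyone curious what a "basic HTML scrape" index for free-text bookmark search might look like, here is a minimal Python sketch using only the standard library. It is not linkwallet's code; `TextExtractor`, `extract_words`, and `BookmarkIndex` are hypothetical names made up for illustration, and a real tool would also need fetching, persistence, and ranking.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words(html):
    """Return the set of lowercase alphanumeric words found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower()))


class BookmarkIndex:
    """Hypothetical in-memory inverted index: word -> set of bookmark URLs."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, url, html):
        for word in extract_words(html):
            self._index[word].add(url)

    def search(self, query):
        """Return URLs whose page text contains every word in the query."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        results = set(self._index.get(words[0], set()))
        for word in words[1:]:
            results &= self._index.get(word, set())
        return results
```

Usage would be along the lines of `idx = BookmarkIndex(); idx.add(url, page_html); idx.search("single binary")`. Matching on whole words only keeps it tiny, at the cost of no stemming or phrase search.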
I researched various archiving alternatives for something I needed recently. I subscribe to a paid Substack for an educational course that will end mid-year, and I want to archive the course posts before it ends (the course provider has even recommended people end their Substack subscription after it ends).<p>For this purpose, I found the SingleFile browser extension to be the best fit. It's a browser extension, so paywall cookies are already present, and I just manually archive the previous week's content, <i>after</i> the discussion phase has concluded. It creates a single self-contained file with all images and comments, etc., but all non-page-local links still resolve externally (which is as desired, for my use case). It can be configured to auto-generate a convenient filename, and to use self-extracting compression.<p>I preferred this to an automated process based on, e.g., RSS, because I can ensure the archive occurs <i>after</i> all the useful back-and-forth in the course comments has concluded, and it's trivial to set up and use.
I also came across ArchiveBox a few days ago to see if I should migrate off my home-grown solution with Puppeteer, SingleFile & readability.js.<p>I've been working on getting it deployed to fly.io with LSVD so it can scale to zero while storing everything on an S3-backed volume as described here[0].<p>My biggest disappointment so far is that it seems like a fairly large lift to make uBlock Origin work because extensions don't work in headless Chrome (?). It seems like using Pi-hole is the current best method to block ads [1].<p>[0] <a href="https://community.fly.io/t/bottomless-s3-backed-volumes/15648">https://community.fly.io/t/bottomless-s3-backed-volumes/1564...</a>
[1] <a href="https://github.com/ArchiveBox/ArchiveBox/issues/211">https://github.com/ArchiveBox/ArchiveBox/issues/211</a>
For anyone who uses Chrome and wants to view their archived pages in the browser as if they were still online (URL and everything intact), and also full-text search through their browsing history that was archived (like AB plans to add in future, I think, right nikki?) you can check out DownloadNet: <a href="https://github.com/dosyago/DownloadNet">https://github.com/dosyago/DownloadNet</a><p>You can have multiple archives, and even use a mode where you only archive pages you bookmark rather than everything.
Since last year I've been working on an open source tool in Go with a more modest approach for now (command line only) but a similar goal (keeping personal info). In my tool, formats are described using simple YAML templates and stored in a SQLite db file (<a href="https://github.com/khromalabs/keeper">https://github.com/khromalabs/keeper</a>). Glad to know about more open source tools exploring similar ideas.
ArchiveBox is a great bit of kit and I've been using it for a while. I'm currently ingesting my browser bookmarks from Nextcloud bookmarks (using floccus sync from my browser) via RSS. That said, even though its archiving features are poorer, I've been looking into using linkwarden for the partner approval factor and better integration with my SSO setup.
For those who want to test on Unraid and run into a root issue after initial setup:<p><a href="https://3xn.nl/projects/category/unraid/" rel="nofollow">https://3xn.nl/projects/category/unraid/</a><p>First time user, but it's one of those things I did not know I wanted.
This is awesome. I couldn't tell from the README how you tell it what to save, and was wondering whether this could be driven by a browser add-on/extension?