ArchiveBox: Open-source self-hosted web archiving

250 points by mieubrisse over 1 year ago

16 comments

Thanks for posting @mieubrisse! I haven't posted it on HN myself in a long time, but I just released ArchiveBox v0.7.2 a couple days ago, so it's great timing. I encourage people to also check out the list of ArchiveBox alternatives we maintain if ArchiveBox doesn't quite fit your needs: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#other-archivebox-alternatives

jrm4 over 1 year ago

Love this. That being said, I tried a bunch of these and landed on Shiori. My take was that ArchiveBox is great if you definitely want options and want to be comprehensive, but if you're mostly just going for the articles and text and want something simpler, this is it. (I teach at a college and don't want to lose good articles, and it also gives me some nice uniform formatting.) https://github.com/go-shiori/shiori

kornhole over 1 year ago

I spun up my own ArchiveBox after archive.org wouldn't let me archive some news stories and I heard about them removing other content. Instead of calling the Internet Archive the Wayback Machine, I now call it the maybe back machine. IA is a centralized service, subject to the government and other powerful pressures that any popular centralized service faces. If you want to archive something that people in power might want erased, now or in the future, you should decentralize it to somewhere like an ArchiveBox. This is especially useful if you are writing a book with many citations.

abound over 1 year ago

This is uncanny, I just discovered ArchiveBox earlier today and set up a self-hosted instance on some home hardware for a collection of bookmarks of useful guides, tutorials, and references I've collected over the years. Setting it up on K8s with sonic [1] as the search backend and importing a few hundred URLs only took ~an hour or so, and the cached pages look great for the most part. [1] https://github.com/valeriansaliou/sonic

tardisx over 1 year ago

I looked at ArchiveBox and several similar projects a while ago, but realised I didn't want anything so complex. I just wanted bookmarks, with free-text content search so I could find something again based on more than just a title. So I wrote my own: https://github.com/tardisx/linkwallet Emphasis on tiny system requirements and dependencies (single binary, no service dependencies). As a consequence the text indexing is very basic (basic HTML scrape). But it's working for me :-)
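
For readers wondering what "basic HTML scrape" plus free-text search can look like, here is a minimal sketch in Python using only the standard library. It is not linkwallet's actual code; the names `TextExtractor`, `extract_words`, and `BookmarkIndex` are hypothetical and chosen only for illustration. It strips tags, skips script and style contents, and keeps an in-memory inverted index from word to URLs.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words(html):
    """Return the set of lowercase alphanumeric words found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower()))


class BookmarkIndex:
    """Hypothetical in-memory inverted index: word -> set of bookmark URLs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, url, html):
        for word in extract_words(html):
            self._postings[word].add(url)

    def search(self, query):
        """Return URLs whose text contains every word in the query."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        results = None
        for word in words:
            urls = self._postings.get(word, set())
            results = set(urls) if results is None else results & urls
        return results
```

A query is an AND over its words, so `search("archiving bookmarks")` only returns pages that contain both terms. A real tool would persist the index and probably rank results; this sketch does neither.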

parasti over 1 year ago

The screenshot section single-handedly breaks mobile UX due to overflow.

dundarious over 1 year ago

I researched various archiving alternatives for something I needed recently. I subscribe to a paid Substack for an educational course that will end mid-year, and I want to archive the course posts before it ends (the course provider has even recommended people end their Substack subscription after it ends). For this purpose, I found the SingleFile browser extension to be the best fit. It's a browser extension, so paywall cookies are already present, and I just manually archive the previous week's content after the discussion phase has concluded. It creates a single self-contained file with all images, comments, etc., but all non-page-local links still resolve externally (which is as desired for my use case). It can be configured to auto-generate a convenient filename and to use self-extracting compression. I preferred this to an automated process based on, e.g., RSS, because I can ensure the archive occurs after all the useful back-and-forth in the course comments has concluded, and it's trivial to set up and use.

dtkav over 1 year ago

I also came across ArchiveBox a few days ago to see if I should migrate off my home-grown solution with Puppeteer, SingleFile & readability.js. I've been working on getting it deployed to fly.io with LSVD so it can scale to zero while storing everything on an S3-backed volume as described here [0]. My biggest disappointment so far is that it seems like a fairly large lift to make uBlock Origin work because extensions don't work in headless Chrome (?). It seems like using Pi-hole is currently the best method to block ads [1]. [0] https://community.fly.io/t/bottomless-s3-backed-volumes/15648 [1] https://github.com/ArchiveBox/ArchiveBox/issues/211

loceng over 1 year ago

Are there any figures available anywhere as to how many people actively-passively maintain a personal-private archive?

keepamovin over 1 year ago

For anyone who uses Chrome and wants to view their archived pages in the browser as if they were still online (URL and everything intact), and also full-text search through their browsing history that was archived (like AB plans to add in future, I think, right nikki?), you can check out DownloadNet: https://github.com/dosyago/DownloadNet You can have multiple archives, and even use a mode where you only archive pages you bookmark rather than everything.

rgomez over 1 year ago

Over the last year I've been working on an open source Golang tool with a more modest approach for now (just command line) but a similar goal (keeping personal info). In my tool, formats are described using simple YAML templates and stored in a SQLite db file (https://github.com/khromalabs/keeper). Glad to know about more open source tools exploring similar ideas.

dugite-code over 1 year ago

ArchiveBox is a great bit of kit and I've been using it for a while. I'm currently ingesting my browser bookmarks from Nextcloud Bookmarks (using floccus sync from my browser) via RSS. That said, even though its archiving features are poorer, I've been looking into using linkwarden for the partner approval factor and better integration with my SSO setup.
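
As a rough illustration of the "bookmarks via RSS" step, here is a small Python sketch that pulls item links out of an RSS 2.0 document so they can be handed to whatever archiver you use. It assumes a plain RSS 2.0 feed with `<item><link>` elements; the function name `links_from_rss` is hypothetical, and this is not how ArchiveBox or Nextcloud Bookmarks implement it.

```python
import xml.etree.ElementTree as ET


def links_from_rss(rss_xml):
    """Return the unique <item><link> URLs from an RSS 2.0 string, in feed order."""
    root = ET.fromstring(rss_xml)
    links = []
    seen = set()
    for item in root.iter("item"):
        # findtext returns None when an item has no <link> child.
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links
```

Writing the result one URL per line gives a plain list that most archiving tools can import. Deduplicating here keeps a bookmark that appears twice in the feed from being queued twice.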

A4ET8a8uTh0 over 1 year ago

For those who want to test in Unraid and run into a root issue after initial setup: https://3xn.nl/projects/category/unraid/ First time user, but it's one of those things I did not know I wanted.

theK over 1 year ago

This is awesome. I couldn't identify from the readme how you tell it what to save, and was wondering whether this could be driven by a browser add-on/extension?

CrypticShift over 1 year ago

This is one of those great projects that would benefit from local LLM integration.

valsk over 1 year ago

This was created 5 years ago..
