
Show HN: Local Node.js app to save everything you browse and serve it offline

406 points by archivist1 over 5 years ago

23 comments

ShorsHammer over 5 years ago
It's a bit disappointing to see other people showcasing their own work without a single mention of the link above. Perhaps make your own submission instead, if that's the intention?

As a casual reader, and given others' obvious interest in this area, I'd much prefer a sentence or two about the quality of the work presented; feel free to link your own stuff afterwards. It's a bit off-putting to see such blatant self-promotion.
thefreeman over 5 years ago
Congrats on getting your project to the front page of HN. That said, I think you are going to need to change your approach if you want this to be usable as more than a toy project in the long run.

From what I can tell, it essentially saves a map of url -> response in memory as you browse. Every 10 seconds this map is serialized to JSON and dumped to a cache.json file. This is going to be very inefficient as the number of web pages indexed grows, since you are rewriting the entire cache every 10 seconds even if only a few pages have been added to it. It will also eventually exceed the memory of the computer running the app, because the content of every page ever visited has to be loaded into memory. I highly recommend looking into some of the other suggestions mentioned here: either sqlite, or mapping a local directory structure onto your caching strategy, so that you can easily query a given url without keeping the entire cache in memory, and also add or update urls without rewriting the entire cache.
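A minimal sketch of the SQLite route suggested above, assuming the better-sqlite3 package; the schema and function names are illustrative, not the project's actual code:

```ts
import Database from 'better-sqlite3';

const db = new Database('cache.db');
db.exec(`CREATE TABLE IF NOT EXISTS pages (
  url      TEXT PRIMARY KEY,
  body     BLOB NOT NULL,
  saved_at INTEGER NOT NULL
)`);

const put = db.prepare('INSERT OR REPLACE INTO pages (url, body, saved_at) VALUES (?, ?, ?)');
const get = db.prepare('SELECT body FROM pages WHERE url = ?');

// One small upsert per captured page instead of re-serializing the whole map.
export function save(url: string, body: Buffer): void {
  put.run(url, body, Date.now());
}

// A lookup touches one row; the archive never has to fit in memory.
export function load(url: string): Buffer | undefined {
  return (get.get(url) as { body: Buffer } | undefined)?.body;
}
```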
grizzles over 5 years ago
You could store the data in a git repo per domain, so that implicit de-duplication happens on re-visits and for shared resources.

You could have a raw dir (the files you receive from the server) and a render dir that consists of snapshots of the DOM + CSS, with none of the JS and external-resource complexity.

When the global archive becomes too big, history could be discarded from all the git repos by dropping the oldest commit in each repo, and so on.

SOLR is probably the right tool for the index, but there is something undeniably appealing about staying in the pure file paradigm; you could use sqlite's FTS5 module to do that too.
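For the FTS5 option, a sketch (again assuming better-sqlite3; table and column names are illustrative) of indexing and querying extracted page text without running a separate SOLR instance:

```ts
import Database from 'better-sqlite3';

const db = new Database('archive.db');
db.exec('CREATE VIRTUAL TABLE IF NOT EXISTS page_index USING fts5(url, text)');

const add = db.prepare('INSERT INTO page_index (url, text) VALUES (?, ?)');
const search = db.prepare(
  'SELECT url FROM page_index WHERE page_index MATCH ? ORDER BY rank'
);

add.run('https://example.com/', 'extracted page text goes here');
console.log(search.all('page AND text')); // FTS5 boolean query syntax
```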
aloer over 5 years ago
What are the security implications of permanently running Chrome in remote debug mode?

A bit more than half a year ago I started playing around with this, and was surprised that on the one hand there are really, really good tools nowadays for self-archiving, but on the other hand there hasn't been any progress in packaging them in a way that is comfortable for end users.

My working theory right now is that saving every request/response, as well as every interaction on a page, should allow us to completely restore website state at any point in time, and will open up some super interesting use cases around our interaction with information found online.

But in order to do this it seems necessary to go through the remote debug protocol, like this project is doing. And since this is somewhat of an unusual approach, I could not find much information about the security aspect of running every site, at any time, with remote debugging activated. Common web scrapers/archiving tools will instead only use remote Chrome debugging to open and capture specific urls.

Storage is so dirt cheap today that there is zero reason why we shouldn't have reliable historic website state for everything we have ever looked at.

And judging by the HN front pages of the last months, many here are interested in this and related use cases (search/index/annotations/collaborative browsing).
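For reference, a sketch of what capture over the DevTools protocol looks like, assuming the chrome-remote-interface package and Chrome launched with --remote-debugging-port=9222. The security worry is exactly that anything which can reach this port can read cookies and inject script, so it should stay bound to localhost (the default):

```ts
import CDP from 'chrome-remote-interface';

const urls = new Map<string, string>(); // requestId -> url

const client = await CDP({ port: 9222 });
const { Network } = client;
await Network.enable();

// Remember which URL each request belongs to.
Network.responseReceived(({ requestId, response }) => {
  urls.set(requestId, response.url);
});

// Fetch the body once the response has fully arrived.
Network.loadingFinished(async ({ requestId }) => {
  try {
    const { body, base64Encoded } = await Network.getResponseBody({ requestId });
    console.log(urls.get(requestId), base64Encoded ? 'binary' : 'text', body.length);
    // Persist (url, body) here instead of logging.
  } catch {
    // The body may already have been evicted from Chrome's buffer.
  }
});
```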
EastSmith over 5 years ago
About 20 years ago I used a program called Teleport Pro to do something similar.

I would dial up with my phone modem when internet access was cheap (during the night), it would automatically browse a page I provided, and in the morning I would have the page ready to read.

Fun times with 10 to 20 kb/s speeds.
Eikon over 5 years ago
I'm curious why you went down the path of using Chrome's debugging functionality instead of implementing an HTTP proxy, which would provide the benefit of being browser-agnostic too.

Could you expand on that, please?
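A sketch of that proxy alternative: a forward proxy that records every response passing through it, browser-agnostic by construction. This handles plain HTTP only; HTTPS would additionally need CONNECT handling and a locally trusted CA, which is the usual cost of the proxy approach. All names here are illustrative:

```ts
import http from 'node:http';

const archive = new Map<string, Buffer>(); // url -> recorded response body

http.createServer((req, res) => {
  // When a browser uses this server as its proxy, req.url is the absolute URL.
  const upstream = http.request(
    req.url!,
    { method: req.method, headers: req.headers },
    (up) => {
      res.writeHead(up.statusCode ?? 502, up.headers);
      const chunks: Buffer[] = [];
      up.on('data', (chunk: Buffer) => chunks.push(chunk));
      up.on('end', () => archive.set(req.url!, Buffer.concat(chunks)));
      up.pipe(res); // stream to the browser while recording
    }
  );
  upstream.on('error', () => {
    if (!res.headersSent) res.writeHead(502);
    res.end();
  });
  req.pipe(upstream);
}).listen(8080);
```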
wanderingstan over 5 years ago
Great feature. Though it feels like a UI misstep that the user has to use npm to switch between recording and browsing. A nicer solution could be a Chrome extension button, or accessing the archived version via a synthetic domain, e.g. example.com.archived
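A hypothetical sketch of that synthetic-domain idea: a local server that maps Host headers like example.com.archived back to the recorded origin, so recording and browsing can coexist without an npm mode switch. The archive map stands in for whatever store the recorder actually writes:

```ts
import http from 'node:http';

const archive = new Map<string, Buffer>(); // url -> saved response body

http.createServer((req, res) => {
  const host = req.headers.host ?? '';
  if (!host.endsWith('.archived')) {
    res.writeHead(404);
    return res.end();
  }
  // example.com.archived -> look up the saved copy of http://example.com/...
  const origin = host.slice(0, -'.archived'.length);
  const body = archive.get(`http://${origin}${req.url}`);
  if (!body) {
    res.writeHead(404);
    return res.end();
  }
  res.end(body);
}).listen(8081);
```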
jdmoreira over 5 years ago
Great idea! I like the concept.

One of the things I miss most about the old web is how trivial it was to make a local mirror of any website. It was great!
ksec over 5 years ago
I remember trying something similar a long time ago, but I decided it wasn't worth it.

At 2 MB per page and 100 pages a day, that is 200 MB per day, or 73 GB per year.

Maybe once a year I hit the problem where I remember reading something but cannot Google my way back to the exact page. I had a proxy solution set up, but the math worked out that it wasn't worth paying the storage cost just for that one-time convenience.
mikece over 5 years ago
Between this project and the others mentioned in the discussion, these are excellent resources for anyone needing a forensic record of how they assembled evidence from browsing open sources on the internet. Package this as a VM that can be quickly spun up fresh per case, sell support to law-enforcement types, and you've got a business.
jimbob45 over 5 years ago
Nitpicking, but am I the only one who hates "serve" being used in strange contexts? IMHO, to serve is to send something over a network. If it's all happening locally, the verb should be "load", because at that point it's just taking a file and loading it into a browser.
jchook over 5 years ago
Really brilliant implementation concept.

I love how it uses the browser's debug port to save literally everything. I have often dreamed of "a Google for everything I've seen before".

I recently spent some time making something like this and hope to release it soon as FOSS. However, it differs in some critical ways. I want to:

- save pages of interest, but not a firehose of everything I ever see

- save from anywhere, on any internet device (e.g. a mobile phone)

- archive rich content like YouTube videos or songs, even if I do not watch the entire video (or any of it), with support for credentials (e.g. .netrc)

Looking forward to digging deeper into this thread and your project for more ideas!
olah_1 over 5 years ago
You should add "upload / sync with decentralized storage" to the future goals.

Seems like a logical next step to have it sync to an IPFS or Dat drive. Not sure how it would be implemented, though.
jan6 over 5 years ago
I love how there's only a single browser or two in the entire world, lol (Safari I've got no clue about), and that's while assuming Chrome's and Firefox's debugging streams would be compatible...

You assume I don't use any forks or custom versions. What if I use an Electron-based browser? What about Pale Moon or other forks, which have older such interfaces, if any? What about Opera? Etc., etc., you get the point... I hope...
CGamesPlay over 5 years ago
Bump for my related project: https://github.com/CGamesPlay/chronicler

I'm actually in the process of rewriting this. I like your approach of using DevTools to manage the requests; the approach taken in Chronicler is to hook into Chrome's actual request engine.

You might like to look at Chronicler to see some attempts at UI for a project like this, in particular decisions around what to download and how to retrieve it.
it over 5 years ago
Related: I'm making a program in Go to inline all the resources for a web page, so it ends up being a single file that you can work with offline more easily: https://github.com/ijt/inline
jimktrains2 over 5 years ago
I've been building something similar, but it uses Firefox Sync to grab history and bookmarks: https://github.com/jimktrains/ffsyncsearch
archivist1 over 5 years ago
If anyone would be interested in the next major version, please add your email to this list to be notified: https://forms.gle/FJmsXCDy18RrbFtt9
calpaterson over 5 years ago
Nice job. I think this is promising, but there has got to be a better way than having people enable their debugger. Is there any reason you can't just copy the contents of each page and then post it somewhere?
mauricesvay over 5 years ago
Why not use a proxy?
lixtra over 5 years ago
Related: wwwoffle: http://www.gedanken.org.uk/software/wwwoffle/
dustingetz over 5 years ago
Could this beat Google? Local search of anything I have seen, plus silo search sites for specific purposes like Amazon and HN. Would you miss anything, given that Google results are either bought or gamed? Maybe we'd also need a better social media.
fake-name over 5 years ago
Obligatory bump for my project ReadableWebProxy (https://github.com/fake-name/ReadableWebProxy), which was originally intended to do this.

At this point it does a GIANT pile of additional things, most of which are specific to my interests, but I think it might be at least marginally interesting to others.

It does both fully autonomous web-spidering of sites you specify and synchronous rendering (you can browse other sites through it, with all links rewritten to internal links, and content for unknown sites fetched on the fly).

I solve the JavaScript problem largely by removing all of it from the content I forward to the viewing client, though I do support remote sites that load their content through JS via headless Chromium (I wrote a library for managing Chrome that exposes the entire debugging protocol: https://github.com/fake-name/ChromeController, https://pypi.org/project/ChromeController/).