For personal web archiving, I highly recommend <a href="http://webrecorder.io" rel="nofollow">http://webrecorder.io</a>. The site lets you download archives in standard WARC format and play them back in an offline (Electron) player. It's also open source and has a quick local setup via Docker - <a href="https://github.com/webrecorder/webrecorder" rel="nofollow">https://github.com/webrecorder/webrecorder</a> .<p>Webrecorder is by a former Internet Archive engineer, Ilya Kreymer, who now captures online performance art for an art museum. What he's doing with capture and playback of Javascript, web video, streaming content, etc. is state of the art as far as I know.<p>(Disclaimer - I use bits of Webrecorder for my own archive, perma.cc.)<p>For OP, I would say consider building on and contributing back to Webrecorder -- or alternatively figure out what Webrecorder is good at and make sure you're good at something different. It's a crazy hard problem to do well and it's great to have more ideas in the mix.
It's not mine unless it's running on my own servers or computer. I created a really rough version of this several years ago that is saved to my computer (and from there into Box).
That's just as much "my own" as The Internet Archive: a website Out There somewhere. Worse, it's much more likely to rot and disappear than archive.org. Now, if I could run this <i>locally</i>...<p>(Yes, yes, `wget --convert-links`, I know. Not quite as convenient, though.)
I would be interested in an attestation service that can provide court-admissible evidence that a particular piece of content was publicly accessible on the web at a particular point in time via a particular url.<p>I believe the only way to incentivise participation in such a system is by paying for timestamped signatures, e.g. "some subset of downloaded [content] from [url] at [time] hashed to [hash]" all tucked into a Bitcoin transaction or something. There are services that will do this with user-provided content[1]; I am looking for something that will pull a url and timestamp the content.<p>This would also be a way to detect when different users are being served different content at the same url, thus the need for a global network of validators.<p>[1] <a href="https://proofofexistence.com/" rel="nofollow">https://proofofexistence.com/</a>
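To make that concrete, here's a minimal Python sketch of what one of those attestation records could look like: fetch the url, hash the body, record the time. The record shape is my own illustration, and the actual anchoring step (Bitcoin transaction, OpenTimestamps-style proof, whatever) is left out entirely:

    import hashlib
    import json
    from datetime import datetime, timezone

    import requests  # third-party HTTP client, assumed available


    def attest(url: str) -> dict:
        """Build a '[content] from [url] at [time] hashed to [hash]' record."""
        resp = requests.get(url, timeout=30)
        digest = hashlib.sha256(resp.content).hexdigest()
        return {
            "url": url,
            "time": datetime.now(timezone.utc).isoformat(),
            "sha256": digest,
            "status": resp.status_code,
        }


    print(json.dumps(attest("https://example.com/"), indent=2))

Each validator in the network would independently run the same fetch-and-hash step and compare digests before co-signing and anchoring the record.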
In what way could this be considered “your own internet archive”? I see no way to register a user and save pages to a collection.<p>If you really want to create <i>your own</i> archive, set up a Live Archiving HTTP Proxy[1], run SquidMan [2] or check out WWWOFFLE[3].<p>If you want something simpler, have a look at Webrecorder[4] or a paid Pinboard account with the “Bookmark Archive”[5].<p>[1] <a href="http://netpreserve.org/projects/live-archiving-http-proxy/" rel="nofollow">http://netpreserve.org/projects/live-archiving-http-proxy/</a><p>[2] <a href="http://squidman.net/squidman/index.html" rel="nofollow">http://squidman.net/squidman/index.html</a><p>[3] <a href="http://www.gedanken.org.uk/software/wwwoffle/" rel="nofollow">http://www.gedanken.org.uk/software/wwwoffle/</a><p>[4] <a href="https://webrecorder.io/" rel="nofollow">https://webrecorder.io/</a><p>[5] <a href="https://pinboard.in/upgrade/" rel="nofollow">https://pinboard.in/upgrade/</a>
An internet archive can only provide value if it's there for the long term. What's your plan to keep this service running if it gets popular? For example, archive.is cost about $2,000/month at the start of 2014 [1]. I expect it to cost even more now.<p>[1]: <a href="http://blog.archive.is/post/72136308644/how-much-does-it-cost-you-to-host-a-website-of" rel="nofollow">http://blog.archive.is/post/72136308644/how-much-does-it-cos...</a>
Thoughts:<p>I like the look. Very clean. I like how fast it's responding; better than archive.org (though, obviously, they have different scaling problems).<p>"Your own internet archive" might be overselling it, as other commenters have pointed out; the "Your" feels a bit misleading. I think "Save a copy of any webpage.", which you use on the site itself, gives a better impression.<p>The "Archive!" link probably shouldn't work if there's nothing in the URL box. It just gives me an archive link that errors. Example: [1]<p>Using it on news.YC as a test gave me errors with the CSS & JS [2]. This might be because HN appends query parameters to its CSS and JS URLs, which then repeat in the tesoro URL and may not be parsed correctly.<p>Maybe have something in addition to an email link for submitting error reports like the above, just because I'd be more likely to file a GitHub issue (even if the repo is empty) than send a stranger an email.<p>As other commenters have pointed out, archive.is also does this, and their longevity helps me feel confident that they'll still be around. Perhaps, if you wish to differentiate, offer some way for me to "own" the copy of the page, like downloading it or emailing it to myself or sharing it with another site (like Google Docs or Imgur) to leverage redundancy, or something like that. Just a thought.<p>All in all, nice Show HN.<p>EDIT: You may also want to adjust the header to work properly on mobile devices. Still though, nice job. Sorry if I'm sounding critical.<p>[1] <a href="https://archive.tesoro.io/320b55cc9b78e271c94716ee23554da8" rel="nofollow">https://archive.tesoro.io/320b55cc9b78e271c94716ee23554da8</a><p>[2] <a href="https://archive.tesoro.io/a7bf03e247224bc3b4e5a7c1f2ad42b1" rel="nofollow">https://archive.tesoro.io/a7bf03e247224bc3b4e5a7c1f2ad42b1</a>
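(For what it's worth, a guess at the CSS/JS problem: if archived assets are keyed only by their path, two assets that differ only in their query string, like HN's versioned news.css, would collide. Keying assets by their full URL avoids that. A quick Python illustration, with all the details hypothetical:

    import hashlib
    from urllib.parse import urlsplit

    def archive_key(asset_url: str) -> str:
        """Key an archived asset by its full URL so versioned assets stay distinct."""
        parts = urlsplit(asset_url)
        # Keep the path *and* the query string in the key.
        canonical = f"{parts.scheme}://{parts.netloc}{parts.path}?{parts.query}"
        return hashlib.sha256(canonical.encode()).hexdigest()

    print(archive_key("https://news.ycombinator.com/news.css?abc123"))
    print(archive_key("https://news.ycombinator.com/news.css?def456"))  # different key

No idea if that's what tesoro is actually doing internally, of course.)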
What's the best way to automatically archive all of the data I produce on websites? Facebook, Twitter, Instagram, blogs, and so on. At some point these services will disappear, and I want to preserve them.<p>I know a lot of these sites have archiving features, but want something centralised and automatic.
Nice, post it on <a href="https://www.reddit.com/r/DataHoarder/" rel="nofollow">https://www.reddit.com/r/DataHoarder/</a><p>They will love it!
Cool tool, but by using it, you depend on it staying alive for longer than any page you archive on it.<p>This got me thinking about how a decentralized p2p internet archive could solve the trust problem that exists in centralized internet archives. Such a solution could also increase the capacity of archived pages and the frequency at which archived pages are updated.<p>It is true that keeping the entire history of the internet on your local drive is likely impossible, but a solution similar to what Sia is doing could solve this problem: split each page into 20 pieces and distribute them across peers such that any 10 pieces can recover the original page. So you only have to trust that 10 of the 20 peers that store a page are still alive to get the complete page.<p>The main problem I can see right now would be lack of motivation to contribute to the system -- why would people run nodes? Just because it would feature yet another cryptocurrency? Sure, this could hold now, but when the cryptocurrency craze quiets down and people stop buying random cryptocurrencies just for the sake of trading them, what then? Who would run the nodes and why?
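To illustrate the "any k of n pieces" property in the smallest possible way, here's a toy 2-of-3 Python sketch using a single XOR parity piece. A real network (like Sia) would use proper Reed-Solomon-style erasure coding to get something like 10-of-20, but the principle is the same: lose any one piece and the page still comes back.

    def split_2_of_3(data: bytes) -> list[bytes]:
        """Split data into two halves plus an XOR parity piece."""
        half = (len(data) + 1) // 2
        a, b = data[:half], data[half:].ljust(half, b"\0")
        parity = bytes(x ^ y for x, y in zip(a, b))
        return [a, b, parity]

    def recover(pieces: dict[int, bytes], original_len: int) -> bytes:
        """Rebuild the data from any two of the three pieces."""
        a, b, p = pieces.get(0), pieces.get(1), pieces.get(2)
        if a is None:
            a = bytes(x ^ y for x, y in zip(b, p))
        if b is None:
            b = bytes(x ^ y for x, y in zip(a, p))
        return (a + b)[:original_len]

    page = b"<html>archived page body</html>"
    pieces = split_2_of_3(page)
    assert recover({0: pieces[0], 2: pieces[2]}, len(page)) == page  # piece 1 lost, still recoverable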
You're going to get this service shut down if you let anonymous people republish arbitrary content while running everything on Google.<p>I (obviously) think personal archives are a great idea, but republishing is a hornets' nest.
You know, your site actually does a better job reproducing webpages than archive.org. I've noticed that if a webpage you're trying to archive on archive.org serves its CSS & JS from a CDN, it won't render correctly. On your site, there doesn't seem to be a problem including CSS & JS from an external domain. Thumbs up :)
When you said 'own internet archive' I thought you meant some sort of program you could download that'd save your browsing history (or whatever full website you wanted) to your hard drive. I think that would have been significantly more useful here.<p>As it is, while it's a nice service, it's still got all the issues of other archive ones:<p>1. It's online only, so one failed domain renewal or hosting payment takes everything offline.<p>2. It being online also means I can't access any saved pages if my connection goes down or has issues.<p>3. The whole thing is wide open to having content taken down by websites wanting to cover their tracks. I mean, what do you do if someone tells you to remove a page? What about with a DMCA notice?<p>It's a nice alternative to archive.is, but still doesn't really do what the title suggests if you ask me.
This might be a good use case for distributed storage (IPFS?).<p>Instead of hosting this directly on my computer, it would be interesting to have a setup where the archiving is done via the service and I would just provide storage space somewhere that the content ends up being mirrored to (just to guarantee that my valuable things are saved at least somewhere, should the other nodes decide to remove the content).<p>I would prefer this setup, because it would be easily accessible for me from any device and I would not need to worry about running some always-available system. With some suitable P2P setup my storage node would have less strict uptime requirements.
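Just to sketch what the "my node mirrors what the service archived" half could look like, assuming a local IPFS daemon, the third-party ipfshttpclient library, and a made-up endpoint where the service publishes the content IDs of pages I asked it to archive:

    import ipfshttpclient  # third-party IPFS API client, assumed available
    import requests

    client = ipfshttpclient.connect()  # talks to the local daemon's API

    # Hypothetical feed of content IDs the archiving service produced for my pages.
    cids = requests.get("https://archive.example.com/my/cids.json", timeout=30).json()

    for cid in cids:
        client.pin.add(cid)  # pinning keeps a local copy even if other nodes drop it
        print("mirrored", cid)

That way the always-on availability problem stays with the network, and my node just guarantees there's at least one copy I control.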
This is pretty cool. I have a Chrome extension that lets you view the cached version of a web page [1]. Would I be able to use this through an API? I currently support Google Cache, WayBack Machine, and CoralCDN, but Coral doesn't work well and I'd like to replace it with something else.<p>[1] <a href="https://chrome.google.com/webstore/detail/cmmlgikpahieigpcclckfmhnchdlfnjd" rel="nofollow">https://chrome.google.com/webstore/detail/cmmlgikpahieigpccl...</a>
I think you should explain why you're paying Google to archive web pages for others, i.e., how do you plan on benefiting from this? If you have some business model in mind, let people know now. It's the first question that comes to my mind when someone offers a service that is free yet costs the provider real money. You obviously can't pay Google to archive everyone's web pages just for the fun of it.
You should try and rewrite relative links in websites that get archived. I tested your app with a news site, and all the links go to archive.tesoro.io/sites/internal/url/structure/article.html<p>I also second the need for user accounts. If I am to use your site as my personal archive, then I would need to log in and create a collection of my own archived sites.
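For what it's worth, rewriting relative links against the original page URL is fairly mechanical. A minimal sketch, assuming BeautifulSoup and knowing nothing about how tesoro actually stores pages:

    from urllib.parse import urljoin
    from bs4 import BeautifulSoup  # third-party HTML parser, assumed available

    def rewrite_links(html: str, original_url: str) -> str:
        """Make relative hrefs/srcs absolute against the page's original URL."""
        soup = BeautifulSoup(html, "html.parser")
        for tag, attr in (("a", "href"), ("img", "src"), ("link", "href"), ("script", "src")):
            for el in soup.find_all(tag):
                if el.get(attr):
                    el[attr] = urljoin(original_url, el[attr])
        return str(soup)

    print(rewrite_links('<a href="/world/article.html">story</a>',
                        "https://news.example.com/index.html"))

The archiver could either do that, so links point back at the live site, or rewrite them to its own archived copies; either beats leaving them resolving against archive.tesoro.io's internal URL structure.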
I made a simple Chrome extension to automatically save local copies of pages you bookmark, if you prefer that instead: <a href="https://chrome.google.com/webstore/detail/backmark-back-up-the-page/cmbflafdbcidlkkdhbmechbcpmnbcfjf" rel="nofollow">https://chrome.google.com/webstore/detail/backmark-back-up-t...</a>
> Tesoro saves linked assets, such as images, Javascript and CSS files.<p>I'm confused. It looks like image sources in "archived" pages on Tesoro still point back to the origin domain.<p>Edit: it works as expected. I just didn't notice the relative paths.
When a company went down, I downloaded every one of their clients' sites with httrack and wget, just to be sure their clients wouldn't lose their site (and some other things).<p>I wonder what this site uses.