Wayback Machine Hits 400,000,000,000

300 points by tweakz about 11 years ago

14 comments

leorocky about 11 years ago

And not one of those hits is from Quora due to their robots.txt: <a href="https://web.archive.org/web/http://www.quora.com/" rel="nofollow">https://web.archive.org/web/http://www.quora.com/</a>

Good job Quora, preserving all that crowd sourced content away from the crowd, keeping it from everyone not logged in. Hats off to getting into YC so you can post your job openings on the HN home page and get some press. This doesn't add anything to your image though, it just takes a little away from YC.

On another, better note, a great big thank you to the Wayback Machine for all of the public good it does. Now there's an organization that is amazing and wonderful and enriching our lives in an open and honest way with information.

keenerd about 11 years ago

PSA/ranty thing: Just because something is archived in the Wayback Machine, do not trust that archive.org will keep it there for all time. If you need something, make a local copy! A few months ago TIA changed their stance on robots.txt. They now retroactively honor robot blocks. Now any site can completely vanish from the archives.

Let's say I died tomorrow. My family lets my domain slip. A squatter buys it and throws up a stock landing page, with a robots.txt that forbids spidering. TIA would delete my entire site from their index.

I've already lost a few good sites to this sort of thing. If you depend on a resource, archive it yourself.

Edit - Official policy: <a href="https://archive.org/about/exclude.php" rel="nofollow">https://archive.org/about/exclude.php</a>

If I am reading it properly, once blocked they never check later in case of a change of heart? No procedure for getting re-indexed at all?
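To make the squatter scenario concrete, here is a minimal sketch of how a blanket robots.txt rule reads to a crawler, using Python's standard `urllib.robotparser`. The two robots.txt bodies, the example URL, and the helper name `is_blocked` are made up for illustration, and the `ia_archiver` user-agent string is an assumption about how the archive's crawler identifies itself. This only shows how the rules are evaluated, not how TIA actually applies them to existing captures.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt a domain squatter might publish: block everything.
SQUATTER_ROBOTS = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that leaves the whole site open (empty Disallow).
OPEN_ROBOTS = """\
User-agent: *
Disallow:
"""

if __name__ == "__main__":
    page = "http://example.com/old-post.html"  # placeholder URL
    # "ia_archiver" is assumed here as the crawler's user-agent name.
    print("squatter rules block the page:", is_blocked(SQUATTER_ROBOTS, "ia_archiver", page))
    print("open rules block the page:", is_blocked(OPEN_ROBOTS, "ia_archiver", page))
```

Because the rule is written for `*`, it applies to any crawler that does not have a more specific entry, which is why a stock landing page's robots.txt can cover the whole history of a domain.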

dredmorbius about 11 years ago

I'd like to rave about an underappreciated but absolutely brilliant piece of the Internet Archive's infrastructure: its book reader (called, I gather, "BookReader").

TIA includes copious media archives including video, audio, and books. The latter are based on full-image scans and can be read online.

I generally dislike full-format reading tools: Adobe Acrobat, xpdf, evince, and other PDF readers all have various frustrations. Google's own online book reader is a mass of Web and UI frustrations.

I'm a guy who almost always prefers local to Web-based apps.

TIA's book reader is the best I've seen anywhere, hands down.

It's fast, it's responsive. The UI gets out of the way. Find your text and hit "fullscreen". Hit 'F11' on your browser to maximize it, you can then dismiss the (subtle) UI controls off the page and you are now ... reading your book. Just the book. No additional crap.

Page turn is fast. Zoomed, the view seems to autocrop to the significant text on the page. Unlike every last damned desktop client, the book remains positioned on the screen in the same position as you navigate forward or backward through the book. Evince, by contrast, will turn a page and then position it with the top left corner aligned. You've got to. Reposition. Every. Damned. Page. Drives me insane (but hey, it's a short trip).

You can seek rapidly through the text with the bottom slider navigation.

About the only additions I could think of would be some sort of temporary bookmark or ability to flip rapidly between sections of a book (I prefer reading and following up on footnotes and references, this often requires skipping between sections of a text).

Screenshot: <a href="http://i.imgur.com/Reg8KLB.png" rel="nofollow">http://i.imgur.com/Reg8KLB.png</a>

Source: <a href="http://archive.org/stream/industrialrevol00toyngoog#page/n6/mode/2up" rel="nofollow">http://archive.org/stream/industrialrevol00toyngoog#page/n6/...</a>

But, for whoever at TIA was responsible for this, thank you. From a grumpy old man who finds far too much online to be grumpy about, this is really a delight.

This appears to be an informational page with more links (including sources): <a href="https://openlibrary.org/dev/docs/bookreader" rel="nofollow">https://openlibrary.org/dev/docs/bookreader</a>

meritt about 11 years ago

Have there been any High Scalability articles on their infrastructure? We have a similar need: storing a large volume of text-based content over a period of time, with versioning as well. On top of it we have various metadata. We're currently storing everything in MySQL -- a lightweight metadata row and a separate table for the large (~400KB on average) BLOB fields in a compressed table.

We're looking at ways to improve our architecture: simply bigger+faster hardware? Riak with LevelDB as a backend? Filesystem storage with a database for the metadata? We even considered using version control such as git or hg, but that proved to be far too slow for reads compared to a PK database row lookup.

Any HN'ers have suggestions?
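One of the options floated above, filesystem storage with a database for the metadata, could be sketched roughly as follows. This is an illustration under stated assumptions, not a recommendation or a description of anyone's production setup: `sqlite3` stands in for MySQL, `zlib` stands in for the compressed table, blobs are named by their SHA-256 digest so identical versions are stored once, and the class, table, and column names are all hypothetical.

```python
import hashlib
import sqlite3
import tempfile
import zlib
from pathlib import Path


class VersionedStore:
    """Compressed blobs on disk, one lightweight metadata row per version."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(str(root / "meta.sqlite3"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS versions ("
            " doc_id TEXT NOT NULL,"
            " version INTEGER NOT NULL,"
            " sha256 TEXT NOT NULL,"
            " PRIMARY KEY (doc_id, version))"
        )

    def _blob_path(self, digest: str) -> Path:
        # Fan out by the first two hex characters to keep directories small.
        return self.root / "blobs" / digest[:2] / digest

    def put(self, doc_id: str, text: str) -> int:
        """Store a new version of doc_id and return its version number."""
        data = text.encode("utf-8")
        digest = hashlib.sha256(data).hexdigest()
        path = self._blob_path(digest)
        if not path.exists():  # identical content is written only once
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(zlib.compress(data))
        row = self.db.execute(
            "SELECT COALESCE(MAX(version), 0) FROM versions WHERE doc_id = ?",
            (doc_id,),
        ).fetchone()
        version = row[0] + 1
        with self.db:  # commits the insert, or rolls back on error
            self.db.execute(
                "INSERT INTO versions VALUES (?, ?, ?)", (doc_id, version, digest)
            )
        return version

    def get(self, doc_id: str, version: int) -> str:
        """Read one version: a primary-key lookup, then a single file read."""
        row = self.db.execute(
            "SELECT sha256 FROM versions WHERE doc_id = ? AND version = ?",
            (doc_id, version),
        ).fetchone()
        if row is None:
            raise KeyError((doc_id, version))
        return zlib.decompress(self._blob_path(row[0]).read_bytes()).decode("utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = VersionedStore(Path(tmp))
        first = store.put("page-1", "first draft")
        second = store.put("page-1", "second draft")
        print(first, second, store.get("page-1", first))
        store.db.close()
```

The read path stays a primary-key lookup, which was the property the git/hg experiment lost, while the large payloads move out of the database. Whether this is actually faster than the compressed MySQL table would need measuring on real data.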

swalsh about 11 years ago

If you're looking to donate, they take bitcoin too! <a href="https://archive.org/donate/index.php" rel="nofollow">https://archive.org/donate/index.php</a>

pimlottc about 11 years ago

A little known fact is that there is a mirror of the Wayback Machine hosted by The Bibliotheca Alexandrina: <a href="http://www.bibalex.org/isis/frontend/archive/archive_web.aspx" rel="nofollow">http://www.bibalex.org/isis/frontend/archive/archive_web.asp...</a>

I have sometimes had luck retrieving pages from this mirror that were unavailable (or returned errors) in the main site.

alternize about 11 years ago

Awesome! It's a great tool to go back in time to check out our past websites full of blinking gifs and whatnot.

I didn't know that they also maintain the "HTTP Archive", showing website latency over time as well as some interesting live-statistics: <a href="http://httparchive.org/" rel="nofollow">http://httparchive.org/</a>

Kenji about 11 years ago

Can anyone explain to me how displaying those sites on demand is not copyright infringement? I'm seriously curious, I don't know much about copyright laws.

Vecrios about 11 years ago

I still cannot fathom how they are able to store such huge amounts of data and not run out of space. Would anyone care to explain?

jackschultz about 11 years ago

Funny story about the Wayback Machine and how it helped me. I had let my blog go into disrepair for a couple months, and eventually, when I went back to it, I found that since I hadn't kept up with security updates, I wasn't able to access any of my old posts.

When I went back to start writing again (this time using paid hosting so I didn't have to deal with that), I was disappointed that I wasn't going to have the ~20-30 posts I had before. On a hunch, I checked the Wayback Machine and found that it had archived about 15 of my posts! Very excited that I could restore some of my previous writings.
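For anyone wanting to run the same kind of check on their own lost pages, a rough sketch in Python follows. It assumes the Wayback Machine's public availability endpoint at `https://archive.org/wayback/available` and a JSON reply with an `archived_snapshots.closest` entry; treat both as assumptions to verify against the archive's own documentation. The function names and the example URL are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the lookup URL; timestamp is an optional YYYYMMDD-style string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(payload: dict) -> Optional[str]:
    """Pull the closest snapshot URL out of a decoded reply, if there is one."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Ask the archive for the capture closest to timestamp (needs network)."""
    with urllib.request.urlopen(build_query(url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    # Placeholder address; substitute the URL of the page you lost.
    print(build_query("example.com/blog/my-old-post", "20140101"))
```

Calling `lookup(...)` on each old post URL and saving whatever it returns would cover the "on a hunch" step; pages that were never captured simply come back as `None`.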

ultrasandwich about 11 years ago

> Before there was Borat, there was Mahir Cagri. This site and the track it inspired on mp3.com created quite a stir in the IDM world, with people claiming that “Mahir Cagri” was Turkish for “Effects Twin” and that the whole thing was an elaborate ruse by Richard D. James (Aphex Twin). (Captured December 29, 2004 and December 7, 2000)

Okay this just blew my mind. Anyone else follow Aphex Twin's various shenanigans? Was this ever investigated further?

sutro about 11 years ago

Nice work on this over the years, gojomo et al.

mholt about 11 years ago

Cool, but on a lot of sites (including some of my own, from 10+ years ago to recently) it gets hardly any of the images. Am I the only one experiencing this?

rietta about 11 years ago

Wow, that's one billion more pages than there are stars in our Milky Way galaxy. That's a lot!