And not one of those hits is from Quora due to their robots.txt:<p><a href="https://web.archive.org/web/http://www.quora.com/" rel="nofollow">https://web.archive.org/web/http://www.quora.com/</a><p>Good job Quora, preserving all that crowd sourced content away from the crowd, keeping it from everyone not logged in. Hats off to getting into YC so you can post your job openings on the HN home page and get some press. This doesn't add anything to your image though, it just takes a little away from YC.<p>On another better note, a great big thank you to the wayback machine for all of the public good it does. Now there's an organization that is amazing and wonderful and enriching our lives in an open and honest way with information.
PSA/ranty thing: Just because something is archived in the Wayback machine, do not trust that archive.org will keep it there for all time. If you need something, make a local copy! A few months ago TIA changed their stance on robots.txt. They now <i>retroactively</i> honor robot blocks. Now any site can completely vanish from the archives.<p>Let's say I died tomorrow. My family lets my domain slip. A squatter buys it and throws up a stock landing page, with a robots.txt that forbids spidering. TIA would delete my entire site from their index.<p>I've already lost a few good sites to this sort of thing. If you depend on a resource, archive it yourself.<p>edit - Official policy: <a href="https://archive.org/about/exclude.php" rel="nofollow">https://archive.org/about/exclude.php</a><p>If I am reading it properly, once blocked they never check later in case of a change of heart? No procedure for getting re-indexed at all?
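To make the squatter scenario above concrete, here is a minimal sketch of how a blanket robots.txt looks to a crawler, using Python's standard `urllib.robotparser`. The `ia_archiver` user-agent string, the example.com URL, and both robots.txt bodies are illustrative assumptions. This only shows how the file is parsed, not how archive.org actually applies its exclusion policy.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt a squatter's landing page might serve.
SQUATTER_ROBOTS = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that allows everything (empty Disallow).
OPEN_ROBOTS = """\
User-agent: *
Disallow:
"""

if __name__ == "__main__":
    url = "http://example.com/old-post.html"  # placeholder URL
    for name, body in [("squatter", SQUATTER_ROBOTS), ("open", OPEN_ROBOTS)]:
        print(name, "blocks crawler:", is_blocked(body, "ia_archiver", url))
```

Running the same check against the live robots.txt of a site you care about is a cheap early warning that it is time to make a local copy.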
I'd like to rave about an underappreciated but absolutely brilliant piece of the Internet Archive's infrastructure: its book reader (called, I gather, "BookReader").<p>TIA includes copious media archives including video, audio, and books. The latter are based on full-image scans and can be read online.<p>I generally dislike full-format reading tools: Adobe Acrobat, xpdf, evince, and other PDF readers all have various frustrations. Google's own online book reader is a mass of Web and UI frustrations.<p>I'm a guy who almost <i>always</i> prefers local to Web-based apps.<p>TIA's book reader is the best I've seen anywhere, hands down.<p>It's fast, it's responsive. The UI gets out of the way. Find your text and hit "fullscreen". Hit 'F11' in your browser to maximize it; you can then dismiss the (subtle) UI controls off the page and you are now ... reading your book. Just the book. No additional crap.<p>Page turn is fast. Zoomed, the view seems to autocrop to the significant text on the page. Unlike every last damned desktop client, the book stays in the same position on the screen as you navigate forward or backward through it. Evince, by contrast, will turn a page and then position it with the top left corner aligned. You've got to. Reposition. Every. Damned. Page.
Drives me insane (but hey, it's a short trip).<p>You can seek rapidly through the text with the bottom slider navigation.<p>About the only additions I could think of would be some sort of temporary bookmark or the ability to flip rapidly between sections of a book (I like to read and follow up on footnotes and references, which often requires skipping between sections of a text).<p>Screenshot: <a href="http://i.imgur.com/Reg8KLB.png" rel="nofollow">http://i.imgur.com/Reg8KLB.png</a><p>Source: <a href="http://archive.org/stream/industrialrevol00toyngoog#page/n6/mode/2up" rel="nofollow">http://archive.org/stream/industrialrevol00toyngoog#page/n6/...</a><p>But, to whoever at TIA was responsible for this, thank you. From a grumpy old man who finds far too much online to be grumpy about, this is really a delight.<p>This appears to be an informational page with more links (including sources):<p><a href="https://openlibrary.org/dev/docs/bookreader" rel="nofollow">https://openlibrary.org/dev/docs/bookreader</a>
Have there been any High Scalability articles on their infrastructure? We have a similar need: storing a large volume of text-based content over a period of time, with versioning as well. On top of that we have various metadata. We're currently storing everything in MySQL -- a lightweight metadata row and a separate compressed table for the large (~400KB on average) BLOB fields.<p>We're looking at ways to improve our architecture: simply bigger+faster hardware? Riak with LevelDB as a backend? Filesystem storage with a database for the metadata? We even considered using version control such as git or hg, but that proved to be far too slow for reads compared to a PK database row lookup.<p>Any HN'ers have suggestions?
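One of the options floated above, filesystem storage with a database for the metadata, might be sketched roughly as follows. This is only an illustration under stated assumptions: SQLite stands in for MySQL, blobs are zlib-compressed and named by their SHA-256 so identical versions are stored once, and every name here (`open_store`, `put_version`, `get_version`, the `versions` table) is hypothetical rather than anything the Internet Archive or the poster actually uses.

```python
import hashlib
import sqlite3
import zlib
from pathlib import Path


def open_store(db_path: str) -> sqlite3.Connection:
    """Open (or create) the metadata database: one small row per version."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS versions ("
        " doc_id TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " sha256 TEXT NOT NULL,"
        " PRIMARY KEY (doc_id, version))"
    )
    return conn


def blob_path(blob_dir: Path, digest: str) -> Path:
    # Fan out by the first two hex characters to keep directories small.
    return blob_dir / digest[:2] / digest


def put_version(conn: sqlite3.Connection, blob_dir: Path,
                doc_id: str, text: str) -> int:
    """Store text as the next version of doc_id; return the version number."""
    data = text.encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    path = blob_path(blob_dir, digest)
    if not path.exists():  # identical content is written only once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(data))
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM versions WHERE doc_id = ?",
        (doc_id,),
    ).fetchone()
    version = row[0] + 1
    conn.execute(
        "INSERT INTO versions (doc_id, version, sha256) VALUES (?, ?, ?)",
        (doc_id, version, digest),
    )
    conn.commit()
    return version


def get_version(conn: sqlite3.Connection, blob_dir: Path,
                doc_id: str, version: int) -> str:
    """Primary-key lookup for the digest, then a single file read."""
    row = conn.execute(
        "SELECT sha256 FROM versions WHERE doc_id = ? AND version = ?",
        (doc_id, version),
    ).fetchone()
    if row is None:
        raise KeyError((doc_id, version))
    raw = blob_path(blob_dir, row[0]).read_bytes()
    return zlib.decompress(raw).decode("utf-8")
```

A read stays one primary-key lookup plus one file read, which is the property the git/hg experiment reportedly lacked. Whether this beats a compressed BLOB table at ~400KB per document would need measuring.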
If you're looking to donate, they take bitcoin too!
<a href="https://archive.org/donate/index.php" rel="nofollow">https://archive.org/donate/index.php</a>
A little known fact is that there is a mirror of the Wayback Machine hosted by The Bibliotheca Alexandrina:<p><a href="http://www.bibalex.org/isis/frontend/archive/archive_web.aspx" rel="nofollow">http://www.bibalex.org/isis/frontend/archive/archive_web.asp...</a><p>I have sometimes had luck retrieving pages from this mirror that were unavailable (or returned errors) on the main site.
Awesome! It's a great tool to go back in time to check out our past websites full of blinking gifs and whatnot.<p>I didn't know that they also maintain the "HTTP Archive", showing website latency over time as well as some interesting live statistics: <a href="http://httparchive.org/" rel="nofollow">http://httparchive.org/</a>
Can anyone explain to me how displaying those sites on demand is not copyright infringement? I'm seriously curious, I don't know much about copyright laws.
Funny story about the Wayback Machine and how it helped me. I had let my blog fall into disrepair for a couple of months, and when I eventually went back to it, I found that since I hadn't kept up with security updates, I wasn't able to access any of my old posts.<p>When I went back to start writing again (this time using paid hosting so I didn't have to deal with that), I was disappointed that I wasn't going to have the ~20-30 posts I'd had before. On a hunch, I checked the Wayback Machine and found that it had archived about 15 of my posts! Very excited that I could restore some of my previous writings.
> Before there was Borat, there was Mahir Cagri. This site and the track it inspired on mp3.com created quite a stir in the IDM world, with people claiming that “Mahir Cagri” was Turkish for “Effects Twin” and that the whole thing was an elaborate ruse by Richard D. James (Aphex Twin). (Captured December 29, 2004 and December 7, 2000)<p>Okay this just blew my mind. Anyone else follow Aphex Twin's various shenanigans? Was this ever investigated further?
Cool, but on a lot of sites (including some of my own, from 10+ years ago to recently) it gets hardly any of the images. Am I the only one experiencing this?