TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

The removal of “noindex” from the Internet Archive, and associated risks

3 points by fieryskiff11 almost 2 years ago

2 comments

634636346 almost 2 years ago
As one of the commenters in that thread said, this sounds like they were using the noindex feature to turn the IA into a personal private backup, thus abusing it and ruining it for everyone else. The IA is great as a personal public backup. (For example, I've deliberately submitted copies of certain OSS projects I've worked on to the Wayback Machine.)
fieryskiff11 almost 2 years ago
Full excerpt:

Surely long-time users of the Internet Archive know about the "noindex" parameter on items, which hides an item from the internal search engine while it remains on their servers. This is essential in cases where the uploader has uploaded content that is controversial in certain jurisdictions but doesn't violate US law in any way, and they fear that making it public in search engines will invite unnecessary hassles and liabilities for the Internet Archive. The Modi BBC documentary might be one tenuous example.

Besides, some people might be fine with their data being studied by archivists in the far future, say a hundred years or more from now, yet for privacy/security reasons they are not okay with it being exposed in the present or near future and prefer to see it noindexed. The alternative is a dichotomy in which an item is either fully public or doesn't exist at all, which in one way or another defeats the purpose of an "archive".

Here's one plausible example. A political activist wants their memories archived for posterity because they think it will be helpful for far-future archivists/researchers who study their career. However, they prefer to see it hidden but archived, because right now bad actors could use that info to doxx or harass them.

There was even an informal policy by the Internet Archive to noindex some YouTube videos for copyright-related reasons, per this.

However, they removed the function from casual usage around May or June, perhaps after the mass-scraping DDoS incident, and unhid most of the items that used to be noindexed.
Therefore I emailed the Internet Archive to ask for an explanation and to get them to reverse the decision, which I think is hare-brained.

After some back and forth, they finally came back with this reply, which I feel is totally ludicrous:

There is no bug or mistake in removing no-index settings for many Internet Archive items in the Community collection.

At no point was the Archive contacted to arrange a situation of no-indexing (or Darking) items with an intention of later release; the no-index setting was not documented for this use, and represented a security hole that was closed. Tens of thousands of items were found, being used for encrypted files hidden from the search engine, and represented a major problem, so many items have been removed or set noindex quickly.

A number of people have contacted us explaining situations where items might need to be made no-indexed, in a collection for later or timed release for example, but they've done it with communication and discussing their needs, not just uploading files under disposable accounts and then assuming the archive would keep them un-accessible in perpetuity. In some cases their requests have gotten arrangements so that community items that were noindex are noindex again, in separate collections.

A situation can theoretically exist where the original uploader can e-mail us from their e-mail address and discuss arrangements, but you've indicated you intentionally obfuscated your location and have disposed your addresses.
If you're able to gain access again, you can mail through those addresses.

An additional situation is you can e-mail info@archive.org if you want to report items at the archive (by identifier) that you believe might need to be removed from the archive; we receive a number of these requests throughout the months and respond according to policy.

It is as if the enshittification that plagued Reddit not long ago has now made landfall at the Internet Archive. The removal of "noindex" has, in my two cents, destroyed the chance to attain a delicate balance between preservation and privacy, meaning increased vulnerability to privacy- and copyright-related risks.

Perhaps, if the enshittification proves irreversible, there is Texas-based permanent.org, which may one day become a successor to the Internet Archive.

Here's some pseudocode for them in case they come to their senses:

noindex items if:
(
    items-noindexed-by-user-in-the-past = true
    OR items-noindexed-by-IA-in-the-past = true
)
AND (
    items-get-reindexed-voluntarily-by-USER-before-May-2023 = false
    AND items-get-reindexed-voluntarily-by-IA-before-May-2023 = false
)

Bonus:

Bit bloody arrogant isn't it, assuming you'll be of historical interest in a hundred years time and that IA ought to start keeping records of you now for that.

Bonus:

Some here said that hidden files are bad because they can be abused, but that doesn't mean legitimate uses of the function can be waved away with a wand; every coin has two sides. Thus it appears that there's a catch-22 situation here.

However, there's always the possibility of keeping a census of these hidden files in a separate area (whether internal or not, as a JSON file or something) if the Internet Archive wants them to remain accessible in the very long run.
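For anyone who wants to play with the idea, the pseudocode rule and the JSON census could be sketched roughly as below. This is only an illustration: the `Item` class and all of its field names are made up to mirror the pseudocode and do not correspond to any real Internet Archive schema or API. It also assumes the intent of the second condition is that neither the user nor the IA voluntarily reindexed the item before the May 2023 cutoff.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Item:
    # Hypothetical fields mirroring the pseudocode; not a real IA schema.
    identifier: str
    noindexed_by_user: bool = False
    noindexed_by_ia: bool = False
    reindexed_by_user_before_cutoff: bool = False
    reindexed_by_ia_before_cutoff: bool = False


def should_noindex(item: Item) -> bool:
    """True if the item was hidden in the past and nobody chose to unhide it."""
    was_hidden = item.noindexed_by_user or item.noindexed_by_ia
    voluntarily_reindexed = (
        item.reindexed_by_user_before_cutoff or item.reindexed_by_ia_before_cutoff
    )
    return was_hidden and not voluntarily_reindexed


def build_census(items: list[Item]) -> str:
    """Serialize the items that should stay hidden as a JSON census."""
    hidden = [asdict(item) for item in items if should_noindex(item)]
    return json.dumps({"hidden_items": hidden}, indent=2, sort_keys=True)


# Made-up identifiers, purely for demonstration.
items = [
    Item("example-item-a", noindexed_by_user=True),
    Item("example-item-b", noindexed_by_user=True, reindexed_by_user_before_cutoff=True),
    Item("example-item-c"),
]
census = json.loads(build_census(items))
hidden_ids = [entry["identifier"] for entry in census["hidden_items"]]
print(hidden_ids)  # ['example-item-a']
```

Only the first item stays hidden: the second was voluntarily reindexed by its uploader, and the third was never noindexed in the first place.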
Furthermore, it's a time sink for the IA employees if all noindex requests have to go through them manually every time; I bet they don't appreciate having to review these, which distracts them from more important pursuits such as scanning books.

send them an email and loop them in

In fact, they have been briefed, both implicitly and explicitly, on the relevant use-case scenarios through email, recently and previously. They at one point stated that the Internet Archive is a library, even though there are differences between the terms "library" and "archive".

Put simply, think about the Library of Congress and NARA. Public access to the latter's collections is normally harder than to the former's. IA already chose to die on the "archive" hill when they used that term to brand themselves, and it's not a stretch to describe it as enshittification making landfall when they try to deviate from the core purpose embodied by that term.

Finally, as a disclaimer, under penalty of perjury: I have in no way condoned or supported CSA behaviors, let alone uploaded those heinous items to the IA.

Bonus:

You're doing something nefarious. Otherwise you'd be more forthcoming about what you're uploading. The only way you're going to be able to recruit people to your cause is if you're more transparent.

Bonus:

Please keep the snarky BS elsewhere. All I've mainly uploaded to the IA are personal memories and creations, dating as far back as childhood. In the present context, one additional benefit of the noindex function is that greedy companies can't easily scrape the items and profit off them.

More than that, I admit that I have engaged in some political activism for Ukraine and Hong Kong, past and present, the latter of which has placed me on the radar of its notorious National Security Law.

You're being entitled.
Just because they used the word 'archive' it doesn't mean they're obliged to follow your interpretation of what that means.

If you want data preserved, pay for it. Don't expect IA to do it for you.

It's a non-profit, period, sustained through donations of any kind. After all, it has walked and behaved like a poor man's archive all along. So much for "Universal Access to All Knowledge" if you force everyone to pay in order to upload their files. Without question you would all turn sour if that logic were applied to the Wayback Machine.

Edit:

IA is not your personal cloud backup. They're not obliged to host it for you, privately or publicly. Being a political activist is neither here nor there.

Hmm. I must say that the argument quoted above is no longer unfamiliar, because such arguments are routinely used to stonewall criticism of unpopular changes by tech companies or platforms, most recently by spez and his supporters during the API protests. That being said, you should be careful next time you entertain it at all.

By the way, Reddit has now made photos in posts pretty much unarchivable.

Bonus:

It's strange to see long-term thinking unexpectedly reviled here. I suppose that many veteran archivists who would have fully understood such an idea have joined others in leaving Reddit in the wake of the unpopular API changes.

To understand what "long-term thinking" is, think of this: Let's say someone is a small-time actor. They might one day make a breakthrough in Hollywood or elsewhere, so much so that they become a household name. Their supporters, detractors and others will certainly have an overarching interest in seeing their works and deeds preserved, for scrutiny, commemoration, and any imaginable purpose.

The hypothetical actor was a believer in archiving and digital preservation before becoming famous, so they put a lot of items in the IA over the years.
While they are certainly okay with all their items being discovered and publicized by digital archaeologists a hundred years later or more, they would prefer that some of their items stay noindexed and undisturbed, perhaps for a hundred years or indefinitely, like the artifacts in the pyramids, since premature exposure would mean TMZ and other unwelcome harassment.

As can be seen in this video, the IA founder accepts the general idea of long-term thinking and preservation. It's no longer "arrogant" and "unfeasible" to expect that personal legacies be preserved at any level and degree according to the user's wishes. For now I'll end this long comment with this Twitter thread by the Long Now Foundation that explains why all "data hoarders" do what they do.