Archivists work to save disappearing data.gov datasets

801 points by johnneville 3 months ago

25 comments

JackC 3 months ago

I'm quoted in this article. Happy to discuss what we're working on at the Library Innovation Lab if anyone has questions.

There's lots of people making copies of things right now, which is great -- Lots Of Copies Keeps Stuff Safe. It's your data, why not have a copy?

One thing I think we can contribute here as an institution is timestamping and provenance. Our copy of data.gov is made with https://github.com/harvard-lil/bag-nabit , which extends the BagIt format to sign archives with email/domain/document certificates. That way (once we have a public endpoint) you can make your own copy with rclone, pass it around, but still verify it hasn't been modified since we made it.

Some open questions we'd love help on --

* One is that it's hard to tell what's disappearing and what's just moving. If you do a raw comparison of snapshots, there's things like 2011-glass-buttes-exploration-and-drilling-535cf being replaced by 2011-glass-buttes-exploration-and-drilling-236cf, but it's still exactly the same data; it's a rename rather than a delete and add. We need some data munging to work out what's actually changing.

* Another is how to find the most valuable things to preserve that aren't directly linked from the catalog. If a data.gov entry links to a csv, we have it. If it links to an html landing page, we have the landing page. It would be great to do some analysis to figure out the most valuable stuff behind the landing pages.
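
One way to approach the rename-versus-delete question above is to compare snapshots by content rather than by slug. The sketch below is only an illustration under stated assumptions, not the Library Innovation Lab's method: it assumes each snapshot has already been reduced to a hypothetical mapping from dataset slug to a content fingerprint (for example, a hash of the dataset's metadata), and `classify_changes` is a made-up name.

```python
def classify_changes(old, new):
    """Compare two snapshots given as {slug: fingerprint} dicts (hypothetical format).

    A slug that vanished but whose fingerprint reappears under a new slug
    is reported as a rename instead of a delete plus an add.
    """
    removed = set(old) - set(new)
    added = set(new) - set(old)

    # Index newly added slugs by fingerprint so removed slugs can find a match.
    added_by_fingerprint = {}
    for slug in sorted(added):
        added_by_fingerprint.setdefault(new[slug], []).append(slug)

    renamed, deleted = [], []
    for slug in sorted(removed):
        candidates = added_by_fingerprint.get(old[slug])
        if candidates:
            # Same content under a different slug: treat as a rename.
            renamed.append((slug, candidates.pop(0)))
        else:
            deleted.append(slug)

    matched = {new_slug for _, new_slug in renamed}
    created = sorted(added - matched)
    # Slugs present in both snapshots whose content changed.
    modified = sorted(s for s in set(old) & set(new) if old[s] != new[s])
    return {"renamed": renamed, "deleted": deleted,
            "created": created, "modified": modified}
```

Matching on a fingerprint avoids guessing at the structure of the slug suffix, but it only catches renames where the content is byte-for-byte (or field-for-field) identical; anything fuzzier would need a similarity measure on top.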

0n0n0m0uz 3 months ago

One of the USA's greatest strengths is the almost unprecedented degree of transparency of government records going back decades. We can actually see the true facts, including when our government has lied to us or covered things up. Many other nations do not have this luxury, and it has provided the evidentiary basis for both legal cases and "progress" in general. Not surprising that authoritarians would target and destroy data, as it makes their objective of a post-truth society that much easier.

chrishoyle 3 months ago

Beyond federal websites (.gov, .mil) there are a lot of gov contractor websites being taken down (presumably at the demand of agencies) that contain a wealth of information and years of project research.

Some examples below, from contractors that work with USAID:

- https://www.edu-links.org/ (taken down)
- https://www.genderlinks.org/ (taken down)
- https://usaidlearninglab.org/ (taken down)
- https://agrilinks.org/ (presumably at risk)
- https://www.climatelinks.org/ (presumably at risk)
- https://biodiversitylinks.org/ (presumably at risk)

cle 3 months ago

I’ve been archiving data.gov for over a year now and it’s not unusual to see large fluctuations on the order of hundreds or thousands of datasets. I’ve never bothered trying to figure out what exactly is changing, maybe I should build a tool for that…

jl6 3 months ago

> The outlet reports that deleted datasets "disproportionately" come from environmental science agencies like the Department of Energy, National Oceanic and Atmospheric Administration (NOAA), and the Environmental Protection Agency (EPA).

Was there an EO targeting these areas?

dang 3 months ago

Related ongoing thread:

CDC data are disappearing - https://news.ycombinator.com/item?id=42897696 - Feb 2025 (216 comments)

eh_why_not 3 months ago

What's a good way to be an "Archivist" on a low budget these days?

Say you have a few TBs of disk space, and you're willing to capture some public datasets (or parts of them) that interest you, and publish them in a friendly jurisdiction - keyed by their MD5/SHA1 - or make them available upon request. I.e. be part of a large open-source storage network, but only for objects/datasets you're willing to store (so there are no illegal shenanigans).

Is this a use case for Torrents? What's the most suitable architecture available today for this?
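
The "keyed by their MD5/SHA1" idea above is essentially content-addressed storage. A minimal sketch, assuming a flat local directory as the store: the function names and layout are hypothetical, and SHA-256 is swapped in for MD5/SHA1 because both of those have known collision attacks.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(path, store_dir):
    """Copy a file into store_dir under its own digest and return that digest.

    Identical files map to the same name, so storing a duplicate is a no-op.
    """
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    digest = file_digest(path)
    target = store_dir / digest
    if not target.exists():
        shutil.copyfile(path, target)
    return digest
```

Anyone holding the digest can then fetch the object from any mirror and recompute the hash to confirm they received the same bytes, which is the same property torrents rely on for their pieces.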

crowcroft 3 months ago

Still, even with best efforts this is such a shame. There is always going to be a question around governance over the data, integrity, and potentially chain of custody as well. If the goal is to muddy the waters and create a narrative that whatever might be in this data isn't reliable or accurate, then mission accomplished. I don't see how anything can stop that.

Not to say the data isn't incredibly valuable and shouldn't be preserved for many other reasons, of course. All the best to anyone archiving this, this is important work.

chrishoyle 3 months ago

Related ongoing discussion:

The government information crisis is bigger than you think it is - https://news.ycombinator.com/item?id=42895331

debeloo 3 months ago

Is this normal when there's a change in presidency?

smrtinsert 3 months ago

Are datasets mirrored anywhere the govt doesn't automatically have takedown authority? If not, there should be a mirroring effort.

sunk1st 3 months ago

I don’t see a list of the datasets that have gone missing. Is there a list?

derektank 3 months ago

Does anyone know if the St. Louis Federal Reserve (and I guess the Federal Reserve banks generally) is subject to presidential executive orders, or is it answerable only to the Federal Reserve Board and the St. Louis Bank president? FRED is the only dataset I access regularly.

generalizations 3 months ago

Do we know what datasets these are? Do we actually have a diff here so we know what's been removed? There's a lot of assumptions being thrown around here, but we don't even know if this is some kind of malicious compliance. An actual list of what's been removed would probably clear the air a lot.

As one of the reddit comments (in the thread linked by the article) pointed out,

> During the start of Biden’s term, On 6th feb data.gov had “218,384 DATASETS” but on 7th feb it only had “192,180 DATASETS”

choobacker 3 months ago

It's impressive that volunteers are stepping up to archive this. I understand the desire to keep this open data available.

How much of this sort of effort results in that data being used? Are there success stories for these datasets being discoverable enough and useful to others?

andyjohnson0 3 months ago

If the intention is to restore these data sets at some future date, when sanity has possibly been restored, then there needs to be a way to demonstrate that the archived data hasn't itself been modified. Without that, malign actors (e.g. oil/gas lobby) could very easily poison the future.
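
One common building block for demonstrating that is a checksum manifest recorded at archive time and re-checked later. The sketch below is a minimal illustration, assuming a hypothetical manifest of relative path to SHA-256 digest; `verify_archive` is an invented name. It only shows that files match the manifest, so the manifest itself still has to be signed or timestamped by a party people trust.

```python
import hashlib
from pathlib import Path


def verify_archive(root, manifest):
    """Check files under root against a {relative_path: sha256_hex} manifest.

    Returns a list of (relative_path, problem) tuples; an empty list means
    every listed file is present and unchanged.
    """
    root = Path(root)
    problems = []
    for relative_path, expected in sorted(manifest.items()):
        candidate = root / relative_path
        if not candidate.is_file():
            problems.append((relative_path, "missing"))
            continue
        # Reads the whole file at once; fine for a sketch, chunk it for large data.
        actual = hashlib.sha256(candidate.read_bytes()).hexdigest()
        if actual != expected:
            problems.append((relative_path, "modified"))
    return problems
```

Note that this does not flag extra files that were added to the archive after the manifest was written; a stricter check would also compare the set of paths on disk against the manifest keys.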

liontwist 3 months ago

I think people are interested in archiving and the political image associated with it, but I don’t think anybody cares about the content. Who is going to go back and read Biden-era agency publications?

downrightmike 3 months ago

Already seeing: 404 Not Found: Requested route ('ed-public-download.app.cloud.gov') does not exist.

pluto_modadic 3 months ago

Don't they have to have done this /before/ it gets deleted?

ThinkBeat 3 months ago

I hope volunteers and others are able to save as much as possible of the data.

Removing and altering information and data is one of the fundamental threats in our digital world.

It probably makes the most sense to do this on a daily basis. If something new is published, grab it as soon as possible.

Data can also be redacted or altered for a variety of reasons; being able to see the before and after states can be illuminating.

Something I feel is missing here is statistics for each administration.

Does this only happen under a Trump administration, or does it happen to a smaller or larger extent under other administrations?

I don't know how far back this federal data goes, so it might not be easy.

bawolff 3 months ago

Tbh, I'm kind of surprised these things weren't being archived as they were being published. Trump is an extreme case, but it's not the first time a change in administration resulted in removing old websites.

notavalleyman 3 months ago

I read, in past days, that the man who ordered the construction of the nearly infinite Wall of China was that First Emperor, Shih Huang Ti, who likewise ordered the burning of all the books before him. That the two gigantic operations - the five or six hundred leagues of stone to oppose the barbarians, the rigorous abolition of history, that is of the past - issued from one person and were in a certain sense his attributes, inexplicably satisfied me and, at the same time, disturbed me.

- Borges

honestSysAdmin 3 months ago

Let's make torrents and seed them.

exe34 3 months ago

First week we had mass deportation, second week we've heard of the building of concentration camps for undesirables, and now the modern version of book burning. There's something different about this republican government.

strictnein 3 months ago

99.9% of commenters here seem to have missed this:

> For example, in the days after Joe Biden was inaugurated, data.gov showed about 1,000 datasets being deleted as compared to a day before his inauguration

It's almost like this stuff happens regularly. If <insert Dem savior> wins in 2028, tons of government websites will also change in the first couple weeks of their presidency. Is it because they're a fascist dictator? Or is it because those websites reflect the administration's viewpoints on issues?

Wish people would take a deep breath and step back and think a little more. I despise Trump, but there's crying wolf and then there's the current state of media and online discourse. Trump thrives in this type of environment. He purposefully fosters it. Playing gotcha with him doesn't work because he doesn't care.
