I'm quoted in this article. Happy to discuss what we're working on at the Library Innovation Lab if anyone has questions.<p>There are lots of people making copies of things right now, which is great: Lots Of Copies Keeps Stuff Safe. It's your data, why not have a copy?<p>One thing I think we can contribute here as an institution is timestamping and provenance. Our copy of data.gov is made with <a href="https://github.com/harvard-lil/bag-nabit">https://github.com/harvard-lil/bag-nabit</a>, which extends the BagIt format to sign archives with email/domain/document certificates. That way (once we have a public endpoint) you can make your own copy with rclone and pass it around, but still verify it hasn't been modified since we made it.<p>Some open questions we'd love help on:<p>* One is that it's hard to tell what's disappearing and what's just moving. If you do a raw comparison of snapshots, there are things like 2011-glass-buttes-exploration-and-drilling-535cf being replaced by 2011-glass-buttes-exploration-and-drilling-236cf, but it's still exactly the same data; it's a rename rather than a delete and add. We need some data munging to work out what's actually changing.<p>* Another is how to find the most valuable things to preserve that <i>aren't</i> directly linked from the catalog. If a data.gov entry links to a CSV, we have it. If it links to an HTML landing page, we have the landing page. It would be great to do some analysis to figure out the most valuable stuff behind the landing pages.
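<p>On the rename question, here's a rough sketch of the kind of munging I mean, not what we actually run. It assumes each snapshot has already been reduced to a dict of slug to content hash (that shape, the hash values, and the function name diff_snapshots are all made up for illustration), and it treats a slug that vanished as a rename whenever a newly appeared slug carries the same hash.

```python
from collections import defaultdict


def diff_snapshots(old, new):
    """Classify changes between two snapshots.

    old, new: dict mapping dataset slug -> content hash.
    (Hypothetical shape; a real snapshot would need to be reduced to this.)
    Returns (renamed, deleted, added).
    """
    gone = set(old) - set(new)
    appeared = set(new) - set(old)

    # Index the newly appeared slugs by their content hash.
    by_hash = defaultdict(list)
    for slug in sorted(appeared):
        by_hash[new[slug]].append(slug)

    renamed, deleted = [], []
    for slug in sorted(gone):
        candidates = by_hash.get(old[slug])
        if candidates:
            # Same content under a new slug: a rename, not a delete + add.
            renamed.append((slug, candidates.pop(0)))
        else:
            deleted.append(slug)

    # Whatever was not claimed by a rename is genuinely new.
    added = sorted(s for slugs in by_hash.values() for s in slugs)
    return renamed, deleted, added


if __name__ == "__main__":
    # Hash values here are placeholders, not real digests.
    old = {
        "2011-glass-buttes-exploration-and-drilling-535cf": "hash-a",
        "some-removed-dataset": "hash-b",
    }
    new = {
        "2011-glass-buttes-exploration-and-drilling-236cf": "hash-a",
        "some-new-dataset": "hash-c",
    }
    print(diff_snapshots(old, new))
```

The hard part is deciding what goes into the hash. It would presumably have to leave out the slug and any per-record IDs or timestamps that change on re-publication, otherwise a pure rename still looks like a delete plus an add.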