Hi HN,<p>I just noticed that one of my projects [1], which constantly crawls websites, is still taking a screenshot of each site it crawls. I had completely forgotten about the screenshots, as I no longer use them.<p>In total there are around 1,790,000 full-page screenshots of sites that were posted on Reddit, Hacker News, in tweets, and in financial news since Jan 2014.<p>Don't ask me to open-source them or make them available for download. I dare not get involved in any licensing issues or whatever.<p>They are in an S3 bucket. I just got a bill from Amazon...<p>If you'd like to have them, or you have any idea what I can do with them other than deleting them, contact me at thomas@newscombinator.com<p>Thomas<p>[1] http://www.newscombinator.com
This sort of data falls under the category of "I might need this someday but I can't figure out why".<p>For the reasons you can't think of yet, you might consider indexing them locally and then moving them to Glacier. The odds of you needing to retrieve every single one of them (which is what would make Glacier cost-prohibitive) are far lower than the odds of you needing one at random.<p>I haven't done the math on how many months of S3 hosting it takes to equal the one-time cost of uploading to Glacier, primarily because I don't know how big 1,790,000 screenshots are.<p>Alternatively, provided downloading them all to your local desktop doesn't run your S3 bill to Mars, tape drives can still be a cost-effective way to store a LOT of data.
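For anyone who wants to do that math, here is a minimal sketch of the break-even calculation. Every number in the example call is a made-up placeholder, not a real AWS price and not the real average screenshot size; the function and parameter names are hypothetical too. Plug in the current figures from the AWS pricing page and the actual bucket size.

```python
import math


def breakeven_months(num_objects, avg_object_mb,
                     s3_price_gb_month, glacier_price_gb_month,
                     request_price_per_1000):
    """Months of storage savings needed to pay back a one-time move to Glacier.

    All prices are inputs supplied by the caller; nothing here is a real quote.
    """
    total_gb = num_objects * avg_object_mb / 1024
    # One-time cost: a per-request fee for moving each object.
    one_time_cost = num_objects / 1000 * request_price_per_1000
    # Recurring saving: the difference in monthly storage price.
    monthly_saving = total_gb * (s3_price_gb_month - glacier_price_gb_month)
    if monthly_saving <= 0:
        return math.inf  # Glacier is not cheaper, so the move never pays off
    return one_time_cost / monthly_saving


# Placeholder inputs only: 1 MB per screenshot and all prices are assumptions.
months = breakeven_months(
    num_objects=1_790_000,
    avg_object_mb=1.0,
    s3_price_gb_month=0.03,
    glacier_price_gb_month=0.01,
    request_price_per_1000=0.05,
)
print(f"break-even after about {months:.1f} months")
```

This ignores retrieval fees and any minimum storage duration, so treat it as a lower bound on the payback time rather than a full cost model.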
> I dare not get involved in any licensing issues or whatever<p>You might be covered under the DMCA: <a href="https://en.wikipedia.org/wiki/Dmca" rel="nofollow">https://en.wikipedia.org/wiki/Dmca</a><p>And since they're screenshots, some might be considered fair use: <a href="https://en.wikipedia.org/wiki/Fair_use" rel="nofollow">https://en.wikipedia.org/wiki/Fair_use</a><p>I say <i>some</i> because others would be blatant copies of logos, etc.
Sounds cool, and it's more like 1.8M.<p>Sadly, I don't think they'd be of much use to anyone, since Archive.org already takes care of archiving.<p>The only time I can see screenshots coming in handy is for live previews.