One of the most incredible parts is that they've already run feature detection on all 100M images/videos and extracted 50TB of:<p>"SIFT, GIST, Auto Color Correlogram, Gabor Features, CEDD, Color Layout, Edge Histogram, FCTH, Fuzzy Opponent Histogram, Joint Histogram, Kaldi Features, MFCC, SACC_Pitch, and Tonality"<p>The good part about this for researchers is not only that this saves dozens of CPU-years of computation (back of the envelope, it would take 15 years for my laptop to extract those SIFT features alone), but that any differences in learning/recognition performance on the dataset can be attributed to the algorithms in question, uncomplicated by which researcher engineered the best features for the dataset. On the other hand, it's a challenging dataset to work with because you can't just download it and process it locally as has been traditionally done. I'll be interested to see how many take advantage of it.
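For what it's worth, here's a rough sketch of that back-of-envelope estimate (the per-image SIFT cost is my own assumption, not a measured number):

```python
# Rough back-of-envelope: how long would SIFT extraction over the full
# dataset take on a single laptop? The ~5 s/image figure is assumed.
images = 100_000_000               # ~100M photos/videos in the dataset
seconds_per_image = 5              # assumed SIFT cost per image on one laptop
total_seconds = images * seconds_per_image
years = total_seconds / (60 * 60 * 24 * 365)
print(f"{years:.1f} years")        # ~15.9 years, in line with the ~15-year guess
```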
It seems like Yahoo is a little bit worried about possible exploitation. From the Terms of Use:<p><i>2.3. You may derive and publish summaries, analyses and interpretations of the Data, but only in a manner where it is impossible to reconstruct the Data from the publication. Small excerpts of the Data may be displayed to others or published in a scientific or technical context, solely for the purpose of describing your research and related issues and not for any commercial or anti-competitive purpose. Unless Yahoo! expressly requests no attribution, all publications resulting from research carried out using the Data must display an attribution to Yahoo!. This attribution must reference &quot;Yahoo! Webscope,” the web address <a href="http://webscope.sandbox.yahoo.com" rel="nofollow">http://webscope.sandbox.yahoo.com</a>, and the name of the specific dataset used, including version number, if applicable. This attribution should preferably appear among the bibliographic citations in the publication. If Yahoo! expressly requests no attribution, you agree not to mention Yahoo! in connection with the Data. Yahoo! invites you to provide a copy your publication to Yahoo!.</i><p>This[0] seems fairly restrictive, considering that I could just crawl Flickr and get all that data and more, were I so inclined. Also kinda interesting, in this passage and the rest of the TOU: they repeatedly use `&quot;` interchangeably with actual quotation marks ("), suggesting that <i>nobody at Yahoo has proofread their own live TOU</i>. Still, the dataset seems really cool.<p>[0] ...and other parts of the agreement, but I don't want to spoil it for you, nor post its entirety as a comment.
"Yahoo is hosting a contest to build the system best capable of identifying where a photo or video was taken without using geographic coordinates."<p>Does this strike anyone else as being a bad idea?
"From the old world of unprocessed rolls of C-41 sitting in a fridge 20 years ago"<p>Hey I still do that! :(<p>I wonder if my (or anyone's) film photos on Flickr are completely useless metadata-wise. Because they are all scanned so they just say "NORITSU KOKI EZ Controller". There seems to be a large portion of people (on Flickr) shooting film still but I wonder if it's only a small percentage overall.
Just when I was happy using Flickr's API for Creative Commons image search - <a href="http://www.outreachpanel.com/free-images/" rel="nofollow">http://www.outreachpanel.com/free-images/</a><p>Now they've given me this huge dataset to play with :)<p>In the past, I've had issues with CC-licensed images that were also tagged 'getty'. I hope they've taken care of that issue.
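For reference, this is roughly the kind of CC-filtered search I mean (a sketch only: you'd need your own API key, and the license ids should be double-checked against Flickr's docs):

```python
import requests

API_KEY = "YOUR_FLICKR_API_KEY"    # placeholder, not a real key

params = {
    "method": "flickr.photos.search",
    "api_key": API_KEY,
    "text": "sunset",
    "license": "4,5,6",            # Creative Commons license ids (verify in the docs)
    "format": "json",
    "nojsoncallback": 1,
    "per_page": 25,
}
resp = requests.get("https://api.flickr.com/services/rest/", params=params)
for photo in resp.json()["photos"]["photo"]:
    print(photo["id"], photo["title"])
```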