I've started reading: http://www.aosabook.org/en/distsys.html<p>His first example is an image hosting website. My mind went on a bit of a tangent and I got curious...<p>Do image hosting sites keep a hash of each image and store only one copy of identical images? I'm not sure how much effort it would take to guarantee a single copy of each image, but it seems like it would save some space.<p>Does anyone know?
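For concreteness, here is a minimal sketch of the kind of hash-based deduplication the question describes. It is a toy in-memory version under stated assumptions, not how any particular image host works: the class name `DedupImageStore`, its methods, and the choice of SHA-256 are all hypothetical choices made for illustration.

```python
import hashlib


class DedupImageStore:
    """Toy in-memory store that keeps one copy of each distinct byte string.

    Hypothetical illustration only; a real service would use a database
    and a blob store instead of Python dicts.
    """

    def __init__(self):
        self.blobs = {}      # SHA-256 hex digest -> image bytes (stored once)
        self.image_ids = {}  # public image ID -> digest (one entry per upload)
        self._next_id = 0

    def upload(self, data: bytes) -> int:
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this exact content has not been seen before.
        self.blobs.setdefault(digest, data)
        # Every upload still gets its own ID and its own index entry.
        image_id = self._next_id
        self._next_id += 1
        self.image_ids[image_id] = digest
        return image_id

    def fetch(self, image_id: int) -> bytes:
        return self.blobs[self.image_ids[image_id]]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = DedupImageStore()
first = store.upload(b"same bytes")
second = store.upload(b"same bytes")   # duplicate content, new ID
third = store.upload(b"other bytes")
print(len(store.image_ids), "uploads,", len(store.blobs), "stored blobs")
```

Two caveats on this sketch: a hash of the raw bytes only catches byte-identical uploads, so the same picture re-encoded or resized would hash differently, and deleting an image would need some form of reference counting so a shared blob is not removed while another ID still points at it.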
I don't think so. The average file size of a JPEG image on the web is what, 50KB? At those sizes, it's just not worth putting a system in place that could introduce more bugs through extra complexity. Especially since duplicate images would be relatively rare, as in 0.00001% rare. Completely not worth it. You'd still need a database entry for every upload too, so you're not even saving on the overhead, database size or queries.<p>Look at it this way: in the time it takes to code this feature, I'm pretty sure disk capacity would grow by more than the extra space you'd need for the duplicates (relatively speaking, over the long term of course).
The creator of Imgur recently did an AMA where someone asked exactly this question:<p>>do you hash and store only one copy of duplicate images?<p>>Believe it or not, we don't. All the images only use up about 3TB of storage space, so it's not really a big issue.<p>Source: <a href="http://www.reddit.com/r/IAmA/comments/y81ju/i_created_imgur_ama/?utm_source=dlvr.it&utm_medium=feed" rel="nofollow">http://www.reddit.com/r/IAmA/comments/y81ju/i_created_imgur_...</a><p>On the other hand, YouTube stores 76 PB: <a href="http://www.afshispeaks.com/2012/08/youtube-storage-costs-per-year/" rel="nofollow">http://www.afshispeaks.com/2012/08/youtube-storage-costs-per...</a>
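To put rough numbers on the figures quoted in this thread, here is a back-of-the-envelope sketch. It assumes decimal units, combines the "about 3TB" total with the 50KB average-size guess from the earlier reply, and tries a few duplicate rates; none of these are measured values, and the function name `estimated_savings` is made up for the example.

```python
# Assumed figures, taken from the thread rather than measured:
TOTAL_BYTES = 3 * 10**12      # "about 3TB" (decimal terabytes assumed)
AVG_IMAGE_BYTES = 50 * 10**3  # the 50KB guess for an average JPEG


def estimated_savings(total_bytes, avg_image_bytes, duplicate_fraction):
    """Return (image_count, bytes_saved) if that fraction of images were duplicates."""
    image_count = total_bytes // avg_image_bytes
    bytes_saved = image_count * duplicate_fraction * avg_image_bytes
    return image_count, bytes_saved


# Duplicate rates to try: the 0.00001% guess from the thread, then 1% and 10%.
for fraction in (1e-7, 0.01, 0.10):
    count, saved = estimated_savings(TOTAL_BYTES, AVG_IMAGE_BYTES, fraction)
    print(f"{fraction:.5%} duplicates of {count:,} images: ~{saved / 1e9:.4f} GB saved")
```

Under these assumptions the store holds roughly 60 million images, and even a 1% duplicate rate would save only about 30 GB, which fits the "not really a big issue" answer at that scale.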