
Ask HN: Any ideas on how to thumbnail 2M images?

24 points by phprecovery over 10 years ago
Hi,

I'm a developer at the New York Public Library and we're currently evaluating ways to create derivatives (i.e. thumbnails) from a library of 2 million master images.

At 5 seconds an image using a tool like ImageMagick (which might be optimistic), it will take 115 machine days.

Any suggestions or tips to speed up the process?

Edit: Original images are scanned TIFFs, 25-30MB each.
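For anyone who wants to sanity-check the 5-second estimate, a minimal benchmark along these lines (assuming ImageMagick's convert is on the PATH; sample.tif is a placeholder filename) is a reasonable first step:

    # Time one TIFF-to-thumbnail conversion; -thumbnail also strips profiles
    time convert sample.tif -thumbnail 150x150 sample-thumb.jpg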

22 comments

samptemp over 10 years ago
I'm kinda a little disappointed at the answers provided here.

1) I downloaded a 49.6 MB TIFF file from here: http://hubblesite.org/newscenter/archive/releases/2004/32/image/d/warn/

2) From the OSX terminal:

    $ time sips -Z 150 hs-2004-32-d-full_tif.tif

Resulting file size is 150x150px and 32KB.

Time to run process: 1.48s user 0.15s system 98% cpu 1.663 total

The superpower computer required for this incredible speed is a 3 year old MacBook Pro (with SSD) ;-)

UPDATE:

I created 10 copies of the image. Total size: 520.1 MB

    $ time sips -Z 150 *.tif

Time to process 10 images: 14.98s user 1.44s system 94% cpu 17.384 total (approximately 1.498s per image).

2,000,000 images * 1.663 seconds = 38 days.

Note: This is using larger images (49.6MB vs the 25-30MB images you have) and is using a single MacBook Pro. Divide amongst a few machines and be done in a week. GraphicsMagick could possibly be even faster.

ANOTHER UPDATE:

Since your files are smaller (20-30MB), I found a 30MB jpg sample here: http://sto-rvlt-01.sys.comcast.net/speedtest/random4000x4000.jpg

Time to process was much faster: 0.37s user 0.04s system 97% cpu 0.418 total

2,000,000 images * 0.418 seconds = 9.67 days

At that speed we are talking under 10 days. On a single 3 year old MacBook Pro. Using built-in software. Without any optimizations. Find 5 computers in the NY Public Library and you'll be done in 2 days.

I am not saying this is the fastest solution. I posted this because it appears that people are over-engineering this problem or proposing solutions which will cost a lot of money (Amazon, bandwidth, shipping hard drive fees, etc).
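To spread the same trick across cores, one hedged sketch with xargs (sips is macOS-only; paths are placeholders, and note sips resizes in place unless --out is given):

    # Fan sips out over 8 parallel workers, writing results into ./thumbs
    mkdir -p thumbs
    ls *.tif | xargs -P 8 -I {} sips -Z 150 {} --out thumbs/{}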
vitovito over 10 years ago
The NY Times did this using Amazon Web Services to process all their TIFFs (into PDFs, but thumbnailing isn't really different). They uploaded 4TB of data to Amazon over the internet, but today you could just send them a hard drive and they'll copy it onto S3 for you.

NYT: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?_php=true&_type=blogs&_r=0 and http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/

AWS Import/Export: http://aws.amazon.com/importexport/
jcanyc over 10 years ago
First consider thumbnailing these on the fly as they're requested by the user. But if you must mass-convert them, this job must be parallelized among many compute nodes; overly tuning the processing of each conversion probably won't be very fruitful.

You probably want to move these images to AWS S3, run compute jobs, and upload the resulting images back up to S3. You could create AWS Simple Queue Service messages with the S3 URLs of each of the images, pop messages, and autoscale EC2 instances based on the depth of that queue. What's the plan for these files after they're processed?

I am local, have deep AWS experience, and have next week off if you'd like some pro bono advice.

Rough costs: S3 for 50TB is ~$1,500/month; SQS is ~$0.50/million messages.
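A rough sketch of what one such worker's loop body could look like with the AWS CLI and jq (the queue URL and bucket names are placeholders; retries and error handling are omitted):

    # Pop one message (an S3 key), thumbnail it, upload, delete the message
    MSG=$(aws sqs receive-message --queue-url "$QUEUE_URL" --max-number-of-messages 1)
    KEY=$(echo "$MSG" | jq -r '.Messages[0].Body')
    RECEIPT=$(echo "$MSG" | jq -r '.Messages[0].ReceiptHandle')
    aws s3 cp "s3://masters-bucket/$KEY" in.tif
    convert in.tif -thumbnail 150x150 out.jpg
    aws s3 cp out.jpg "s3://thumbs-bucket/${KEY%.tif}.jpg"
    aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT"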
zacman85 over 10 years ago
We would be happy to help at imgix. We currently process tens of millions of images per day, including large images like yours. Feel free to contact me at chris (at) imgix (dot) com. Link: http://www.imgix.com
orr94 over 10 years ago
    At 5 seconds an image using a tool like ImageMagick (which might be optimistic)

Are you sure ImageMagick would take that long? I haven't timed it, but I don't recall thumbnail creation with ImageMagick taking that long.

Also, if you can parallelize it, it won't take 115 days.
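For the parallelizing part, a minimal sketch with GNU parallel (assuming it and ImageMagick are installed; directory names are placeholders):

    # Convert every TIFF under ./masters, one job per CPU core by default
    mkdir -p thumbs
    find masters -name '*.tif' | parallel convert {} -thumbnail 150x150 thumbs/{/.}.jpg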
cjbprime over 10 years ago
Could you post at least one such image here with the thumbnail size you want? Then we can compete on ideas using actual numbers.
no_future over 10 years ago
I wrote a simple CUDA-based image thumbnailing tool with OpenCV a while back: https://github.com/NealP/cudathumb It's pretty fast but doesn't expose a good interface via console, so if someone would like to contribute that I would be not unpleased. Also, you need to have OpenCV compiled with CUDA to use it (which is kind of a nightmare). GraphicsMagick with its multithreaded mode (it should just work if you enable it) is pretty fast if you run it on a decent CPU; it shouldn't take nearly 115 days. If that's still not fast enough for you, you could try to make something with the Intel Performance Primitives package (if you're on Intel CPUs), though the cognitive load imposed by writing it might not be worth whatever speed boost it grants.
richm44 over 10 years ago
Don't fork a new process for each image; write something that uses the ImageMagick library directly (or another library if you prefer). Don't do them serially; use threads so you can make use of all your cores.

That said, even serially, 5 seconds per image seems very slow. Are you sure you're not hitting network latency from a remote filesystem or something? If so, do some bulk copies to get the data locally, then work on the local copy.
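Short of linking the library directly, ImageMagick's own mogrify already amortizes process startup across a whole batch - a hedged sketch, with paths as placeholders:

    # One mogrify process handles many files, vs. one convert fork per image
    mkdir -p thumbs
    mogrify -path thumbs -format jpg -thumbnail 150x150 *.tif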
brudgers over 10 years ago
This sounds like a job that only has to be done once [at this scale]. 115 machine days is a lot of computing. It's not much human time. What really counts as optimization?

I suspect it's mostly the quality of the access points and less the efficiency of the algorithm making thumbnails.

I suspect that the secondary optimization is how quickly those access points become available. In the 17 hours between the time your question was posted and this comment, more than 12,000 images could have been thumbnailed, and possibly 12,000 additional resources made available or metadata records improved.

The great thing about starting with something slow that works is that there is plenty of time to improve it [and plenty of time to decide if it really needs improving].

If the process had started at 8am yesterday, more than 50,000 images would have been converted by 8am Monday morning.

Now, in the real world of bureaucracy, there can be more friction entailed in obtaining a box, sticking it in the corner, and letting it run for four months than in spending substantial human time researching and implementing a solution that looks clever and elegant. But really, the hardware for this job, even if purchased new, is only a few hundred dollars.
cjbprime over 10 years ago
Hey, if you're a Public Library, can't you just upload them all to Flickr, let Flickr thumbnail them, and then download the thumbnails? :)
jenkstom over 10 years ago
You could "rent" space by setting up by-the-hour cloud services, but you'd probably just trade a compute problem for a bandwidth problem. You might want to make sure you know where the bottleneck is before trying to add resources - if it's your LAN, you can save time by moving closer to the files and adding a faster connection. But most likely it is CPU or memory.

Can you borrow some servers from a local computer store for a few weeks? It's not outside the realm of possibility. Call them all; they can only say "no". Or maybe some businesses in the area might have some spare resources. I'd say leverage your status as a pro bono organization for some goodwill help.
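One rough way to see where the time goes while a conversion runs (assumes Linux with the sysstat package installed; sample.tif is a placeholder):

    # High %iowait implicates storage; high %user implicates the resize itself
    iostat -x 1 &
    time convert sample.tif -thumbnail 150x150 thumb.jpg
    kill %1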
garethsprice over 10 years ago
Get a benchmark for converting a single image. Use "strace -t" (or similar on your chosen OS) to see where the bottlenecks are occurring at each stage in the program's execution.

This is a linear-time (O(n)) problem with a large set, so it's worth the effort to shave a few milliseconds where you can, as each millisecond optimized will be multiplied 2-million-fold (about 33 minutes). Once you have an optimal configuration for single images, test a small set, then let it loose on the whole set. If you can shave off 2 seconds for a single image, that's 46 machine days right there.

Can you buffer the images onto a ramdisk during conversion? Guessing HDD IO will be a large bottleneck.

Be sure to run your single-image test on different images, so you don't get false optimization positives due to various I/O caches.

What's the maximum number of images that ImageMagick will take in as a batch list? (Guessing it's somewhat short of 2M.) Whatever it is, make sure to run as large a list as possible. There's a suggestion at http://www.imagemagick.org/Usage/files/#image_streams but it re-initializes IM each time, which sounds slow (still, can you put the binaries on a ramdisk?).

You want to create a stream / "tape head" type setup where files are being processed with minimum need to re-init the conversion program. But it looks like IM6 doesn't support this, so with a sample set of that size, you may even want to look into coding up a simple C program using libtiff/libjpeg whose sole job is to run the conversion as a stream, if you have access to such skills. It may be faster than a large general-purpose tool.

Simple parallelism: create the list, split it into N (ImagickMaxNum) input list files, and run on N workstations to reduce the total problem time by O/N, as sketched below. True parallelism (network queue-based) may be worth exploring using a queue system (RabbitMQ?) but don't try to write it yourself.

There may be situations where it makes sense to access the files via a filename mask if you can rename them (img_0 -> img_2000000) so you don't have to store and parse the file list and can use a simple increment counter.

Hope this helps! I'm no optimization guru and the above is very top-of-mind, but I enjoy these large problem sets. I'm also in NYC, would love to help out the NYPL, and would volunteer some free time to do so (I need a useful side project). PM me if you'd like to talk further!

EDIT: Is this the wrong problem to tackle entirely? Can you convert the set on demand and cache-as-you-go? i.e., if it's book covers that people are browsing, the first user may wait 5 seconds for an image, but that's not *awful*... Is there a particular reason to want to create precached derivatives?
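The split-the-list step mentioned above might look like this - a hedged sketch assuming GNU split; paths are placeholders, and each workstation takes one chunk file:

    # Build the master list once, split it into 4 chunks without breaking lines
    find /masters -name '*.tif' > filelist.txt
    split -n l/4 filelist.txt chunk_
    # On each workstation, work through its own chunk
    while read -r f; do
        convert "$f" -thumbnail 150x150 "thumbs/$(basename "${f%.tif}").jpg"
    done < chunk_aa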
Someone1234 over 10 years ago
If you do them one at a time, it is 5 seconds, and the majority of that 5 seconds is likely IO wait time. You should invest in an SSD, and should also look at running multiple conversions concurrently so that the IO on the SSD is near capped.

Did you see this also: http://www.graphicsmagick.org/index.html

They claim to be faster than ImageMagick.
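For comparison, the GraphicsMagick equivalent of an ImageMagick convert call is nearly identical - a hedged sketch with placeholder filenames:

    # Same operation through GraphicsMagick's gm front end
    gm convert sample.tif -resize 150x150 sample-thumb.jpg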
jscheel over 10 years ago
Use VIPS (http://www.vips.ecs.soton.ac.uk/index.php?title=VIPS) for large TIFF conversion. It should handle them much, much faster. Also, I think there was a write-up on highscalability.com about Instagram, or Pinterest, or someone resizing a ton of images a while ago.
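libvips ships a CLI built for exactly this job; a hedged sketch (flag spellings vary across vips versions, so check vipsthumbnail --help):

    # vipsthumbnail streams the image, keeping memory low even on large TIFFs
    vipsthumbnail sample.tif --size 150 -o tn_%s.jpg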
exelib over 10 years ago
My question is: do you need all the thumbnails at once? I set up an image album for my child's kindergarten with more than 17GB (>4,000) of JPEGs. I used the nginx webserver for resizing on the fly and then let nginx cache the result. In a VM with 4 cores from an i7-920, it thumbnails 2-6 images per second. After first access, images are served from the cache.
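A hedged sketch of that nginx setup using the stock image_filter module - note the module only decodes JPEG, GIF, and PNG (not TIFF), so it fits this JPEG album rather than the OP's masters; paths and sizes are placeholders:

    # Resize on request; a caching layer in front of this location
    # would store the small result instead of re-resizing every hit
    location /resize/ {
        alias /data/album/;
        image_filter resize 150 150;
        image_filter_buffer 20M;
    }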
percept over 10 years ago
In RubyLand I used to use ImageScience (http://docs.seattlerb.org/ImageScience.html), which is apparently based on FreeImage: http://freeimage.sourceforge.net/
eabraham over 10 years ago
A few months ago I did benchmarks on ImageMagick and similar libraries for a Rails application I was working on. I discovered VIPS, which has a significant performance improvement over ImageMagick.

https://github.com/jcupitt/ruby-vips
pixl8ed over 10 years ago
http://codeascraft.com/2010/07/09/batch-processing-millions-of-images/ is a writeup of a similar task at Etsy to resize 135 million images.
keviv over 10 years ago
A simple solution: Use Gearman to queue the jobs and use supervisord to run multiple instances of the worker to process these jobs. Should do the trick.
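The supervisord half of that is only a few lines - a hedged sketch where the worker script name and process count are placeholders:

    ; Run 8 copies of a (hypothetical) Gearman worker
    [program:thumb-worker]
    command=/usr/local/bin/thumb_worker.sh
    process_name=%(program_name)s_%(process_num)02d
    numprocs=8
    autorestart=true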
virmundi over 10 years ago
Out of curiosity, do you need them all built? Could you thumbnail as they are requested? So the first request is slow, but you cache for the future?
jason_slack over 10 years ago
Just as an interesting experiment, would anyone want to collaborate on a CUDA program to do this? I'm learning to utilize it now with C++.
LarryMade2 over 10 years ago
gthumb; GNOME file menu - create thumbnails...