
Batch Processing Millions and Millions of Images

80 points by mattyb, almost 15 years ago

11 comments

lars512, almost 15 years ago
The problem of resizing all these images is an "embarrassingly parallel" one, right? You don't care about how fast any individual image is resized, only how fast they're resized in aggregate, and each image is a nice small chunk of work.

The author spends time tuning the number of workers and the number of OpenMP processes per worker for GraphicsMagick on his 16-core machines. Isn't this type of tuning a waste of time? Even using just two cores instead of one seems to introduce inefficiency. Wouldn't he have been better off just using 16 workers, each compiled without OpenMP so that it would run serially (and more efficiently)?
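As a rough illustration of that "16 serial workers" idea, here is a minimal sketch in Python: a process pool in which every worker shells out to GraphicsMagick with OpenMP effectively disabled. The paths, geometry, and worker count are assumptions, not what the article actually used.

    # Hypothetical sketch, not the article's actual setup: one worker process per
    # core, each running GraphicsMagick single-threaded (OMP_NUM_THREADS=1).
    import os
    import subprocess
    from multiprocessing import Pool

    SRC_DIR = "/data/images/full"    # placeholder paths
    DST_DIR = "/data/images/thumbs"

    def resize_one(filename):
        src = os.path.join(SRC_DIR, filename)
        dst = os.path.join(DST_DIR, filename)
        subprocess.run(
            ["gm", "convert", src, "-resize", "170x135", dst],
            check=True,
            env={**os.environ, "OMP_NUM_THREADS": "1"},  # keep each resize serial
        )

    if __name__ == "__main__":
        with Pool(processes=16) as pool:              # one worker per core
            pool.map(resize_one, os.listdir(SRC_DIR), chunksize=64)

Whether that beats a tuned workers-times-OpenMP-threads split is exactly the question lars512 is raising; the sketch just makes the all-serial end of the trade-off concrete.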
bockris, almost 15 years ago
Several years ago I had to do something similar (tens of thousands instead of millions, though).

Our problem was that we were making a composite image but sometimes the 'background' image content was shifted to the left or right. (e.g. all images were the same size with a white background, but some were made from a different source template, and so the actual content sometimes started at pixel 25 and other times at pixel 35.) To make it worse, they were all JPEGs.

I ended up writing a program to find the bounding box of the content (with a fudge factor to account for the JPEG dithering) and identified the bad images to be fixed by hand. It was a nice diversion from SQL and ASP.
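For what it's worth, a bounding-box check like that is only a few lines with a modern imaging library; the sketch below uses Pillow and invented threshold/offset values, not bockris's original program.

    # Hypothetical sketch: flag composites whose content starts at an unexpected
    # x offset, tolerating near-white JPEG noise around the edges.
    from PIL import Image, ImageChops

    WHITE_FUZZ = 16        # how far from pure white still counts as background (assumed)
    EXPECTED_LEFT = 25     # where content normally begins (assumed)

    def content_bbox(path):
        img = Image.open(path).convert("L")
        # Pixels within WHITE_FUZZ of white clamp to zero; real content stays non-zero.
        diff = ImageChops.subtract(Image.new("L", img.size, 255), img, offset=-WHITE_FUZZ)
        return diff.getbbox()          # (left, upper, right, lower), or None if blank

    def looks_shifted(path, slack=5):
        box = content_bbox(path)
        return box is not None and abs(box[0] - EXPECTED_LEFT) > slack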
liuliu, almost 15 years ago
Call me naive, but I think for any serious processing you need to dig into the actual underlying algorithm and implementation to make a difference. The hardware differences between generations are huge, and different architectures can have a big impact in terms of performance (really down to the details: cache lines, bandwidth, and SIMD instructions; image processing is sensitive to all of the above). Tuning your custom implementation can suddenly be worthwhile.
pilif, almost 15 years ago
Mmh.

"In fact, the research phase of this project took longer than the batch processing itself. That was clearly time well spent."

I'm not so sure about this. A sufficiently parallelized but otherwise unresearched script could have been created in a fraction of the time. So you would just generate the correct thumbnails on new uploads and then let the batch job run for, say, two weeks. That naive approach will take you around a day's worth of development time.

The time you gained by doing it the naive way you then put into the rewrite of that legacy component that prevented you from generating the images on the fly.

What you have done here is, IMHO, wasted time on a legacy solution, and you wasted hardware resources on storage of the additional pictures, of which probably only a minority will actually be seen anyway.

Don't get me wrong: I'm sure you had a lot of fun, and there are few things more exciting than seeing your code perform magnitudes quicker than the initial approach. But in the context of your quote, I have to disagree.

Unless there's more background you didn't tell us about.
callmeed, almost 15 years ago
Good, detailed post. I just did about 250K images using ImageMagick and a shell script. It was across 3 boxes, and I noticed quite a performance difference between an older and newer version of IM. It could have been another factor, I suppose.

One thing I'm skeptical about:

"We found out, almost by accident, that using the previously down-sized “large” images resulted in better quality and faster processing than starting with the original full-size images."

Obviously, the processing speed will be faster with a down-sized image, but I can't see how starting with a smaller image will give you better quality. Unless the original image was so large that downsizing caused a lot of detail loss (in which case the image should have been sharpened).
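One cheap way to poke at that claim (hypothetical, with placeholder file names): render the thumbnail both ways and measure how far apart the results actually are. It won't tell you which looks better, but it shows whether the two pipelines even diverge.

    # Hypothetical comparison: thumbnail from the full-size original vs. from the
    # pre-sized "large" image. Assumes ImageMagick's convert/compare are installed.
    import subprocess

    subprocess.run(["convert", "photo_original.jpg", "-resize", "170x135",
                    "thumb_from_original.jpg"], check=True)
    subprocess.run(["convert", "photo_large.jpg", "-resize", "170x135",
                    "thumb_from_large.jpg"], check=True)

    # `compare` prints the metric on stderr and exits non-zero when images differ,
    # so don't use check=True here.
    result = subprocess.run(
        ["compare", "-metric", "RMSE",
         "thumb_from_original.jpg", "thumb_from_large.jpg", "null:"],
        capture_output=True, text=True)
    print("RMSE:", result.stderr.strip())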
patio11, almost 15 years ago
If you want to make the "pick which setting is better" optimization a wee bit better, it is about two hours of work to rig up "Which do you like: image A, image B, or 'They're the same'?" and then simulated annealing your way to victory. Randomize which side you place the "better" setting on, obviously.

I'd almost be tempted to spend a few more hours and hook it up to Mechanical Turk rather than actually doing the classification myself. (Partially because I get bored easily and partially because I have all the artistic appreciation of a mole rat.)
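A sketch of what that rig might look like, with the rendering and voting steps left as stubs (nothing here is from the article or patio11's actual setup):

    # Hypothetical preference-driven search over thumbnail settings: random-walk the
    # settings, keep whichever rendering the human prefers, and occasionally accept
    # a loss early on (the simulated-annealing part).
    import math
    import random

    QUALITIES = list(range(60, 96, 5))      # candidate JPEG qualities (assumed)
    SHARPENS = [0.0, 0.5, 1.0, 1.5]         # candidate sharpen amounts (assumed)

    def render(quality, sharpen):
        """Stub: render a sample thumbnail with these settings, return its path."""
        raise NotImplementedError

    def human_prefers_candidate(current_img, candidate_img):
        """Stub: show both images (randomize which side each is on!) and
        return True if the candidate wins, False otherwise."""
        raise NotImplementedError

    def anneal(steps=50):
        current = (random.choice(QUALITIES), random.choice(SHARPENS))
        for step in range(steps):
            temperature = max(0.01, 1.0 - step / steps)     # simple cooling schedule
            candidate = (random.choice(QUALITIES), random.choice(SHARPENS))
            if human_prefers_candidate(render(*current), render(*candidate)):
                current = candidate
            elif random.random() < math.exp(-1.0 / temperature):
                current = candidate                         # accept a loss, sometimes
        return current

Pointing human_prefers_candidate at Mechanical Turk instead of your own eyes is the extra few hours patio11 mentions.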
cageface, almost 15 years ago
Did you investigate resizing them on the fly with some kind of caching layer? As one of the posters notes, it seems likely that a lot of those images will be very rarely seen, if ever.
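A minimal sketch of that on-the-fly approach, assuming a plain disk cache keyed by filename and size (a real setup would likely put a CDN or nginx in front of this):

    # Hypothetical resize-on-demand helper: serve a cached thumbnail if one exists,
    # otherwise generate it once with GraphicsMagick and cache it.
    import os
    import subprocess

    CACHE_DIR = "/var/cache/thumbs"          # placeholder

    def thumbnail(src_path, width, height):
        key = f"{os.path.basename(src_path)}_{width}x{height}.jpg"
        cached = os.path.join(CACHE_DIR, key)
        if not os.path.exists(cached):       # cache miss: do the work exactly once
            os.makedirs(CACHE_DIR, exist_ok=True)
            subprocess.run(
                ["gm", "convert", src_path, "-resize", f"{width}x{height}", cached],
                check=True)
        return cached    # images nobody ever views are never resized at all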
natch, almost 15 years ago
Nice article. FWIW Perl is often called the "Swiss army chain saw," not "Swiss army knife." Thanks for sharing your experience.
spidaman, almost 15 years ago
Hate to be the Monday morning quarterback, but hey, it's Saturday morning and I'm stuck in a Starbucks in Benicia. So, this is a nice walk through the optimization process, but it is a fundamentally unscalable system. When the workload triples next year, does it make sense to scale up the hardware? (No)

This system doesn't scale out without ad-hoc partitioning. I think Etsy's approach here should have been informed by the New York Times' project a few years ago. They converted 4 TB of scanned TIFFs (their article archive) into PDFs on a Hadoop cluster running on EC2 and S3. Parallelizing the process across hardware nodes is the scale-free way to do this, exactly what Hadoop was intended for.

All that said, I know there are smart folks at Etsy. I'll give 'em the benefit of the doubt; there may be a good reason not to go that route, but this write-up didn't make that clear.
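The per-node piece of a scale-out job like that can be tiny. A streaming-style mapper along these lines (hypothetical, with an assumed directory layout) is all each worker needs; the cluster handles splitting the file list across machines:

    #!/usr/bin/env python3
    # Hypothetical Hadoop-Streaming-style mapper: each stdin line is an image path
    # on shared storage; emit "path<TAB>status" so failures are easy to grep later.
    import subprocess
    import sys

    for line in sys.stdin:
        src = line.strip()
        if not src:
            continue
        dst = src.replace("/full/", "/thumbs/")      # assumed directory layout
        try:
            subprocess.run(["gm", "convert", src, "-resize", "170x135", dst], check=True)
            print(f"{src}\tok")
        except subprocess.CalledProcessError as exc:
            print(f"{src}\terror:{exc.returncode}")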
DanielBMarkham, almost 15 years ago
I hate to say this, but for some reason I can't help myself.

This looks like a problem for which I could write the code in 15 minutes. Assuming Markham's Rule of Estimating (double the number and go to the next larger unit), that's about 30 hours of work.

This is a problem just begging for a functional solution. If you've got a language that already has a bunch of libraries, like .NET, you just wire it up, pipeline it, and send it out to as many workers as you want. The image tweaking, worker nitpicking stuff just isn't worth the time. If you're doing millions of everything, then I'm sure you've already got some broker parallelization thing going on. Shouldn't need to redo that every time. If you want to be truly anal, do some tests to determine the fastest way to compress. But aside from that, it's all big old hunks of immutable data splaying out into the universe, running on as many cores as you need.

I probably missed something. These things always look simpler from the outside.
ewams, almost 15 years ago
Why did you do this as a batch job? Do you run batch jobs instead of resizing them when a user uploads the pictures? Thanks for sharing.