"fade", "grad" and "lerp" are written as non-static member functions (so every call pushes an implicit "this"), yet there is no need for that. Making them static, or inlining them, might give some overall speedup.<p>The author claims it's a straight translation of Ken Perlin's Java code, yet he managed to drop the "static" from those three functions.<p>EDIT: Some more problems.<p>Since Perlin noise is a perfect candidate for a data-parallel split, it would be much more efficient to split one image among all threads and have each thread process its own part, instead of creating one image per thread.<p>This way memory usage is reduced: you are processing one image, not N. For very big images it might be even more efficient to limit each thread's unit of work to 16x16 or 256x1 image blocks.<p>There is another Java->C++ problem with the code here: since the p[] array is instantiated per object (not static, again), each worker has its own copy, which wastes L1 cache (the p[] data is the same everywhere).<p>Overall this task is better suited to OpenMP (data-parallel) than to the async stuff (but thanks for demonstrating it, I was not familiar with it).
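A minimal C++ sketch of the points above, not the article's code: the three helpers are static, the permutation table is a single shared read-only copy, and one image is split by rows among threads with OpenMP. The names Noise, make_table and render are hypothetical, and the table is filled from a fixed-seed shuffle as a stand-in for Perlin's reference permutation. Built with -fopenmp the row loop runs in parallel; without that flag the pragma is ignored and the loop runs serially.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Assumption: a fixed-seed shuffle stands in for Perlin's reference permutation.
static std::array<int, 512> make_table() {
    std::array<int, 512> t{};
    std::iota(t.begin(), t.begin() + 256, 0);
    std::mt19937 rng(42);
    std::shuffle(t.begin(), t.begin() + 256, rng);
    // Duplicate the first half so lookups up to index 511 need no wrapping.
    std::copy(t.begin(), t.begin() + 256, t.begin() + 256);
    return t;
}

struct Noise {  // hypothetical name
    // One shared, read-only table for every thread (not one copy per worker).
    static const std::array<int, 512> p;

    // Static helpers: no implicit "this", trivial for the compiler to inline.
    static double fade(double t) { return t * t * t * (t * (t * 6 - 15) + 10); }
    static double lerp(double t, double a, double b) { return a + t * (b - a); }
    static double grad(int hash, double x, double y, double z) {
        const int h = hash & 15;
        const double u = h < 8 ? x : y;
        const double v = h < 4 ? y : (h == 12 || h == 14 ? x : z);
        return ((h & 1) == 0 ? u : -u) + ((h & 2) == 0 ? v : -v);
    }

    static double noise(double x, double y, double z) {
        // Unit cube containing the point, then position inside that cube.
        const int X = static_cast<int>(std::floor(x)) & 255;
        const int Y = static_cast<int>(std::floor(y)) & 255;
        const int Z = static_cast<int>(std::floor(z)) & 255;
        x -= std::floor(x);
        y -= std::floor(y);
        z -= std::floor(z);
        const double u = fade(x), v = fade(y), w = fade(z);
        // Hash the coordinates of the eight cube corners.
        const int A = p[X] + Y, AA = p[A] + Z, AB = p[A + 1] + Z;
        const int B = p[X + 1] + Y, BA = p[B] + Z, BB = p[B + 1] + Z;
        // Blend the gradient contributions from the eight corners.
        return lerp(w,
            lerp(v, lerp(u, grad(p[AA], x, y, z), grad(p[BA], x - 1, y, z)),
                    lerp(u, grad(p[AB], x, y - 1, z), grad(p[BB], x - 1, y - 1, z))),
            lerp(v, lerp(u, grad(p[AA + 1], x, y, z - 1), grad(p[BA + 1], x - 1, y, z - 1)),
                    lerp(u, grad(p[AB + 1], x, y - 1, z - 1),
                            grad(p[BB + 1], x - 1, y - 1, z - 1))));
    }
};
const std::array<int, 512> Noise::p = make_table();

// Fill ONE image; its rows are divided among the OpenMP threads.
std::vector<double> render(int width, int height, double scale) {
    assert(width > 0 && height > 0);
    std::vector<double> img(static_cast<std::size_t>(width) * height);
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            img[static_cast<std::size_t>(y) * width + x] =
                Noise::noise(x * scale, y * scale, 0.5);
    return img;
}
```

Calling render(1024, 1024, 0.01) produces a single 1024x1024 buffer that all threads write into, each one touching only its own rows, so no locking is needed.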
"As a side note, the parallel version uses about 280 threads on my machine vs a single thread for the serial version."<p>This is a great example of how you shouldn't run things in parallel. Running 280 threads on a machine with 2-4 cores makes no sense at all.
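A rough C++ sketch of the alternative, assuming the work is one image processed by rows: size the number of workers to the core count and give each worker one contiguous band of rows. The names worker_count, split_rows and run_bands are hypothetical and are not taken from the article.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Number of workers to start: the reported core count, but never zero.
// hardware_concurrency() may return 0 when the value is not known.
unsigned worker_count() {
    return std::max(1u, std::thread::hardware_concurrency());
}

// Split rows [0, height) into at most `workers` contiguous bands [begin, end).
std::vector<std::pair<int, int>> split_rows(int height, unsigned workers) {
    assert(height >= 0 && workers > 0);
    std::vector<std::pair<int, int>> bands;
    const int n = static_cast<int>(std::min<unsigned>(workers, height));
    int begin = 0;
    for (int i = 0; i < n; ++i) {
        // The first (height % n) bands get one extra row each.
        const int size = height / n + (i < height % n ? 1 : 0);
        bands.emplace_back(begin, begin + size);
        begin += size;
    }
    return bands;
}

// Call work(begin, end) once per band, one thread per band, then join them all.
void run_bands(int height, const std::function<void(int, int)>& work) {
    std::vector<std::thread> pool;
    for (const auto& band : split_rows(height, worker_count()))
        pool.emplace_back(work, band.first, band.second);
    for (auto& t : pool) t.join();
}
```

On a 4-core machine, run_bands(1024, work) would start 4 threads with 256 rows each instead of hundreds of threads competing for the same cores.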