
The Magic Kernel

197 points by aleyan, about 4 years ago

7 comments

johncostella, about 4 years ago

As I just told a colleague who told me of this post, "I'm so old I don't even know how to use that [this] site." :)

Happy to answer questions or address criticisms.
marcodiego, about 4 years ago

Previous discussion: https://news.ycombinator.com/item?id=10404517
0-_-0, about 4 years ago

Isn't the magic kernel simply a 2× bilinear downscale filter? For example, if you use 2× bilinear *downscaling* in TensorFlow (as opposed to bilinear *downsampling*, which would end up being just a box filter) via tf.image.resize, it uses the magic kernel [1,3,3,1].

There is an article [0] that points this out, although it later goes on to recommend odd-length filters, which insidiously break many applications where you would want to use downscaling, by shifting the image by a fraction of a pixel.

[0] https://cbloomrants.blogspot.com/2011/03/03-24-11-image-filters-and-gradients.html
peter_d_sherman, about 4 years ago

> "In early 2015 I numerically analyzed the *Fourier properties* of the Magic Kernel Sharp algorithm, and discovered that it possesses high-order zeros at multiples of the sampling frequency, which explains why it produces such amazingly clear results: these high-order zeros greatly suppress any potential aliasing artifacts that might otherwise appear at low frequencies..."

[...]

> "Following that work, I analytically derived the *Fourier transform* of the Magic Kernel in closed form, and found, incredulously, that *it is simply the cube of the sinc function*. This implies that the Magic Kernel is just the *rectangular window function convolved with itself twice* -- which, in retrospect, is completely obvious. This observation, together with a precise definition of the requirement of the Sharp kernel, allowed me to obtain an analytical expression for the exact Sharp kernel, and hence also for the exact Magic Kernel Sharp kernel, which I recognized is just the third in a sequence of fundamental resizing kernels. These findings allowed me to explicitly show why Magic Kernel Sharp is superior to any of the Lanczos kernels. It also allowed me to derive further members of this fundamental sequence of kernels, in particular the sixth member, which has the same computational efficiency as Lanczos-3, but has far superior properties."

Paper: http://www.johncostella.com/magic/mks.pdf

PDS: My thoughts: Absolutely fascinating! (Also, that the "Magic Kernel is just the rectangular window function convolved with itself twice" was not obvious to me, so it's good you included that!)
londons_explore, about 4 years ago

You imply that this is implemented over gamma-compressed data... Doesn't that make all the analysis you have done void?
kragen, about 4 years ago

Hmm, aside from my note in https://news.ycombinator.com/item?id=26543937 about how the family of kernels Costella discovered are precisely the uniform cardinal B-splines, I noticed this slight error:

> Computationally, however, nothing short of Fast Fourier Transforms or hardware acceleration can make a large blurring job efficient: for large values of blur, the support of any blur kernel is large, and not practical for some at-scale applications.

The standard way to do Gaussian blur is to approximate it as a B-spline, specifically a cascade of simple moving-average filters, also known as a Hogenauer filter or CIC filter in the time domain. Typically a cascade of three filters is used, precisely as Costella describes. This is both simpler and faster than doing the task with a Fourier transform, and moreover it is an online algorithm: it produces output samples at some processing lag behind the input samples.

As you can easily verify, the following program uses this approach to convolve its input buffer with a Gaussian kernel with support of 766 samples, constructed in just the way Costella describes (as a B-spline), in a single pass over the input buffer, with only six additions and subtractions per sample, not counting the additions and subtractions used for addressing.

    #include <stdio.h>

    enum {lag = 256, image_size = 4096};
    unsigned long buf[image_size] = {0};

    int main(int argc, char **argv)
    {
        buf[image_size/2] = 1;
        for (size_t i = 0; i < image_size - 3 * lag - 3; i++) {
            buf[i + 3 * lag + 2] += buf[i + 3 * lag + 1];
            buf[i + 3 * lag + 1] += buf[i + 3 * lag];
            buf[i + 3 * lag]     += buf[i + 3 * lag - 1];
            buf[i + 2 * lag]     -= buf[i + 3 * lag];
            buf[i + lag]         -= buf[i + 2 * lag];
            buf[i] = buf[i + lag] - buf[i];
        }
        for (size_t i = 0; i < image_size - 3 * lag - 3; i++) {
            if (i) printf((i % 8) ? ", " : ",\n");
            printf("%6ld", buf[i]);
        }
        printf("\n");
        return 0;
    }

Removing the printf calls, this takes about 162'677 instruction executions to run, according to Valgrind, which is 39 instructions per pixel. But most of that is still startup and shutdown; increasing the image size by another 4096 increases the number to 232'309 instructions, only 69632 more instructions, or 17 instructions per sample. (printf takes almost two orders of magnitude more time than the blur itself.)

Adding an outer loop reveals that on my nine-year-old laptop it takes 6.5 μs to run over the whole buffer, or 1.6 nanoseconds per sample. In theory this performance does not depend on the lag and thus the blur radius, but I've only tested a single lag.

The output samples are in some sense scaled by 2²⁴ = 16777216, or by 2¹⁶ = 65536 if you're interpolating a signal that's being upsampled by zero insertion by a factor of 256. This scale factor does depend on the lag (it is the cube of the lag).

This version computes the results in place starting from output that is already in memory, but you can also use a ring buffer with a capacity of at least 3*lag + 2 = 770 samples. If you're *downsampling*, an alternative is to decimate the samples as soon as they've passed through the cascade of three integrators at the top of the loop, thus reducing both the amount of memory needed and the number of operations to do.

Overflow of intermediate results can happen, but it is not a problem as long as the final output is within range, because exact integer addition mod k is a group. If you were doing this in floating point, you might prefer to interleave the integrators with the comb filters to prevent intermediate overflow and consequent loss of precision leading to numerical instability.

Only the parts of the result buffer over which all six pipeline stages have passed contain valid results. The simplest solution to this is padding, but you can also use special logic to run the integrators but not the combs over the beginning of the input.

I would argue that 17 instructions or 1.6 nanoseconds per sample is always "efficient" and "practical" for nearly all "at-scale applications". If you were blurring a 3840×2160 display at 120 fps, one gigapixel per second, it would take 13 milliseconds of my laptop CPU cores per second for each of the R, the G, and the B; you could do it on about 6 cores of a modern CPU, without a GPU.
PaulHoule, about 4 years ago

I always thought the Lanczos kernel was overrated.