Implementing Mario's Stack Blur 15 times in C++ (with tests and benchmarks)

111 pointsby wlllover 1 year ago

13 comments

abbeyjover 1 year ago

> Each time the [std::deque] slides, every value in the queue has to change position in memoryI don't think that this is true. `std::deque` has guaranteed O(1) complexity for insertions or deletions at the beginning or end of the container. It couldn't provide this if it was moving every value in the queue.> How do we alter the stackSum when the queue moves?You never quite answer this question. How are `sumOut` and `sumIn` related to `stackSum`? This is available in the code but I feel like there should at least be a one-line summary available without having to click through to the code.> sumOut is what we need to add to the stack.This is supposed to be `sumIn`?The examples that you provide use a radius of 2, thus having a `sizeOfStack` of (radius+1)*2 = 9 and requiring a division by 9 for each pixel. Have you considered using a radius of 1 or a radius of 3 instead? This would require division by (1+1)*2 = 4 or (3+1)*2 = 16 respectively which could then be done with a bitshift instead of a division. You'd have to template the code on `radius` instead of passing it in as a runtime parameter so that the compiler could lower the divisions to bitshifts.

评论 #38221533 未加载

bhoustonover 1 year ago

2D Radial blur is a separable filter [1], which means you can calculate it via first doing a 1D blur in one direction, and then using that intermediate result and doing a 1D blur in the other direction and it is equivalent to doing a 2D radial blur.Could one modify the stack blur algorithm to also operate in a separable fashion? It seems to me that one can combine Stack Blur with the separable concept to get the best of both worlds.Of course traversing the data in a way not aligned with its in-memory data layout may cause issues, so one would have to figure out an efficient way to do that. I can think of a few ways of doing that using prefetching.[1] <a href="https://en.wikipedia.org/wiki/Separable_filter" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Separable_filter</a>

评论 #38219592 未加载

corysamaover 1 year ago

Probably would have been much easier to do 15 times in <a href="https://halide-lang.org/" rel="nofollow noreferrer">https://halide-lang.org/</a>The idea behind Halide is that scheduling memory access patterns is critical to performance. But, access patterns being interwoven into arithmetic algorithms makes them difficult to modify separately.So, in Halide you specify the arithmetic and the schedule separately so you can rapidly iterate on either.

评论 #38223603 未加载

TylerGlaielover 1 year ago

Biggest place for improvement performance-wise I can see from this is that you aren't taking into account cache coherency here. Horizontal blurs with this method are going to be ripping fast, but vertical blurs are going to constantly cache miss. There's a number of ways to potentially fix this, the fact that you only need to store 3 sums for one line means you could do a few (64) columns at a time and just store that many distinct sets of sums (sorta like doing 64 columns in parallel, just making sure to use the data thats in the cache while its still in the cache)

评论 #38223547 未加载

BryanLegendover 1 year ago

I was so disappointed that this had nothing to do with the new Mario Wonder game. :(

评论 #38219902 未加载

bitwizeover 1 year ago

Every time I see an unadorned "Mario" I think of the video game plumber. So I was a bit stoked to see some sort of algorithm for a special effect for Super Mario Bros. Wonder that escaped my notice, maybe a way to blur the player sprite for invincibility power-up effects by stacking multiple copies of the sprite of varying opacity on top of one another?Well, regardless, today I learned.

shoveover 1 year ago

Mario is absolutely legendary in the creative coding space. Everything he does has that “how the hell did he do that?!” vibe, from the Flash/ActionScript stuff back in the day to some of the only actually interesting ML image generation work I’ve seen.

dxhdrover 1 year ago

One thing the article didn't mention is that a gaussian blur can be approximated with multiple passes of a box blur. Not sure how that would relate performance-wise to the discussed stack blur algorithm. The code would probably be a bit simpler, anyway.

评论 #38223661 未加载

评论 #38220903 未加载

sebastianmestreover 1 year ago

In my experience, 3 passes of box blur looks good enough for small kernels (less than 50px). Is this faster than that?

评论 #38221852 未加载

anonymoushnover 1 year ago

It seems like for the 1d case you should be able to compute the double-prefix sum (the prefix sum of the prefix sum of the input row) and then add 2 elements of that array and subtract 2 other elements to get the output pixel.Here are notes about how to compute the prefix sum, which can be adapted to compute the double-prefix sum: <a href="https://en.algorithmica.org/hpc/algorithms/prefix/" rel="nofollow noreferrer">https://en.algorithmica.org/hpc/algorithms/prefix/</a>Since you access members of the double-prefix sum array in a fixed pattern you can also use SIMD here, i.e. load x4, add, add, sub, divide by a constant = (mul, shift), shuffle, store to compute 8 pixels on AVX2. This assumes you have a radius > 15 and are using u32 elements, for <= 15 you can use u16 elements in the double-prefix sum array, but I'm not sure if this is a win because I don't know how to do implement the divide step. It's cheap to do only part of the row and pick up the double-prefix sum computation later, so length of the buffer you work on should be chosen to fit in L1.This technique should work with padding: just pad the array before computing the double-prefix sum, so the full double-prefix sum array has n + 2r + 2 entries, but as above your work buffer should be smaller than this if the image is very wide, and the whole array doesn't need to exist at the same time. It should also work without padding: you need to do libdivide or fastmod-style computations with ~r different fixed divisors at the edge of the image. This is still worth it, you'll be doing dozens or thousands of divisions with each edgy divisor. An ugly corner case appears here when 2r > n. Sorry.Once your 1d case is much, much faster, I would expect transposing and then transposing back to be a clear win over not transposing at all, because SIMD gather instructions are dreadful and if you want to index individual bytes with a large stride they don't even exist.I'm interested in comparing this to the approach used in TFA, but the github repository seems to be missing a bunch of header files...As an aside, some other comments mentioned that stack blur is "a box blur of a box blur" and the "prefix sum of a prefix sum" idea in this comment corresponds to that "box blur of a box blur" idea. If you wanted, it wouldn't be that much more expensive to add 1 more level of prefix summing (= 1 more level of box blur) to get a kernel that looks much more like a gaussian.

LoganDarkover 1 year ago

Oh hey, I'm credited in the article! However, that rustrepo website you linked is a really bad mirror of the actual GitHub repo: <a href="https://github.com/LoganDark/stackblur-iter">https://github.com/LoganDark/stackblur-iter</a>Please fix :)AMA

评论 #38222770 未加载

Tempest1981over 1 year ago

How were the images in this web page created? Nicely done.

评论 #38220389 未加载

lionkorover 1 year ago

instead of a deque or circular buffer, wouldnt it make more sense to use a span?

评论 #38220436 未加载