I wonder how this is implemented on the GPU. From my time working on a 3D renderer a long time ago, triangles with offscreen vertices would be clipped into smaller triangles, so in the end you would still be rendering multiple triangles anyway. I imagine it would also be possible to clip the scanlines instead.
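For what it's worth, the textbook algorithm for that clipping step is Sutherland-Hodgman. A minimal sketch in Python (my own naming; real GPUs clip in homogeneous clip space, not in 2D NDC as here): clipping the usual oversized fullscreen triangle against the viewport yields the screen quad, which would then be re-triangulated.

```python
def clip_halfplane(poly, inside, intersect):
    """Sutherland-Hodgman step: clip a polygon against one half-plane."""
    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]          # wraps around to the last vertex at i == 0
        if inside(cur):
            if not inside(prev):
                out.append(intersect(prev, cur))
            out.append(cur)
        elif inside(prev):
            out.append(intersect(prev, cur))
    return out

def clip_to_ndc(poly):
    """Clip a polygon to the [-1, 1]^2 NDC square, one edge at a time."""
    for axis in (0, 1):
        for sign in (1.0, -1.0):
            def inside(p, a=axis, s=sign):
                return s * p[a] <= 1.0
            def intersect(p, q, a=axis, s=sign):
                # Solve s * (p[a] + t * (q[a] - p[a])) == 1 for t.
                t = (s - p[a]) / (q[a] - p[a])
                return tuple(p[i] + t * (q[i] - p[i]) for i in range(2))
            poly = clip_halfplane(poly, inside, intersect)
    # Drop consecutive duplicate vertices the algorithm can emit on boundaries.
    return [p for i, p in enumerate(poly) if p != poly[i - 1]]

# The classic oversized fullscreen triangle in NDC:
tri = [(-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0)]
print(clip_to_ndc(tri))  # the four viewport corners, i.e. a quad
```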
A bigger reason to do this is that on some (shoddy) hardware, the user sees a tear line along the diagonal between the two triangles.<p>It's as if one triangle was sometimes rendered before the vsync while the other was rendered after it.
<p><pre><code> In actual hardware shading is done 32 or 64 pixels at a time, not four. The problem above just got worse.
</code></pre>
While it's true that there is "wasted" execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient.<p>I don't think it's publicly documented how the "packing" of quads into lanes is done in the rasterizer on modern GPUs. I'd guess it's something opportunistic (maybe per tile) that takes advantage of the general spatial coherency of triangles in mesh order.
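A toy model of that 2x2-quad derivative mechanism may help (conceptual only; `shade_quad` and the lane layout are my invention, and as noted above the real packing is not publicly documented): screen-space derivatives are just finite differences between neighbouring lanes in the quad, which is why pixels that fall outside the triangle along an edge still have to execute as helper lanes.

```python
def shade_quad(varying, x, y):
    """Evaluate a varying over a 2x2 pixel quad and take finite differences,
    mimicking how ddx/ddy-style derivatives are produced. Lanes outside the
    triangle would still be evaluated as "helper" lanes just to feed these
    differences -- that is the per-quad waste mentioned above."""
    v = {(dx, dy): varying(x + dx, y + dy) for dx in (0, 1) for dy in (0, 1)}
    ddx = v[(1, 0)] - v[(0, 0)]  # horizontal neighbour difference
    ddy = v[(0, 1)] - v[(0, 0)]  # vertical neighbour difference
    return ddx, ddy

# A varying that ramps 0.5 per pixel in x gives ddx = 0.5, ddy = 0.0:
print(shade_quad(lambda x, y: 0.5 * x, 10, 20))
```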
> In my microbenchmark the single triangle approach was 0.2% faster than two.<p>Sounds like something that would be within the margin of error? It seems especially meaningless because it reports just the <i>average</i> of the timings, rather than something that would show the distribution, like a histogram or KDE.
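To put a number on that intuition, here's a toy simulation (synthetic Gaussian timings, not the article's actual data): two samples drawn from the <i>same</i> distribution with ~2% jitter routinely show mean differences on the order of 0.2%.

```python
import random

random.seed(42)  # deterministic, just for the sake of the example
mean = lambda xs: sum(xs) / len(xs)

# Two "benchmarks" that are actually identical: ~1.0 ms mean, 2% jitter.
a = [random.gauss(1.0, 0.02) for _ in range(200)]
b = [random.gauss(1.0, 0.02) for _ in range(200)]

rel_diff = abs(mean(a) - mean(b)) / mean(a)
print(f"mean difference from noise alone: {rel_diff:.2%}")
```

Without some distributional view (histogram, percentiles, or at least a standard error), a 0.2% gap is indistinguishable from this kind of noise.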
This is interesting, but wouldn't the texture mapping / UVs be more confusing, possibly outweighing the benefit of the micro-optimisation?<p>The good thing about having 4 vertices is that you can just give each one a vertex position and a set of texture coordinates (x, y), and the texture maps exactly.
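It turns out the UVs cause no extra confusion: give the oversized triangle proportionally oversized UVs, and linear (barycentric) interpolation lands exactly on [0, 1] across the visible viewport. A quick check in Python (vertex values match the common gl_VertexID fullscreen-triangle trick; the helper names are mine):

```python
# NDC positions and UVs of the standard oversized fullscreen triangle.
verts = [((-1.0, -1.0), (0.0, 0.0)),
         (( 3.0, -1.0), (2.0, 0.0)),
         ((-1.0,  3.0), (0.0, 2.0))]

def barycentric(p, a, b, c):
    """Barycentric weights of point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    d = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    u = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / d
    v = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / d
    return u, v, 1.0 - u - v

def uv_at(p):
    """UV the rasterizer would interpolate at NDC point p."""
    w = barycentric(p, *(pos for pos, _ in verts))
    return tuple(sum(w[k] * verts[k][1][i] for k in range(3)) for i in range(2))

for corner in [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]:
    print(corner, "->", uv_at(corner))  # -> (0,0), (1,0), (0,1), (1,1)
```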
> In my microbenchmark the single triangle approach was 0.2% faster than two. We are definitely deep into micro-optimization territory here :)<p>In the 3D graphics space, this kind of knuckle-shaving is deeply revered!
Would this still be true on a tiled rendering GPU, e.g. on mobile?<p>If not, is there any possibility that dividing a fullscreen quad into _more_ triangles would actually end up faster?