I wonder how this is implemented on the GPU. From my time working on a 3D renderer a long time ago, triangles with offscreen vertices would be clipped into smaller triangles, so in the end you would still be rendering multiple triangles anyway. I imagine it would also be possible to clip the scanlines instead.
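For what it's worth, the textbook algorithm for that clipping step is Sutherland-Hodgman. A minimal sketch in Python (my own naming; real GPUs clip in homogeneous clip space, not in 2D NDC as here): clipping the usual oversized fullscreen triangle against the viewport yields the screen quad, which would then be re-triangulated.

```python
def clip_halfplane(poly, inside, intersect):
    """Sutherland-Hodgman step: clip a polygon against one half-plane."""
    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]          # wraps around to the last vertex at i == 0
        if inside(cur):
            if not inside(prev):
                out.append(intersect(prev, cur))
            out.append(cur)
        elif inside(prev):
            out.append(intersect(prev, cur))
    return out

def clip_to_ndc(poly):
    """Clip a polygon to the [-1, 1]^2 NDC square, one edge at a time."""
    for axis in (0, 1):
        for sign in (1.0, -1.0):
            def inside(p, a=axis, s=sign):
                return s * p[a] <= 1.0
            def intersect(p, q, a=axis, s=sign):
                # Solve s * (p[a] + t * (q[a] - p[a])) == 1 for t.
                t = (s - p[a]) / (q[a] - p[a])
                return tuple(p[i] + t * (q[i] - p[i]) for i in range(2))
            poly = clip_halfplane(poly, inside, intersect)
    # Drop consecutive duplicate vertices the algorithm can emit on boundaries.
    return [p for i, p in enumerate(poly) if p != poly[i - 1]]

# The classic oversized fullscreen triangle in NDC:
tri = [(-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0)]
print(clip_to_ndc(tri))  # the four viewport corners, i.e. a quad
```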
A bigger reason to do this is that on some (shoddy) hardware, the user sees a tear line along the diagonal between the two triangles.<p>It's as if one triangle was sometimes rendered before the vsync while the other was rendered after it.
<p><pre><code> In actual hardware shading is done 32 or 64 pixels at a time, not four. The problem above just got worse.
</code></pre>
While it's true that there is "wasted" execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient.<p>I don't think it's publicly documented how the "packing" of quads into lanes is done in the rasterizer on modern GPUs. I'd guess it's something opportunistic (maybe per tile) that takes advantage of the general spatial coherency of triangles in mesh order.
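A toy model of that 2x2-quad derivative mechanism may help (conceptual only; `shade_quad` and the lane layout are my invention, and as noted above the real packing is not publicly documented): screen-space derivatives are just finite differences between neighbouring lanes in the quad, which is why pixels that fall outside the triangle along an edge still have to execute as helper lanes.

```python
def shade_quad(varying, x, y):
    """Evaluate a varying over a 2x2 pixel quad and take finite differences,
    mimicking how ddx/ddy-style derivatives are produced. Lanes outside the
    triangle would still be evaluated as "helper" lanes just to feed these
    differences -- that is the per-quad waste mentioned above."""
    v = {(dx, dy): varying(x + dx, y + dy) for dx in (0, 1) for dy in (0, 1)}
    ddx = v[(1, 0)] - v[(0, 0)]  # horizontal neighbour difference
    ddy = v[(0, 1)] - v[(0, 0)]  # vertical neighbour difference
    return ddx, ddy

# A varying that ramps 0.5 per pixel in x gives ddx = 0.5, ddy = 0.0:
print(shade_quad(lambda x, y: 0.5 * x, 10, 20))
```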
> In my microbenchmark the single triangle approach was 0.2% faster than two.<p>Sounds like something that would be within the margin of error? It seems especially meaningless because it reports just the <i>average</i> of the timings, rather than something that would show the distribution, like a histogram or KDE.
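To put a number on that intuition, here's a toy simulation (synthetic Gaussian timings, not the article's actual data): two samples drawn from the <i>same</i> distribution with ~2% jitter routinely show mean differences on the order of 0.2%.

```python
import random

random.seed(42)  # deterministic, just for the sake of the example
mean = lambda xs: sum(xs) / len(xs)

# Two "benchmarks" that are actually identical: ~1.0 ms mean, 2% jitter.
a = [random.gauss(1.0, 0.02) for _ in range(200)]
b = [random.gauss(1.0, 0.02) for _ in range(200)]

rel_diff = abs(mean(a) - mean(b)) / mean(a)
print(f"mean difference from noise alone: {rel_diff:.2%}")
```

Without some distributional view (histogram, percentiles, or at least a standard error), a 0.2% gap is indistinguishable from this kind of noise.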
This is interesting, but wouldn't the texture mapping / UVs be more confusing, possibly outweighing the benefit of the micro-optimisation?<p>The good thing about having 4 vertices is that you can just give each one a vertex position and a set of texture coordinates (x, y), and the texture maps exactly.
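It turns out the UVs cause no extra confusion: give the oversized triangle proportionally oversized UVs, and linear (barycentric) interpolation lands exactly on [0, 1] across the visible viewport. A quick check in Python (vertex values match the common gl_VertexID fullscreen-triangle trick; the helper names are mine):

```python
# NDC positions and UVs of the standard oversized fullscreen triangle.
verts = [((-1.0, -1.0), (0.0, 0.0)),
         (( 3.0, -1.0), (2.0, 0.0)),
         ((-1.0,  3.0), (0.0, 2.0))]

def barycentric(p, a, b, c):
    """Barycentric weights of point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    d = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    u = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / d
    v = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / d
    return u, v, 1.0 - u - v

def uv_at(p):
    """UV the rasterizer would interpolate at NDC point p."""
    w = barycentric(p, *(pos for pos, _ in verts))
    return tuple(sum(w[k] * verts[k][1][i] for k in range(3)) for i in range(2))

for corner in [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]:
    print(corner, "->", uv_at(corner))  # -> (0,0), (1,0), (0,1), (1,1)
```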
> In my microbenchmark the single triangle approach was 0.2% faster than two. We are definitely deep into micro-optimization territory here :)<p>In the 3D graphics space, this kind of knuckle-shaving is deeply revered!
Would this still be true on a tiled rendering GPU, e.g. on mobile?<p>If not, is there any possibility that dividing a fullscreen quad into _more_ triangles would actually end up faster?