Wednesday, September 10, 2008

So Where Is That Fast Path?

Among the many rants I've read about the new OpenGL 3.0 spec is the claim that the spec needs to be cleaned out and rebuilt so that people can "find the fast path".

What application developers are getting at with this is that OpenGL is a rich API, and not all cards do everything at top speed.  There is usually one fastest way to talk to OpenGL to get maximum throughput.

The problem is: this is ludicrous.  Case in point: the GeForce 8800 in my Mac Pro, running OS X 10.5.4 and Ubuntu 8.  So what is the fast path?

If I draw my terrain using stream VBOs for geometry and indices that are not in a VBO, I get 105 fps on Linux.  If I then put the indices into a stream VBO, I get 135 fps.  The fast path!

Well, not quite.  The index-without-VBO case runs at 73 fps on OS X, but once those indices go into a VBO, I crash down to 25 fps.  Wrong fast path.

Simply put, you can't spec the fast path, so the spec doesn't matter.  You find the fast path by writing a lot of code and trying it on a lot of hardware.  I can't see there ever being another way, given how many different cards and drivers are out there.
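
To make the point concrete, here is a minimal sketch that takes the four frame rates quoted above and picks the winner per platform. The dictionary layout and the path names are made up for illustration; only the numbers come from the measurements above.

```python
# Frame rates quoted above for the same GeForce 8800 under two drivers.
# The path names and this table layout are illustrative assumptions.
measured_fps = {
    "Linux": {"indices_in_client_memory": 105, "indices_in_stream_vbo": 135},
    "OS X": {"indices_in_client_memory": 73, "indices_in_stream_vbo": 25},
}


def fastest_path(results):
    """Return the name of the path with the highest measured fps."""
    return max(results, key=results.get)


# One winner per platform, chosen purely from measurements.
winners = {
    platform: fastest_path(results)
    for platform, results in measured_fps.items()
}

for platform, path in winners.items():
    print(f"{platform}: {path} at {measured_fps[platform][path]} fps")
```

The two platforms disagree about the winner on identical hardware, which is why a table like this has to be filled in by measurement rather than read out of a spec.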

Monday, September 08, 2008

Geometry Shader Performance on the 8800

So this is what I learned today: when it comes to geometry shaders and the 8800, triangle strips matter. Now after you read the details, this will seem so obvious that you can only conclude that I am a complete doofus (something I will not necessarily dispute). But the 8800 (like most modern cards) is so bloody fast that triangle strips are actually not a win in almost all other configurations.

The test: a mesh of 1000 x 1000 quads (each in turn is two triangles), being rotated. Using a single static vertex buffer with static indexes, this runs at around 50-55 fps. Each vertex has 8 components (XYZ, normal, texture ST).

Now some numbers:
  • The baseline is around 54 fps.
  • Cutting the geometry to a 500x500 mesh brings us to around 204 fps, which is what we expect for a vertex-bound operation. The pixel shading has been kept intentionally simple to achieve this result.
  • Using a geometry shader that simply passes the geometry through has no effect on fps.
  • Cutting the mesh to 500x500 and using a geometry shader that splits one triangle into four by emitting 12 vertices and ending 4 primitives (i.e. separate triangles) runs at a creeping 25 fps.
  • Cutting the mesh to 500x500 and using a geometry shader that splits one triangle into four by emitting 8 vertices and ending 2 primitives (i.e. strips) runs at 68 fps.
  • When using this strip-based geometry shader, sorting the mesh indices by strip format (e.g. 0 1 2 2 1 3 2 3 4 4 3 5) improves fps to 73 fps or so. When not using the geometry shader, this strip sorting has no impact.

Let's tease that mess apart and see what it means. Basically, my goal was to test the performance of "dynamically created" geometry (i.e. creating more vertices from fewer using a geometry shader) vs. "mesh updating" (i.e. periodically re-tessellating the mesh and saving the results to new VBOs). The latter technique's best performance is simulated by the 1000x1000 VBO in VRAM; the former by the geometry shader.
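
For anyone wondering what "sorting the mesh indices by strip format" looks like, here is a small sketch. It expands a triangle strip into a plain triangle list while keeping the strip's ordering and winding. The function name is hypothetical and this is not the code used for the test, just one way to produce an index order like the one quoted in the list.

```python
def strip_to_triangle_list(strip):
    """Expand a triangle strip into GL_TRIANGLES-style indices.

    Every odd-numbered triangle has its first two indices swapped so that
    all triangles keep the same winding, as a strip does implicitly.
    """
    indices = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        if i % 2 == 0:
            indices += [a, b, c]
        else:
            indices += [b, a, c]
    return indices


# A six-vertex strip becomes four triangles in strip order.
print(strip_to_triangle_list([0, 1, 2, 3, 4, 5]))
# → [0, 1, 2, 2, 1, 3, 2, 3, 4, 4, 3, 5]
```

Each triangle in the result shares an edge with the one before it, which is the property the strip-sorted index order has.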

As you can see, geometry shaders can outperform straight VBO drawing, but only if they are set up carefully. In particular, a geometry shader cannot output a single primitive containing multiple separate triangles, so if we want to draw distinct triangles, we have to end a lot of primitives. There is also no vertex indexing out the back of a geometry shader, so strips are a win.
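
Here is a rough CPU-side model of the two emission schemes, assuming a midpoint subdivision (the split actually used in the test may differ). Each inner list stands for one output primitive, i.e. the vertices emitted before a primitive is ended. All function names are hypothetical, and the model ignores winding and only checks which triangles get covered.

```python
def midpoint(p, q):
    """Midpoint of two points given as coordinate tuples."""
    return tuple((a + b) / 2 for a, b in zip(p, q))


def split_as_triangles(a, b, c):
    """Four separate triangles: 12 vertices, 4 primitives."""
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [[a, ab, ca], [ab, b, bc], [ca, bc, c], [ab, bc, ca]]


def split_as_strips(a, b, c):
    """One 5-vertex strip (3 triangles) plus one lone triangle:
    8 vertices, 2 primitives."""
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [[a, ab, ca, bc, c], [ab, b, bc]]


def triangles_in(primitives):
    """Triangles covered by a list of strips, as unordered vertex sets.
    A 3-vertex strip is a single triangle. Winding is ignored here."""
    tris = set()
    for strip in primitives:
        for i in range(len(strip) - 2):
            tris.add(frozenset(strip[i:i + 3]))
    return tris


def cost(primitives):
    """(vertices emitted, primitives ended) for one input triangle."""
    return sum(len(p) for p in primitives), len(primitives)


tri = ((0.0, 0.0), (2.0, 0.0), (0.0, 2.0))
separate = split_as_triangles(*tri)
strips = split_as_strips(*tri)

print(cost(separate))  # → (12, 4)
print(cost(strips))    # → (8, 2)
print(triangles_in(separate) == triangles_in(strips))  # → True
```

Both schemes cover the same four sub-triangles; the strip version just re-uses shared vertices and ends half as many primitives.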

(Contrast this to drawing out of a VBO - with indexing and multiple triangles per call, and a huge cost to restarting primitives, GL_TRIANGLES indexed is usually faster than strips.)

What's surprising here is not that strips are faster in the geometry shader, but that they are so much faster! With strips we've cut down the geometry data by about a third (from 12 vertices to 8), but we get an almost 3x improvement in throughput (from 25 fps to 68 fps). My theory is that emitting fewer primitives is what wins; we've cut down geometry and cut the number of primitives in half.
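
The ratios are easy to check with the numbers from the list above. This is just arithmetic on the quoted figures; the variable names are mine.

```python
# Per input triangle, from the two geometry shader variants.
vertices_as_triangles, primitives_as_triangles, fps_as_triangles = 12, 4, 25
vertices_as_strips, primitives_as_strips, fps_as_strips = 8, 2, 68

vertex_reduction = 1 - vertices_as_strips / vertices_as_triangles
primitive_reduction = 1 - primitives_as_strips / primitives_as_triangles
speedup = fps_as_strips / fps_as_triangles

print(f"vertex data cut by {vertex_reduction:.0%}")       # → 33%
print(f"primitive count cut by {primitive_reduction:.0%}")  # → 50%
print(f"throughput up by {speedup:.2f}x")                  # → 2.72x
```

The speedup is well beyond what the vertex savings alone would suggest, which is consistent with (though it does not prove) the theory that primitive count is the dominant cost.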

The moral of the story is: it pays to find a way to strip-ify the output of geometry shaders.