Thursday, June 18, 2015

glMapBuffer No Longer Cool

TL;DR: when streaming uniforms, glMapBuffer is not a great idea; glBufferSubData may actually work well in some cases.

I just fixed a nasty performance bug in X-Plane, and what I found goes directly against stuff I posted here, so I figured a new post might be in order.

A long time ago I more or less wrote this:

  • When you want to stream new data into a VBO, you need to either orphan it (e.g. get a new buffer) or use the new (at the time) unsynchronized mapping primitives and manage ranges of the buffer yourself. (Both approaches are sketched just after this list.)
  • If you don't do one of these two things, you'll block your thread waiting for the GPU to be done with the data that was being used before.
  • glBufferSubData can't do any better, and is probably going to do worse.
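For reference, here's roughly what those two approaches look like - a minimal sketch, not code from X-Plane, assuming a buffer vbo that we re-fill each frame (vbo, size, vertex_data, write_offset and chunk_size are made-up names):
// Option 1: orphan, then fill. glBufferData with a NULL pointer tells the
// driver we don't care about the old contents, so it can hand back fresh memory.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);
void * ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
memcpy(ptr, vertex_data, size);
glUnmapBuffer(GL_ARRAY_BUFFER);

// Option 2: unsynchronized map - we promise not to touch any range the GPU
// is still reading from, so the driver skips the synchronization entirely.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
ptr = glMapBufferRange(GL_ARRAY_BUFFER, write_offset, chunk_size,
                       GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(ptr, vertex_data, chunk_size);
glUnmapBuffer(GL_ARRAY_BUFFER);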
Five years is a long time in GPU history, and those rules don't quite apply.

Everything about not blocking on the GPU with map buffer is still true - if you do a synchronized map buffer, you're going to block hard.  Never do that!

But...these days on Windows, the OpenGL driver is running in a separate thread from your app. When you issue commands, it just marshals them into a FIFO as fast as it can and returns. The idea is to keep the app rendering time and driver command buffer assembly from being sequential.

The first problem is: glMapBuffer has to return an actual buffer pointer to you! Since your thread isn't the one doing the real work, this means one of two things:

  1. Blocking the app thread until the driver thread actually services the request, then returning the result. This is bad. I saw some slides a while back where NVidia said that this is what happens in real life.
  2. In theory, under just the right magic conditions, glMapBuffer could return scratch memory for use later. It's possible under the API if a bunch of stuff goes well, but I wouldn't count on it. For streaming into AGP memory, where the whole point was to get at the real VBO, this would be a fail.
It should also be noted at this point that, at high frequency, glMapBuffer isn't that fast. We still push some data into the driver via client arrays (I know, right?) because when we measured unsynchronized glMapBufferRange against just using client arrays and letting the driver memcpy, the latter was never slower and was in some cases much faster.*
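For reference, the client-array path is just something like this - a sketch, assuming a compatibility context where client-side arrays are still legal (cpu_vertices, vertex_t and vertex_count are made-up names):
// Client arrays: no VBO bound - hand the driver a CPU pointer and let it
// copy the data itself as part of the draw call.
glBindBuffer(GL_ARRAY_BUFFER, 0);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(vertex_t), cpu_vertices);
glDrawArrays(GL_TRIANGLES, 0, vertex_count);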

Can glBufferSubData Do Better?

Here's what surprised me: in at least one case, glBufferSubData is actually pretty fast. How is this possible?

A naive implementation of glBufferSubData might look like this:
void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
{
    GLvoid * ptr = glMapBuffer(target, GL_WRITE_ONLY);
    memcpy((char *) ptr + offset, data, size);
    glUnmapBuffer(target);
}
The synchronized map buffer up top is what gets you a stall on the GPU, the thing I was suggesting is "really really bad" five years ago.

But what if we want to be a little bit more aggressive?
void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
{
    if(offset == 0 && size == size_of_currently_bound_vbo)
        glBufferData(target, size, NULL, last_buffer_usage);
    GLvoid * ptr = glMapBuffer(target, GL_WRITE_ONLY);
    memcpy((char *) ptr + offset, data, size);
    glUnmapBuffer(target);
}
In this case, we have, in the special case of completely replacing the VBO, removed the block on the GPU. We know it's safe to simply orphan and splat.

What's interesting about this code is that the API to glBufferSubData is one-way - nothing is returned, so the code above can run in the driver thread, and the inputs to glBufferSubData can easily be marshaled for later use.  By keeping the results of glMapBuffer private, we can avoid a stall.

(We have eaten a second memcpy - one to marshal the data and one to actually blit it into the real buffer. So this isn't great for huge amounts of data.)

Anyway, from what I can tell, the latest shipping drivers from NVidia, AMD and Intel all do this - there is no penalty for doing a full glBufferSubData, and in the case of NVidia, it goes significantly faster than orphan+map.

A glBufferSubData update like this is sometimes referred to as "in-band" - it can happen either by the driver queuing a DMA to get the data into place just in time (in-band in the command stream) or by simply renaming the resource (that is, using separate memory for each version of it).
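From the app side, taking advantage of this is just a full-buffer update before each use - a minimal sketch, assuming a VBO we re-stream every frame (vbo and frame_vertices are made-up names):
// Full replacement: offset 0 and size == the whole buffer, so the driver is
// free to orphan or rename the buffer in-band instead of stalling.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(frame_vertices), frame_vertices);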

Using glBufferSubData on Uniforms

The test case I was looking at was with uniform buffer objects.  Streaming uniforms are a brutal case:

  • A very small amount of data is going to get updated nearly every draw call - the speed at which we update our uniforms basically determines our draw call rate, once we avoid knuckle-headed stuff like changing shaders a lot.
  • Loose uniforms perform quite well on Windows - but it's still a lot of API traffic to update uniforms a few bytes at a time.
  • glMapBuffer is almost certainly too expensive for this case.
We have a few options to try to get faster uniform updates:

  1. glBufferSubData does appear to be viable (see the sketch after this list). In very, very limited test cases it looks about the same as or slightly faster than loose uniforms for small numbers of uniforms. I don't have a really industrial test case yet. (This is streaming - we'd expect a real win when we can identify static uniforms and not stream them at all.)
  2. If we can afford to pre-build our UBO to cover multiple draw calls, this is potentially a big win, because we don't have to worry about small-batch updates. But this also implies a second pass in app-land or queuing OpenGL work.**
  3. Another option is to stash the data in attributes instead of uniforms. Is this any better than loose uniforms? It depends on the driver.  On OS X attributes beat loose uniforms by about 2x.
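Here's a rough sketch of option 1 - streaming a tiny UBO per draw call with a full-buffer glBufferSubData (draw_params_t, param_ubo, fill_draw_params and index_count are made-up names, not X-Plane code):
// Per-draw uniform streaming through a UBO. The full-buffer update lets the
// driver orphan or rename the buffer internally, as described above.
typedef struct {
    float model_matrix[16];
    float tint[4];
} draw_params_t;

draw_params_t params;
fill_draw_params(&params);                           // hypothetical app-side helper
glBindBuffer(GL_UNIFORM_BUFFER, param_ubo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(params), &params);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, param_ubo);   // UBO binding point 0 in the shader
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, 0);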
Toward this last point, my understanding is that some drivers need to allocate registers in your shaders for all attributes, so moving high-frequency uniforms to attributes increases register pressure. This makes it a poor fit for low-frequency uniforms. We use attributes-as-uniforms in X-Plane for a very small number of parameters where it's useful to be able to change them at a frequency close to the draw call count.
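For completeness, the attributes-as-uniforms trick looks roughly like this - a sketch, assuming attribute location 7 is reserved in the shader for a per-draw parameter (the location and tint values are made up):
// Attribute-as-uniform: with the array disabled, glVertexAttrib4f sets a
// "current" value that every vertex in the next draw call will see.
glDisableVertexAttribArray(7);
glVertexAttrib4f(7, tint_r, tint_g, tint_b, tint_a);
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, 0);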

I'm working on a comprehensive test engine now to assess performance on every driver stack I have access to. When I have complete data, I'll write up a post.



* The one case that is pathological is the AMD Catalyst 13-9 drivers - the last ones that support pre-DX11 cards. With those drivers, there is no caching of buffer mappings, so using map buffer at high frequency is unshippable. The current AMD glMapBuffer implementation for DX11 cards appears to have similar overhead to NVidia's.

** This is a case we can avoid in the next-gen APIs; since command buffers are explicitly enqueued, we can leave our UBO open and stream data into it as we write the command buffer, and know that we won't get flushed early. OpenGL's implicit flush makes this impossible.

Wednesday, June 10, 2015

OS X Metal - Raw Notes

Just a few raw notes about Metal as I go through the 2015 WWDC talks. This mostly discusses what has been added to Metal to make it desktop-ready.

Metal for desktop has instancing, sane constant buffers, texture barrier, occlusion query, and draw-indirect. It looks like it does not have transform feedback, geometry shaders or tessellation. (The docs do mention outputting vertices to a buffer with a nil fragment function, but I don't see a way to specify the output buffer for vertex transform. I also don't see any function attach points for geometry shaders or tessellation shaders.)

The memory model is "common use cases" - that is, they give you a few choices:
  • Shared - one copy of the data in AGP memory (streaming/dynamic in OpenGL). Coherency is apparently at command buffer boundaries, not continuous, so this does not appear to rely on map-coherent behavior (and I would assume has lower overhead).
  • Managed - what we have now for static geometry: a CPU side and GPU side copy that are synced. Sync is explicit by flushing CPU side changes (via didModifyRange).  Reading back changes from the GPU is explicit -and- queued via a synchronizeResource call on the blit command encoder. For shared memory devices (e.g. Intel?) there's only one copy (e.g. this backs off to "shared").
  • Private - the memory is entirely on the GPU side (possibly in VRAM) - the win here is that the format can be tiled/swizzled/whatever is fastest for the GPU. Therefore this is the right storage option for framebuffers.  Access to/from the data is only from blit command encoder operations.
  • Auto - a meta-format for textures - turns into shared on iOS (which has no managed) and managed on desktop - so that cross-platform code can do one thing everywhere. (This seems odd to me, because desktop apps will want to have some meshes be managed too.)
The caching model (e.g. write-combined) is a separate flag on buffer objects. Mapping is always available and persistent.

There appears to be no access to the parallel command queues that modern GCN2 devices have. Modern GPUs can run blit/DMA operations in parallel with rendering operations; this uses the hardware more efficiently but also introduces an astonishing amount of complexity into APIs like Mantle that tell developers "here's two async queues - good luck staying coherent."

Perhaps Metal internally offloads blit command buffers to the DMA queue and inserts a wait in the rendering encoder for the bad-luck case where the blit doesn't finish enough ahead of time.

Unlike Mantle, there is no requirement to manually manage the reference pool for the resources that a command queue has access to - this also simplifies things.

Finally, Mantle has a more complex memory model; in Mantle, you get big pools of memory from the driver and then jam resources into them yourself, letting games create pool allocators as desired. In Metal, everything's just a resource; you don't really know how much VRAM you have access to or how stuffed it is, and paging the managed pool is entirely within the driver.  (One exception: you can create a buffer directly off of a VM page with no copy, but this still isn't the same as what Mantle gives you.)

As the guy who would have to code to these APIs, I definitely like Apple's simpler, more automatic model a lot more than the state change rules for Mantle; naively (having not coded it) it seems like the app logic to handle state change would be either very complex and finicky or non-optimal. But I'd have to know what the cost in performance is of letting the driver keep these tasks. Apple's showing huge performance wins over OpenGL, but that doesn't validate the idea of driver-managed resource coherency; you'd have to compare to Mantle to see who should own the task.

I can imagine AAA game developers being annoyed that there isn't a pooling abstraction like in Mantle, since the "pool of memory you subdivide" model works well with what console games do on their own to keep memory use under control.

Overall, based on what I've read of the API, Metal looks like a solid, well-thought-out next-generation API; similar to Mantle in how it reorganizes the work-flow, but less complex. It's still missing some modern desktop GPU functionality, but moving Metal to discrete hardware with discrete memory hasn't turned the API into a swamp.

Finally, from an adoption stand-point: it looks to me that Metal on OS X gives Apple a way to try to leverage its strong position in mobile gaming to move titles to the desktop, which is a more viable sell than trying to use a more modern OpenGL to move titles from PC. (Only having a DirectX clone would help with that.)