Wednesday, August 26, 2009

Updating Textures On The Fly With Threads

In my previous post I described how to load textures "in the background" using threads. It's a riff on loading GL objects on a thread - in this case, we use an atomic object to swap the ID of a stub resource for the real resource.

What if we want to update a texture while rendering? Ideally we'd like to have the texture updating running 100% asynchronously from the render loop, with no blocking or locks. The short of it is that this is possible by using a second texture (which is swapped in using atomic operations); why you can't just call glTexImage2D from a thread has a more complex answer.

The answer lies in Appendix D of the OpenGL specification, which describes the rules for state update when an object is changed. The basic rules are:
  • You always see state change immediately in your own context.
  • You don't see a change in another context until you bind the object at least once after the state change is known to have completed.
That second point is a really monstrous condition. Here are a few ways this can kill you:
  • The pointer to storage of an object is state. If a command changes that storage (e.g. respecify a texture in another format), the state rules apply!
  • You don't see the state until after you bind, but if you're dealing with a stale pointer, the state may not be useful before you bind. I'm not sure how actual GL implementations manage this - I've seen white flashes for textures on OS X when incorrectly dealing with state propagation.
  • The spec requires completion in the server, e.g. after a "finish" (or a sync object in the new 3.2 spec, but that's another blog post). That's a pretty harsh condition, and I believe that most GL drivers will accept a flush. Newer, more modern drivers may require a real sync and not a flush - glFlush only works because there is serialization in communication to the card.
When you put this all together, it becomes clear that you need to have your rendering thread stop using the object before the async load starts, then restart using it with a rebind after it ends.

That's an ugly enough condition that with X-Plane we simply double-buffer: we allocate a whole new texture object, then use atomic operations to swap it in by ID, then deallocate the old one (on the worker thread). If you think about how texture memory works, there are only two other options:
  • Client code holds off rendering while the async happens (the texture itself could be in an inconsistent state).
  • The driver double-buffers for you, so that the old bind of the old texture isn't invalid. It's unclear to me from the spec whether this is required of the driver and what the cost would be. It strikes me as inefficient at best.
By double-buffering with two texture objects we also get around the problem of having to rebind - since the new texture has a new ID, we have to bind anyway.
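The swap itself can be tiny. Here's a minimal sketch of the double-buffer idea, using C++11's std::atomic as a stand-in for the platform atomic operations discussed below; the GL calls are stubbed out with plain integer IDs, and swap_in_new_texture is my name for illustration, not X-Plane's:

```cpp
#include <atomic>

typedef unsigned int GLuint;        // stand-in; normally from the GL headers

// The texture ID the renderer reads every frame, lock-free.
std::atomic<GLuint> g_tex_id;

// Worker thread: after building the new texture under a new ID and
// flushing, swap it in. Returns the old ID so the worker can
// glDeleteTextures it - the renderer never has to stop.
GLuint swap_in_new_texture(GLuint new_id)
{
    // exchange is atomic: the renderer sees either the old ID or the
    // new one, never a half-written value.
    return g_tex_id.exchange(new_id);
}
```

The worker calls this only after the flush completes, then deletes the returned old texture on its own thread.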

Atomic Safety

My previous post suggests using atomic operations to switch texture IDs in and out after you've done a threaded load. Atomic operations have the nice property of not blocking, and they're often very low overhead. But...what makes an atomic operation safe?

Well, it turns out you need to be safe from four things, two in your compiler and two in hardware.

(As a side note, the Apple perf-optimization list is a great resource...I didn't know about sequence points until this post.)

Code Reordering

An optimizing compiler will try to reorder code in ways that it thinks might be faster but won't change the meaning of execution. Since the compiler doesn't know or care what other threads are doing, this could be really bad.

C/C++ have the notion of sequence points - basically the optimizer isn't allowed to move code around a sequence point. There are a few types of sequence points but the most useful one is the function call. Call a function, and you've divided your code in half.

Volatility

In C/C++ volatile memory is memory that the compiler has to write to when you tell it to. Normally the compiler might analyze the side effects of a chunk of code and change exactly how it talks to memory to improve performance. Between sequence points, the compiler can do what it wants as long as the end result is the same.

But when a variable is volatile, the compiler must write to it once each time your code says to. The original classic use would be for I/O registers, where the write operation does something, and you need to do it exactly once or twice, or however many times the code says.

Generally you'll want the variables you use as atomics to be volatile, so that a lock gets written when you want it to be.

Atomicity

The next thing you need to be sure of is that your atomic operation really happens all at once. How this happens depends on the CPU, but fortunately you don't care - as an applications writer you use an operating system function like InterlockedExchangeAdd or __sync_add_and_fetch. All the major operating systems have these functions.

For example, incrementing a variable would traditionally be three instructions: a load, an add, and a store. This isn't very atomic; something could happen between the load and store. Using some CPU-specific technology, the OS atomic operation guarantees that the operation happens all at once. If you have a system with multiple cores and multiple caches, the atomic operation makes sure that future code isn't fooled by old cached values.
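To make that concrete, here is the lost-update increment fixed with the GCC builtin named above - a toy sketch assuming gcc/clang and pthreads (on Windows the analog is InterlockedIncrement):

```cpp
#include <pthread.h>

// Two threads each bump a shared counter 100000 times. With a plain
// "g_counter++" (a load, an add, and a store) increments can be lost
// between threads; the builtin makes the whole read-modify-write one
// indivisible step.
static volatile int g_counter = 0;

static void* bump(void*)
{
    for (int i = 0; i < 100000; ++i)
        __sync_add_and_fetch(&g_counter, 1);    // atomic increment
    return 0;
}

int run_counter_demo(void)
{
    pthread_t a, b;
    pthread_create(&a, 0, bump, 0);
    pthread_create(&b, 0, bump, 0);
    pthread_join(a, 0);
    pthread_join(b, 0);
    return g_counter;   // always 200000 with the atomic version
}
```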

Barriers

The final piece of the puzzle is memory barriers. Basically some CPUs have some freedom to reorder the reading and writing of memory. Normally this is done in a way that still produces consistent results. But if you are going to write data, and then write "unlock", you don't want those two writes reversed - writing "unlock" first could allow another thread to look at the data before it's actually there.

A memory barrier is a synchronization point in the stream of reads and writes...basically all the memory operations on one side of the barrier complete before the ones on the other side begin.
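Here's a sketch of that write-data-then-unlock pattern using GCC's full barrier (publish and try_consume are illustrative names, not a real API):

```cpp
// Writer fills in the payload, then publishes it. The barrier keeps the
// CPU (and, since it's a function call, the compiler too) from moving
// the payload write after the flag write.
static int          g_payload = 0;
static volatile int g_ready   = 0;

void publish(int value)
{
    g_payload = value;
    __sync_synchronize();   // full barrier: payload is written first
    g_ready = 1;            // "unlock" - now safe for readers to look
}

// Reader: only touch the payload after seeing the flag.
int try_consume(int* out)
{
    if (!g_ready)
        return 0;
    __sync_synchronize();   // matching barrier on the read side
    *out = g_payload;
    return 1;
}
```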

Four In One

Fortunately operating systems typically give you all four properties in one place:
  • The atomic operations are functions, so they are sequence points.
  • They take inputs of type volatile T *.
  • They are atomic in implementation.
  • They have a barrier built in (or there is a variant that includes a barrier - use that one).
So are you safe? If you use an atomic op function with a barrier then yes.

Atomics and Threaded OpenGL

My previous post describes how to load a VBO or texture on a thread, then "send it" to the main thread for use. This is almost exactly how X-Plane loads its VBOs on worker threads.

For textures we use a slightly different strategy. The problem is that we (theoretically) need our textures immediately - that is, if we hit a part of the scene graph that references an unloaded texture, we still need to draw something.

Rather than have the state setup go into some fallback "untextured" mode, we use a proxy-and-switch scheme.
  1. When we first hit the need for a texture, we create one of our C++ objects that manages the texture. This texture object is inited to refer to a dummy gray texture that we use as our proxy for all unloaded textures.
  2. We queue the texture object to be loaded approximately whenever we get around to it. This might come up soon or it might take a while, depending on how much background work is being done.
  3. When a worker thread finally gets around to loading the texture, it loads the real image into a new texture "name" (GLuint texture ID).
  4. When the load is done and flushed, we do an atomic swap, swapping out the old gray proxy texture and swapping in the new texture.
The advantage of this is that the rendering code can be running at full speed, using this texture object (with its OpenGL texture ID inside it) without any thread safety or checking whatsoever; it's a solution that has zero cost for the rendering engine.

And since the C++ texture object always has something in it, we can use the same shaders even before we've loaded, which simplifies the casing for shader setup quite a bit.
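Boiled down, the proxy-and-switch scheme might look something like this - a sketch with invented names (PROXY_GRAY_TEX, tex_obj) and C++11 atomics standing in for the platform atomic swap:

```cpp
#include <atomic>

typedef unsigned int GLuint;              // stand-in for the GL header type

static const GLuint PROXY_GRAY_TEX = 1;   // shared "not loaded yet" texture

// Our C++ object that manages one texture.
struct tex_obj {
    std::atomic<GLuint> id;

    tex_obj() : id(PROXY_GRAY_TEX) {}     // step 1: starts as the gray proxy

    // Step 4: the worker calls this after load + glFlush. Returns the
    // proxy (or stale texture) that was swapped out, for cleanup.
    GLuint swap_in(GLuint loaded_id) { return id.exchange(loaded_id); }

    // Render thread: no locks, no checks - always a usable texture ID.
    GLuint bind_id() const { return id.load(); }
};
```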

Creating OpenGL Textures or VBOs On A Second Thread

It's always scary when I Google for a question and find my own blog post in the top few search results...it makes me think I've programmed my way out into the wilderness. Given how many forum questions involve people trying to figure out how to put OpenGL resource loading in a worker thread (and how many times they are told to just use one thread because it's a PITA) I figured I would blog what I have discovered through painful trial and error.

The previous post discusses setup of "loader threads" - that is, a worker thread with a shared GL context that is meant to be used to load textures and VBOs without interrupting the main thread from doing rendering work. The loader thread has a current GL context but usually no framebuffer, which is fine since we don't really want to draw.

What do we need to do to create a texture or VBO on this worker thread? The steps go something like this:
  1. Issue the GL commands to create the texture or VBO. Typically you'd have something like glBindTexture, glTexImage2D, glTexParamBlaBlaBla.

    Remember: GL state is separate for each thread - if you have OpenGL utility code that "caches" state in global variables, you'll need to make it thread safe or something. I use one set of utilities on the main thread (with caching) to render, and a separate set on worker threads; a debug-build-only check catches incorrect use of utilities on the wrong thread.

  2. Call glFlush...more details here. But basically if you don't force OpenGL to swallow what it's chewed, your texture-create commands might just be sitting around for a while.

    (Side note: as of August 2009, the bugs you get on Windows from not flushing a VBO built in a worker thread are truly spectacular visually...the crazy runways in the sky and other creepy effects are all due to a missing glFlush on a worker.)


  3. Finally, send a signal to the rendering thread to allow it to use the entity.

I use message queues for step 3 because I love message queues in a way that might not be healthy. To go with the "resource ownership" idea, once your worker has loaded and flushed the texture or VBO, you are "passing ownership" to the renderer via the message queue. The main thread can do something like this:
while (1) {
    scene_graph.draw();
    obj * new_goo = msg_queue.read(do_not_block);
    if (new_goo)
        scene_graph.insert(new_goo);
}

The cost to the main thread of this design is about as low as it can get - basically you only pay for bucketing your new geometry.
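For completeness, here's a toy version of the kind of message queue this implies (my own sketch, not X-Plane's code): a mutex-protected deque whose read never blocks, so the render loop pays almost nothing when there's no new geometry.

```cpp
#include <pthread.h>
#include <deque>

// A tiny message queue: workers push finished objects, the render
// thread polls with a non-blocking read, as in the loop above.
template <class T>
class msg_queue_t {
public:
    msg_queue_t()  { pthread_mutex_init(&m_lock, 0); }
    ~msg_queue_t() { pthread_mutex_destroy(&m_lock); }

    void write(const T& msg)
    {
        pthread_mutex_lock(&m_lock);
        m_items.push_back(msg);
        pthread_mutex_unlock(&m_lock);
    }

    // Non-blocking read: returns a default T (e.g. a NULL pointer) if
    // the queue is empty, so the render loop just keeps drawing.
    T read()
    {
        T result = T();
        pthread_mutex_lock(&m_lock);
        if (!m_items.empty()) {
            result = m_items.front();
            m_items.pop_front();
        }
        pthread_mutex_unlock(&m_lock);
        return result;
    }
private:
    pthread_mutex_t m_lock;
    std::deque<T>   m_items;
};
```

The brief mutex hold here is a simplification; a lock-free ring buffer would serve the same role.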

Monday, August 24, 2009

Putting OSM On A Diet

I've been fighting to cut an entire set of X-Plane scenery from OSM...the stickler: Boston. Why Boston happened to be the tile that had this problem I don't know, but basically I ran out of memory. The scenery creation process unfortunately isn't very space-efficient, and at one point it needs two copies of the roads: a "planar map" for area calculations and a separate "network" version for actual export. I just didn't have enough RAM for two road grids, plus other raw data and a 3-d triangle mesh.

Douglas-Peucker to the rescue - a 7m error limit on roads cuts data down by 25% in San Diego and 75% in Boston.
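For reference, a compact sketch of Douglas-Peucker itself (my own minimal 2-d version; the tolerance is in the same units as the coordinates, so real road data would need projected or scaled coordinates for a meters-based limit):

```cpp
#include <cmath>
#include <vector>

struct pt { double x, y; };

// Perpendicular distance from p to the line through a and b.
static double seg_dist(const pt& p, const pt& a, const pt& b)
{
    double dx = b.x - a.x, dy = b.y - a.y;
    double len = std::sqrt(dx * dx + dy * dy);
    if (len == 0.0)
        return std::sqrt((p.x - a.x) * (p.x - a.x) + (p.y - a.y) * (p.y - a.y));
    return std::fabs(dx * (a.y - p.y) - (a.x - p.x) * dy) / len;
}

// Douglas-Peucker: keep the endpoints; if every interior point is within
// 'tol' of the line between them, drop them all; otherwise split at the
// farthest point and recurse on both halves.
static void dp_recurse(const std::vector<pt>& in, int lo, int hi,
                       double tol, std::vector<pt>& out)
{
    int    worst   = -1;
    double worst_d = tol;
    for (int i = lo + 1; i < hi; ++i) {
        double d = seg_dist(in[i], in[lo], in[hi]);
        if (d > worst_d) { worst_d = d; worst = i; }
    }
    if (worst >= 0) {
        dp_recurse(in, lo, worst, tol, out);
        dp_recurse(in, worst, hi, tol, out);
    } else {
        out.push_back(in[hi]);  // interior points are all within tolerance
    }
}

std::vector<pt> simplify(const std::vector<pt>& in, double tol)
{
    std::vector<pt> out;
    if (in.empty()) return out;
    out.push_back(in.front());
    if (in.size() > 1)
        dp_recurse(in, 0, (int)in.size() - 1, tol, out);
    return out;
}
```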

Once I have the whole process working I might raise this limit back up again...the main culprit is small sections of road with very high node counts. In a few cases I found 7m spacing along a road! The data is definitely inconsistent.