Wednesday, December 22, 2010

Fun With GLSL Compilers

I've been poking at GPU ShaderAnalyzer; this Windows performance tool from ATI gives you a simple environment: you enter GLSL, hit compile, and look at the assembly that pops out. Perfect! (It also shows you execution cycles, but for my purposes, namely understanding what the compiler does for me, the assembly is key.)

Here are some things I have learned:
  • RV790 assembly is quite complex. (Thank goodness I don't have to code it myself.) ALU instructions consist of 5 scalar sub-instructions, only one of which can have transcendental opcodes. There's a bit of fine print; this and this make useful reading. One thing to note: the ALU has a number of 'small' tricks (absolute value, negation, clamping) 'for free'. Sometimes the compiler will use these tricks, sometimes not.

    Generally, if you write vectorized code (e.g. uniform work on a vec4) the scheduling will work out nicely. But the units of execution really are scalar, so it doesn't make sense to compute work that isn't needed.

  • The compiler inlines pretty much everything, which is just fine by me. (I have no idea if recursion is legal in GLSL, I'd never use it in production, but when I wrote a recursive factorial function, the compiler simply inlined 127 iterations and called it a day. Awesome!)

  • The compiler understands a reasonable amount of constant folding, including (most importantly) multiply by zero. For example: I write an expensive albedo function: pow(gl_Color,gl_Color.aaaa) and multiply it by a light function that returns vec4(0.0). The result: the compiler nukes the entire code sequence and simply loads 0.

    (BTW, pow is expensive - since only one of the five ALU slots can run log and exp, raising each color channel to a non-constant power takes eight instruction groups! Ouch.)

  • The compiler will remove conditional code when the condition is fully known at compile time. So for example, an if statement where the comparison comes from functions that return constants will be nuked, and one of its two clauses is deleted.

  • The compiler does not seem to do inference, at least in the one case I looked at. By inference I mean: if (max(0.6,gl_FragColor.r) > 0.3) will (ignoring NaN logic) always be true, regardless of gl_FragColor. But for the compiler to know this, it has to make an inference - that is, it has to compare the range [0.6..inf) with 0.3. My understanding is that LLVM can do this kind of thing, but when I tried it in a shader I simply got the full, expensive conditional code. Moral of the story: use and apply your human brain. :-)
Now...why do I care?

X-Plane's physical shader is based on conditional compilation - that is, for any given state vector of "tricks" we want to use, we recompile the shader with some #defines at the front which turn features on and off. The result is a large number of shaders, none of which need conditional logic in-shader. Fill rate isn't consumed by features we don't use. (This technique comes from our original use of GLSL to emulate and then improve on the fixed function pipeline. To match fixed-function performance, we had to 'compile out' anything we didn't use, particularly for first-gen DX9 hardware, which doesn't give you conditional logic for free.)

The problem with this technique (and you can see this in the X-Plane 9 shaders) is that it doesn't scale well with code size. For version 10 we've done a lot of shader work, and hand-optimizing the conditional logic is getting more and more difficult.

My conclusion from observing the compiler is that 99% of the time, I can relax a little bit and let the compiler take care of optimizing the shaders down. In particular, if I define functions for each stage of the shader and use conditional compilation to 'simplify' the rule, then the simple cases will boil down to very few instructions. For example:
float calc_spec()
{
#if has_spec
    return pow(max(0.0,dot(eye_nrm,sun_vec)),128.0);
#else
    return 0.0;
#endif
}
void main()
{
    ....
    float s = calc_spec();
    gl_FragColor = albedo * lighting * shadow + ambient + shadow * vec4(s,s,s,0.0);
}
In this mess, our specularity function is subject to conditional removal for non-shiny materials. When we do this, not only is the actual specularity calc removed, but the compiler will figure out that 's' will always be 0.0 and nuke the MAD of shadow * specularity into the final lighting sum.

That's a trivial example, but it shows the principle of structuring the components separately and letting the compiler put the mess together.

As a final note: the compiler's optimization is not perfect; I suspect the above technique will 'leak' a few instructions in the simple cases relative to a one-off carefully hand-coded GLSL shader, and the GLSL isn't going to be quite as tight in a few cases as actually writing assembly.

But I can live with that, most of the time. We can always go and hand tune the performance cases that absolutely matter most, and the time saved working on the huge mess that is the conditional shader gives me the time to do that hand optimization where it is most important.

Monday, December 20, 2010

Lisp Isn't a Language...

Someone sent me this fun list of programming quotes. From Alan Kay:
Lisp isn't a language, it's a building material.
And that material is dung.

Thursday, December 09, 2010

What OOP Isn't

When I took my first computer science class (I had already been programming on my own for a while) the department was going through something of a civil war. Some of the department had gotten religion in the form of object oriented programming (OOP) and were trying to thrust it on everything. They made us implement a linked list in an OOP way: every node was fully encapsulated!

(If you're wondering how this works, most list editing operations had to be stack-recursive - a node would set its 'next' to the return value of a call on its next, allowing a node to 'cut itself out'. It made it very hard for students who had never used linked lists to understand what was going on, because they had to learn recursion and linked lists at the same time. The result was something with the performance of LISP and elegance of C++. It was horrible.)

They told us that OOP involved encapsulation, polymorphism, and inheritance; I have commented in the past on why this last idea is often just a poor idea. At the time, in school I only had enough programming experience to say that what we were being taught (all OOP, all the time, period) was a lot more difficult than what I had been doing (use an object for something big, like a game character) and was producing code that was more verbose and not particularly fast. Now that I have some software engineering experience, I think I can articulate the problem more precisely.

When talking to a new programmer who is looking at OOP and trying to figure out what it's all about, I say that the relative importance of encapsulation, polymorphism, and inheritance is approximately 90%, 10%, 0% respectively. The vast majority of the value of OOP is that it provides an idiom to keep one piece of code from becoming hopelessly intertwined with other pieces of code, and that's valuable in large software projects. It's also impossible to teach to undergraduates because they never have a chance to write enough code for it to matter.

Polymorphism is nice, but in my experience it's not as useful as encapsulation. If you have a polymorphic interface, you have an interface, which means that it's encapsulated...but there are plenty of cases where an interface is one-off and has no polymorphic properties. Maybe 90%-10% is harsh, but I think it's the encapsulation that matters. It may be that some product spaces are more polymorphic than others. WorldEditor (LR's open source scenery editor) has polymorphic hierarchies for most of its core components, while X-Plane itself has very few.

I bring this up because I'd like to advance (in a future blog post) a comparison of OOP techniques to others (for real software engineering problems), but OOP comes with a bit of baggage. The notions that OOP would make us better programmers, help us write bug-free code faster, or help bad programmers become good programmers have all proven to be naively optimistic. (In particular, bad programmers have proven to be surprisingly resourceful at writing bad code given virtually any programming idiom.)

So I'd like to define (OOP - hype) as something like: good language support for idioms that make encapsulation and sometimes polymorphic interfaces faster to code. And that's useful to me! I could code the same thing in pure C, but it would make my repetitive stress injuries worse from more typing, so why do that?

FMTT, GLSL Edition

This is in the same vein as I Hate C -- all of its derivatives are contaminated with its brain damage.
gl_FragData[0] = vec4(tex_color.rgb * gl_Color.rgb*tex_color.a,clamp(tex_color.a + lit_color.a,0.0,1.0));
gl_FragData[1] = vec4(shiny_ao * cut_pos, cut_pos*position_eye.z/-1024.0, 0.0, cut_pos);
gl_FragData[2] = vec4(normal_eye_use.xyz*cut_pos, cut_pos);
gl_FragData[3] = vec4(lit_color.rgb + tex_color.rgb * gl_FrontLightModelProduct.sceneColor.rgb, (tex_color.a + lit_color.a,0.0,1.0));

Sunday, December 05, 2010

Yet Another This-Is-Our-GBuffer-Format Post

If you read just about any presentation by a game studio (E.g. for GDC or Sigraph) on deferred rendering or deferred lighting, they'll probably discuss their G-Buffer format and how they tried to pack as much information into a tiny space as possible. X-Plane 10 will feature a deferred renderer (to back the global spill feature set). And...here is how we pack our G-Buffer.

The good news with X-Plane is that we don't have a complex legacy material system to support. If a game is well-batched, the forward renderer can associate separate shaders with batches by material, and each 'material' can thus be radically different in how it handles/transfers light. With a G-Buffer, the lighting equation must be unified and thus everything we need to know to support some common lighting format must go into the G-Buffer. (One thing you can do is pack a material index into the G-Buffer.) Fortunately, since X-Plane doesn't have such a multi-shader beast, we only had one property to save: a shininess ratio (0-1, about 8 bits of precision needed).

What we did have to support was both a full albedo and a full RGB emissive texture; additive emissive textures have been in the sim for over a decade, and authors use them heavily. (They can be modulated by datarefs within the sim, which makes them useful for animating and modulating baked lighting effects.) X-Plane 10 also has a static shadow/baked ambient occlusion type term on some of the new scenery assets that needs to be preserved, also with 8 bits of precision.

The other challenge is that X-Plane authors use alpha translucency fairly heavily; deferred renderers don't do this very well. One solution we have for airplanes is to pull out some translucent surfaces and render them post-deferred-renderer as a forward over-draw pass. But for surfaces that can't be pulled out, we need to do the least-bad thing.

The Format

Thus we pack our G-Buffer into 16 bytes:
  1. RGBA8 albedo (with alpha)
  2. RG16F normal (Z vector is reconstructed)
  3. RG16F depth + shadow-merged-with-shininess
  4. RGBA8 emissive (with alpha)
When drawing into the G-Buffer, we use full blending for the albedo and emissive layers, but we always set the alpha for the depth/normal layers to 0.0 or 1.0, thus using the alpha blend as a per-render-target "kill" switch. This is based on the level of translucency. Thus if something is highly transparent, we keep the physical position of what is behind it (light passes through) but if it is opaque enough, we over-write the position/normal (light bounces off of it). It's not perfect, but it's the least bad thing I could come up with.
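
To make the write side concrete, here is a rough sketch of what the G-Buffer fill output might look like (this is my own illustrative fragment code, not X-Plane's actual shader - names like tex_color, normal_eye, position_eye, and shadow_shine are placeholders), with alpha blending enabled on all four targets:

// 'opaque' is the per-render-target "kill" switch: 1.0 overwrites the
// depth/normal samples behind this fragment, 0.0 keeps them.
float opaque = (tex_color.a > 0.5) ? 1.0 : 0.0;

gl_FragData[0] = vec4(tex_color.rgb, tex_color.a);                // RGBA8 albedo, blended normally
gl_FragData[1] = vec4(normal_eye.xy, 0.0, opaque);                // RG16F normal (Z reconstructed later)
gl_FragData[2] = vec4(position_eye.z, shadow_shine, 0.0, opaque); // RG16F depth + packed shadow/shine
gl_FragData[3] = vec4(emissive.rgb, emissive.a);                  // RGBA8 emissive, blended normally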

(As a side note, if we had a different layout, we could blend the shininess ratio, for example, when we keep a physical position fragment, to try to limit shininess on translucent elements.)

Note that on OS X 10.5 red-green textures are not available, so we have to fall back to four RGBA_16F textures, doubling VRAM. This costs us at least 20% fill-rate on an 8800, but the quick sanity test I did wasn't very abusive; heavier cases are probably a bit worse.

So far we seem to be surviving with a 16-bit floating point eye-space depth coordinate. It does not hold up very well in a planet-wide render though, for rendering techniques where reconstructing the position is important (e.g. O'Neil-style atmospheric scattering). A simple workaround would be to calculate the intersection of the fragment ray with the planet directly by doing a transform of the planet sphere from model-view space. (E.g. if we know that our fragment came from a sphere, why not just work with the original mathematical sphere.)

16F depth does give us reasonably accurate shadows, at least up close, and far away the shadows are going to be limited in quality anyway. I tried logarithmic depth, but it made shadows worse.

Packing Shadowing and Shine

For the static shadow/AO ("shadow") term and the level of specularity ("shine") we have two parameters that need about 8 bits of precision, and we have a 16-bit channel. Perfect, right? Well, not so much.
  • NVidia cards won't render to 16-bit integer channels on some platforms.
  • ATI cards don't export a bit-cast from integer to floating point, making it hard to pack a real 16-bit int (or two 8-bit ints) into a float.
  • If we change the channel to RGBA8 (and combine RG into a virtual 16F) we can't actually use the fourth byte (alpha) because the GL insists on dumping our alpha values. Extended blend would fix this but it's not supported on OS X and even on Windows you can't use it with multiple render targets.
So we can't actually get the bits we pay for and that sucks. But we can cheat. The trick is: the deferred renderer will cut out specular highlights that are in shadow. Thus as the shadow term becomes stronger, the shine term becomes unimportant.

So we simply encode 256.0 * shadow + shine. The resulting quantity gives shininess over 8 bits of precision without shadow, and reduces shininess to around 2 bits in full shadow. If you view the decoded channels separately you can see banding artifacts come in on the shine channel as the shadows kick in. But when you view the two together in the final shader, the artifacts aren't visible at all because the shadow masks out the banded shininess.

(What this trick has done is effectively recycle the exponent as a 'mixer', allocating bits between the two channels.)
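
A sketch of what the packing might look like in GLSL - this is my own reconstruction rather than X-Plane's exact code: quantizing the shadow term to integer steps at encode time is an assumption I've made to keep the decode simple (it matches the "256·shadow + shine" idea up to the quantization step), and it's the 16F storage that erodes the shine bits as the shadow term grows:

// Encode (G-Buffer fill pass). shadow and shine are both in [0..1];
// shine is assumed to stay just below 1.0 so it lives in the fraction.
float shadow_i = floor(shadow * 255.0 + 0.5);   // quantize shadow to integer steps 0..255
float packed   = shadow_i + shine;              // stored in one 16F channel

// Decode (lighting pass). As shadow_i grows, the 16F value has fewer mantissa
// bits left for the fractional (shine) part, so shine bands - but the shadow
// term masks the specular highlight anyway.
float shadow_out = floor(packed) / 255.0;
float shine_out  = fract(packed);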

Future Formats

A future extension would be to add a fifth 4-byte target. This would give us room to extend to full 32-bit floating point depth (should that prove to be useful), with shadow and shine in 8 bit channels with one new 8-bit channel left. Or alternatively we could:
  • Keep Z at 16F and keep an extra channel.
  • Technically if we can accept this 'lossy' packing we can get four components into an RG16F, while we can only get 3 pure 8-bit components. (This is due to how the GL manages alpha.)
  • If we target hardware that provides float-to-int bit casts, we could have four 8-bit components in the new channel.

Saturday, December 04, 2010

Semaphore Follow-Up: NPTL

A quick follow-up from my previous post on condition variables, etc. With NPTL (the pthreads implementation on Linux) a lot of the original issues I was trying to cope with don't exist. Some things NPTL does:
  • pthread mutexes are spin-sleep locks, so they can be used as short-term critical sections without too much trouble. Given a moderately contested but shortly held lock, this is a win.
  • sem_t semaphores have an atomic counter to avoid system calls in the uncontested case. When inited privately (sem_init) they appear to be lean and mean.
  • All synchronization is done around futexes, ensuring that uncontested cases can be managed with atomic operations. (The OS X pthreads library at least uses spin locks around user space book-keeping for the uncontested case, but I think the futex code path is faster.)
There is one case where using a condition variable really would be superior to a semaphore on Linux: if you really want a condition variable (and aren't just using it to build a semaphore). In particular, the futex system call that helps NPTL sleep threads as needed has special operations to move a thread from one queue to another while asleep. This fixes the thundering herd problem when every thread on a condition is woken up at once. This isn't something I use, but if you need it, NPTL makes it fast.

Friday, December 03, 2010

Performance of Semaphore Vs. Condition Variable

I was looking at the performance of X-Plane's inter-thread messaging. Inter-thread message costs haven't been a huge concern to us in the past because the jobs sent to worker threads tend to be quite large, amortizing overhead, and they are usually returned asynchronously, so we don't care about latency.

X-Plane 10 features a threaded flight model (or rather, the flight model of all airplanes is executed in parallel). This is a case where we do care about latency (since we can't proceed until the flight model completes*) and the job size is not that big, so overhead matters more.

Our old message queue was based on a pthread condition variable. It was taking about 200 µsec to launch all workers, and a Shark analysis showed what I can only describe as a "total train wreck" of lock contention. The workers were waking up and immediately crashing into each other trying to acquire the mutex that protects the condition variable's "guts".

I replaced the condition variable + mutex implementation of the message queue with the implementation that already ships on Windows: a critical section to protect contents plus a semaphore to maintain count. I implemented the semaphore as an atomic wrapper around a mach semaphore. (See the libdispatch semaphore code for an idea of how to do this.) The results were about 2x better: 80 µsec to launch from sleep, about 45 µsec of which were spent signaling the mach semaphore (5-10 µsec per call) plus a little bit of overhead.
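
For reference, here is roughly what such an atomic wrapper looks like - a sketch in the spirit of libdispatch's semaphore, not our actual shipping code:

#include <atomic>
#include <mach/mach.h>

// An atomic counter in front of a Mach semaphore: the kernel is only touched
// when a thread actually has to block (wait with no pending signal) or be
// woken (signal with a blocked waiter).
class fast_semaphore {
public:
    explicit fast_semaphore(int initial_count = 0) : count_(initial_count) {
        semaphore_create(mach_task_self(), &sem_, SYNC_POLICY_FIFO, 0);
    }
    ~fast_semaphore() {
        semaphore_destroy(mach_task_self(), sem_);
    }
    void signal() {
        // If the old count was negative, at least one thread is blocked in wait().
        if (count_.fetch_add(1) < 0)
            semaphore_signal(sem_);
    }
    void wait() {
        // If the old count was zero or negative, there is no pending signal; block.
        if (count_.fetch_sub(1) <= 0)
            semaphore_wait(sem_);
    }
private:
    std::atomic<int> count_;
    semaphore_t      sem_;
};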

So why was the new implementation faster? I dug into the pthreads source to find out.

What Happens With PThreads

The pthread implementation on OS X (at least as of 10.5.x whose source I browsed, it's in libc btw) goes something like this:
  • A spin lock protects the "guts" of both a pthread mutex and a pthread condition variable.
  • An OS semaphore provides the actual blocking for a pthread mutex and a pthread condition variable.
Mutex lock/unlock is therefore not as good as a real spinlock (for the cases where spinlocks are, um, good) but it does have a fast path where, if the lock is uncontested, you can get in, take the lock and get out without ever having to make a system call.

The condition variable sequence is a little bit trickier. A signal to a condition variable with no waiters is a fast early exit, but we have to grab the spin lock twice if we signal someone. The condition wait is the painful part. pthread_cond_wait has to reacquire the locking mutex once the waiter is signaled. This is the source of the train wreck; if a lot of messages are written to a lot of worker threads at once, the worker threads bottleneck trying to reacquire the mutex that is tied to the condition variable that woke them up.

Better With Semaphores

When using the atomic semaphore/critical section design, the advantage we get is significantly finer-grained locking. The lock on our message queue now only protects the queue items themselves, and is never held while we are inside pthread code. This gets us through significantly faster. It is inevitable that if we are going to queue work to many threads via one queue, there's going to be some bottleneck trying to get the messages out. This design minimizes the cost of getting a message out and thus minimizes the serialization path.

(Shark doesn't have quite the res to measure it in the code example I have running now, but I suspect that some of the cost still in the system is the spin time waiting for messages to be pulled.)

Part of the win here is that we are using a spin lock rather than a sleep lock on the message queue; while the pthread mutex will fast-case when uncontested, it will always sleep the thread if even two threads are trying to acquire the mutex. If the protected piece of code is only a few instructions, a spin is a lot cheaper than the system call to block on a semaphore.

Better Throughput When Pre-Queued

One of the things that made the old design go haywire in the version 10 test case was that it was a two-level set of message queues, with the main job list containing a message to go check another queue for the actual work. (Believe it or not there is a good design reason for this, but that's another post.) In this two-level design the semaphore + critical section hits its fast case. The second queue is already full by the time we start waking up workers. Therefore the second queue's semaphore is positive, which means that the decrement in the message dequeue will hit the fast atomic operations case and be non-blocking. The workers then simply have to grab the spin lock, pop the message, and be done.

(Again, the pthread implementation can fast-case when the mutex is uncontested, but it doesn't spin.)

To forestall the comments: this is not a criticism of the performance of pthread condition variables and mutices; performance could certainly also have been boosted by using a different set of pthread synchronizers. It is mainly an observation that, because the pthread mutex is heavier than a spin lock, condition variables may be expensive when you wanted a light-weight lock but didn't get one.

Future Performance Boosts

There is one additional performance boost I have been looking at, although I am not sure when I will put it into place. The message queue (which is a lock around an STL list right now**) could be replaced with a ring buffer FIFO. With the FIFO design we would maintain:
  • A pair of semaphores, counting filled and free entries. Reading/writing blocks until the needed filled/free entry is available.
  • Atomic read/write pointers that can be advanced once the semaphore indicates that there is content/free space to advance to.
This design would have the advantage that it doesn't need to spin, with message read/write through the ring buffer being almost fully independent. Furthermore, since we have atomics wrapping our semaphores, the common case (reading when there are messages, writing when there is free space) is non-blocking and non-spinning.
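
A sketch of what that ring buffer might look like (illustrative only - it assumes a power-of-two capacity and, to keep the sketch honest, a single producer and single consumer; 'fast_semaphore' is the atomic-fronted semaphore sketched earlier in this post):

#include <atomic>

template <class T, unsigned N>          // N must be a power of two
class msg_ring {
public:
    msg_ring() : free_(N), filled_(0), read_(0), write_(0) {}

    void push(const T& msg) {
        free_.wait();                                   // block until a slot is free
        unsigned slot = write_.fetch_add(1) & (N - 1);  // claim the next write slot
        buf_[slot] = msg;
        filled_.signal();                               // publish one filled slot
    }
    T pop() {
        filled_.wait();                                 // block until a message exists
        unsigned slot = read_.fetch_add(1) & (N - 1);   // claim the next read slot
        T msg = buf_[slot];
        free_.signal();                                 // recycle the slot
        return msg;
    }
private:
    fast_semaphore        free_, filled_;
    std::atomic<unsigned> read_, write_;
    T                     buf_[N];
};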

Two things stop me from dropping this in:
  • Our current message queue API assumes that messages can always be written and that writing is reasonably fast. (The current write only needs to acquire a spin lock.) If the ring buffer can fill, we could block indefinitely until space is available, or we have to expose to client code that writes can time out. That's a pretty big change.
  • The current queue has a lock-and-edit operation that is used to optimize the case where queued jobs have to be flushed. Since the ring buffer FIFO is lock free, we can't really lock it to inspect/edit it.
* We looked at implementing a truly asynchronous flight model, but the interaction between third party add-ons, the flight model, and the rendering engine was such that the synchronization issues were too large for a relatively small win (the reduction of latency of one flight model). Running the flight models in parallel against each other gets us most of the win in the cases where the flight model really costs us something.

Tuesday, November 30, 2010

Is 1% A Lot?

When optimizing code, is a 1% optimization (that is, an optimization that reduces run time by 1%) a lot? Well, yes and no. For any code optimization, we have to look at two important factors: leverage and repeatability.

Leverage

Let's say our game spends 90% of the time drawing the 3-d world and 10% drawing the UI. Your leverage ratios are therefore 0.9 for 3-d and 0.1 for the UI. Those are the scaling factors that discount the value of your optimizations. So a 1% optimization to the rendering engine will give you a 0.9% speed boost overall, while a 1% optimization to the UI will give you only a 0.1% speed boost overall.

So my first answer is: 1% is not a lot unless you have a lot of leverage. If your leverage ratio is 0.05 you don't have a lot of leverage, and even a 20% optimization isn't going to produce noticeable results.

This is why using an adaptive sampling profiler like Shark is so important. A Shark time profile is your code sorted by leverage. Let's take a look at a real Shark profile to see this.

This is a Shark profile of X-Plane 9.62 on my Mac Pro in an external view with pretty much the default settings. I have used Shark's data mining features to collapse and clean the display. Basically, time spent in the sub-parts of the OpenGL driver has been merged into libGL.dylib, and libSys is merged into whoever calls it. We might not do this for other types of optimizations, but here we just want to see "who is expensive" and whether it's us or the GL.

The profile is "Timed Profile (All Thread States)" for just the app. This captures time spent blocking. Since X-Plane's frame-rate (which is what we want to make fast) is limited by how long it takes to go around the loop including blocking, we have to look at blocking and focus on the main thread. If we were totally CPU bound (e.g. other worker threads were using more than total available CPU resources) we might simply look at CPU use for all non-blocking threads. But that's another profile.

What do we see? In the bottoms-up view we can see our top offenders, and how much leverage we get if we can improve them. In order of maximum leverage:
  1. glDrawElements! With 35.6% of thread time (that is, a leverage ratio of 0.356) the biggest single thing we could do to make X-Plane faster is to make glDrawElements execute faster. 0.356 is a huge number for an app that's already been beaten to within an inch of its life with Shark for a few years.

    This isn't the first time I have Sharked X-Plane, so I can tell you a little bit of the back story. This Shark profile is on a GeForce 8800 running OS X 10.5.8; a lot of the time is OpenGL state sync in the driver (that is, the CPU preparing to send instructions to the card to change how it renders), and some is spent pushing vertex data in a less-than-efficient manner. This number is a lot smaller on 10.6 or an ATI card.

    While this call isn't in our library, we still have to ask: can we make it faster? The answer, as it turns out, is yes. When you look at what glDrawElements is doing (by removing the data mining) you'll see most of its time is spent in gldUpdateDispatch. This is the internal call to resynchronize OpenGL state. So it's really us changing OpenGL state that causes most of the time spent in glDrawElements. If we can find a way to change less state, we get a win.

  2. Our quad-tree comes up next with 12.9%. That's still a big number for optimization. If we could make our quad tree faster, we might notice.

    Once again, I have looked at this issue before; in the case of the quad tree the problem is L2 cache misses - that is, the quad tree is slow because the CPU keeps waiting to get pieces of the quad tree from main memory. If we could change the allocation pattern or node structure of the quad tree to have better locality, we might get a win.

    (How do we know it's an L2 cache issue? Shark will let you profile by L2 cache misses instead of time. If a hot spot comes up with both L2 and time, it indicates that L2 misses might be the problem. If a hot spot comes up with time but not L2 misses, it means you're not missing cache; something else is making you slow.)

  3. Third on the list is a real surprise - this routine tends the 3-d meshes and sees if anything needs processing. To be blunt, this stat is a surprise and almost certainly a bug, and I did a double-take when I saw it. More investigation is needed; while 7.6% isn't the biggest performance item, this is an operation that shouldn't even be on the list, so there might be a case where fixing one stupid bug gets us a nice win.

  4. plot_group_layers is doing the main per-state-change drawing - it's not a huge win to optimize because the leverage isn't very high and the algorithm is already pretty optimal - that is, real work is being done here.

  5. Mipmap generation is done by the driver, but if we can use some textures without auto-mipmap generation, it might be worth it.

  6. glMapBufferARB is an example of why we need to use "all thread states" - this routine can block, and when it does, we want to see why our fps is low - because our rendering thread is getting nothing done.

  7. glBegin is down at 2% (leverage ratio 0.02), and this is a good example of cost-benefit trade-offs. X-Plane still has some old legacy glBegin/glEnd drawing code. That code is old and nasty and could certainly be made faster with modern batched drawing calls. But look at the leverage: 0.02. That is, if we were able to improve every single case of glBegin by a huge factor (imagine we made it 99% faster!!) we'd still see only a 2% frame-rate increase.

    Now 2% is better than not having 2%, but it's the quantity of code that's the issue. We'd have to fix every glBegin to get that win, and the code might not even be that much faster. Because the code is so spread out and the leverage is low, we let it slide. (Over the long term glBegin code will be replaced, but we're not going to stop working on real features to fix this now.)

The profile also shows a top-down view. This view gives you a strategic view of where the overall leverage might be. We're spending 77% of our time in scenery. Right there we can say: most of the leverage is in the scenery engine - a lot more than might be in the flight model. (In fact, the entire flight model is only 5.7%.) Most of this time is then in the DSF. The airplane shows up (9.6%) but within it, the vast majority is the OBJs attached to the airplane.

In fact, if you look at the two profiles together, the low level leverage and strategic view start to make sense. Most of that glDrawElements call is due to OBJs drawing, so it's no surprise that the airplane shows up because it has OBJs in it.

Repeatability

So the value of an optimization is only as good as its leverage, right? Well, not quite. What if one of my shaders can be optimized by 50%, but it's one of ten shaders? Well, that optimization could be worth the full 50% if I repeat the optimization on the other shaders.

In other words, if you can keep applying a trick over and over, you can start to build up real improvement even when the leverage is low. Applying an optimization to multiple code sites is a trivial example; more typically this would be a process. If I can spend a few hours and get a 1% improvement in shader code, that's not huge. But if I can do that every day for a week, that might turn into a 10% win.

So to answer the question "is 1% a lot", the answer is: yes if the leverage is there and you're going to keep beating on the code over and over.

Monday, November 29, 2010

Basis Projection

In my previous post I described a transform matrix as a sort of kit of "premade parts" that each vertex's components call out. Thus a vertex of (1, 0.5, 0) means "use 1 measure of the X basis from the matrix, 0.5 measures of the Y basis, and skip the Z basis, we don't want any of that." You can almost see a matrix as a form of encoding or compression, where we decode using the "premade parts" that are basis vectors. (I must admit, this is not a very good form of compression, as we don't save any memory.)

So if our model's vertices are encoded (into "model space") and a model positioning matrix decodes our model into world space (so we can show our model in lots of places in the world), how do we encode that model? If we have one house in world space, how do we encode it into its own model space?

One way to do so would be "basis projection". We could take the dot product of each vector in our model* with each basis vector, and that ratio of correlation would tell us how much of the encoding vector we want. What would a matrix for this encoding look like?
[Xx Xy Xz]
[Yx Yy Yz]
[Zx Zy Zz]
where we are encoding into the vector X, Y, and Z, described in the world coordinates.

So we have put our basis vectors for our model into the rows of the matrix to encode, while last time we put them into the columns to decode.

Wait - this isn't surprising. The encode and decode matrices are inverses and the inverse of an orthogonal matrix is its transpose.

Putting this together, we can say that our orthogonal matrix has the old basis in the new coordinate system in its columns at the same time as it has the new basis in the old coordinate system in its rows. We can view our matrix through either the lens of decoding (via columns - each column is a "premade part") or encoding (via rows - how closely do we fit that "premade part").
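
A tiny sketch of the encode direction, just to make the "projection" view concrete (bx, by, bz are the model's orthonormal basis vectors expressed in world coordinates - that is, the columns of the decode matrix and the rows of the encode matrix):

float dot3(const float a[3], const float b[3]) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

// Encode a world-space point into model space by asking how much of each
// basis vector it "uses" - i.e. multiplying by the transpose of the decode matrix.
void encode(const float p_world[3],
            const float bx[3], const float by[3], const float bz[3],
            float p_model[3]) {
    p_model[0] = dot3(p_world, bx);
    p_model[1] = dot3(p_world, by);
    p_model[2] = dot3(p_world, bz);
}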

Why bring this up? The previous post was meant to serve two purposes:
  1. To justify some particularly sparse implementations of camera operations. (E.g. how did we get the code down to so few operations, why is this legal?)
  2. To try to illustrate the connection between the symbols and geometric meaning of matrices and vectors.
The observation that basis projection and application are two sides of the same coin is a segue into my next post, which will be on the core foundation of spherical harmonics, which is basically the same basis trick, except this time we are going to get compression - a lot of it.

* As in the previous article, we can think of our model as a series of vectors from the origin to the vertices in our mesh.

Sunday, November 28, 2010

Change of Basis, Revisited

A while ago I suggested that we could find the billboard vectors (that is, the vectors aligned with the screen in model-view space) simply by looking at the matrix itself. A commenter further pointed out that we could simply transpose the upper 3x3 of our model-view matrix to invert the rotational component of our matrix. Let's look at these ideas again.

Transform Matrix As Basis Vectors

If we have a 3x3 matrix T of the form:
[a d g]
[b e h]
[c f i]
Then when we multiply a vector (x,y,z) by this matrix T, x gives us more or less of (a,b,c), y of (d,e,f) and z of (g,h,i). In other words, you can think of x, y, and z being orders for premade "amounts" of 3 vectors. In the old coordinate system, x gave us one unit of the 'x' axis, y one unit of the 'y' axis, and z one unit of the 'z' axis. So we can see (a,b,c) as the old coordinate system's "x" axis expressed in the new coordinate system, etc.

This "change of basis" is essentially an arbitrary rotation about the origin - we are taking our model and changing where its axes are. Use the data with the old axes and you have the model rotated. So far everything we have done is with vectors, but you can think of a point cloud as a series of vectors from the origin to the data points.

Big Words

A lot of math for computer programmers is just learning what the mathematicians are talking about - math has a lot of vocabulary to describe ideas. Sometimes the words are harder than the ideas.

Our rotation matrix above is a set of orthonormal basis vectors. (We call the matrix an orthogonal matrix.) What does that mean? It means two things:
  • Each basis vector is normal - that is, its length is 1.
  • Each basis vector is orthogonal to all other basis vectors - that is, any two basis vectors are at right angles to each other.
We'll come back to these mathematical properties later, but for now, let's consider what this means for our view of a transform matrix as using "premade pieces" of the coordinate axes.
  • Our model isn't going to change size, because each basis vector is of length 1 (and in the original coordinate system, 1 unit is 1 unit by definition).
  • Because the axes are all orthogonal to each other, we're not going to get any squishing or dimension loss - a cube will stay cubic. (This would not be true if we had a projection matrix! Everything we say here is based on fairly limited use of the model-view matrix and will not be even remotely true for a projection matrix.)
Translation Too

If we want to reposition our model completely (rotate and translate) we need a translation component. In OpenGL we do that like this:
[R R R x]
[R R R y]
[R R R z]
[0 0 0 1]
In this matrix, R is our 3x3 rotation matrix, which is orthogonal, and (x,y,z) is the offset to apply to all points. If we apply this to a point of the form (px,py,pz,1), the math works out to first rotate the point by R, and then add the offset (x,y,z) after rotation. The last coordinate 'w' will remain "1" for future use. The upper left 3x3 matrix is orthogonal, but the whole matrix is not; if this were a 4-dimensional basis change, the vector (x,y,z,1) would not necessarily be normalized, nor would it be perpendicular to all other bases.

There's a name for this too: an affine transformation matrix. If we have a linear transform (of which a set of orthonormal basis vectors is one) plus a translation, with 0 0 0 1 on the bottom row, we have an affine transformation.

Affine Transformation As Model Position

If we have an affine transform matrix built from three orthonormal basis vectors (our old axes in the new coordinate system) plus an offset, we have everything we need to position a model in 3-d space. Imagine we have a house model, authored as points located around an origin. We want to position it at many locations in our 3-d world and draw it each time. We can build the transform matrix we want. Typically we'd do a sequence like:
glTranslate(x,y,z);
glRotate(heading,0,1,0);
glRotate(pitch,1,0,0);
glRotate(roll,0,0,1);
That is, first move the model's origin to this place in world space, then rotate it. This forms an affine matrix where the right-most column is x,y,z,1 and the left three columns are the locations of the model's X, Y, and Z axes in world space (with 0 in the last component). The bottom row is 0 0 0 1.

Usually it's cheaper storage-wise to store the component parts of a model's transform than the entire transform matrix. But if we do want to store the object in the format of a transform, we do know a few things:
  • The bottom row is 0 0 0 1 so we can simply delete it, cutting our matrix down from 16 floats to 12.
  • We can get the object "location" in world-space directly out of the right-hand column.
  • The upper-left 3x3 matrix contains all of the rotations.
If we have only used rotations and not mirroring, we can decode the Euler angles of the rotation from this upper left 3x3; that'll have to be another post.

The main reason to store model location as a matrix (and not the original offset and rotation angles) is for hardware instancing; we can stream a buffer of 12-float matrices to the GPU and ask it to iterate over the mesh using something like GL_ARB_draw_instanced. But if you're on an embedded platform and stuck in fixed point, replacing angles (which require trig to decode) with the matrix might also be a win.

glScale Wasn't Invited To The Party

We cannot scale our model using glScale and still have an orthonormal basis; doing so with any scale values other than 1 or -1 would make the basis vectors be non-unit-length, and then it would not be orthonormal. We can have mirroring operations - there is no requirement that an orthonormal basis maintain the "right-handedness" of a coordinate system.

With X-Plane we don't do either of these things (scale or mirror); in the first case, we don't need it, and it's a big win to know that your upper left 3x3 is orthonormal - it means you can use it to transform normal vectors directly. (If the upper left 3x3 of your model-view scales uniformly, your normals change length, which must be fixed in shader. If your model-view scales non-uniformly, the direction of your normals gets skewed.) We don't mirror because there's no need for it, and for a rendering system like OpenGL that uses triangle direction to back-face-cull, changing the coordinate system from right-handed to left-handed would require changing back-face culling to match.

Stupid Matrix Tricks

There are a few nice mathematical properties of orthogonal matrices. First, the transpose of the matrix is its inverse. To demonstrate this, take an orthogonal matrix and multiply it by its transpose. You'll find that every component is either the dot product of two orthogonal vectors or a unit length vector with itself, thus it forms the 0s and 1s of an identity matrix.

That's cool right there - it means we can use a fast operation (transpose) instead of a slow operation (invert) on our upper left 3x3. It also means that the inverse of an orthogonal matrix is orthogonal too. (Since the inverse is the transpose, you can first invert your matrix by transposing, then multiply that new matrix by its transpose, which is the original, and multiply the components out - you'll find the same identity pattern again.)

If an orthogonal matrix's inverse is orthogonal, so is its transpose, and from that you can show that multiplying two orthogonal matrices forms a new orthogonal matrix - that is, batching up orthogonal matrix transforms preserves the orthogonality. (You can calculate the components of the two matrices, and calculate the transpose from the multiplication of the transposes of the sources in opposite order. When you manipulate the algebra, you can show that the multiplication of the two orthogonal matrices has its transpose as its inverse too.)

If we know that an orthogonal matrix stays orthogonal when we multiply them, we can also show that affine matrices stay affine when we multiply them. (To do this, apply an affine matrix to another affine matrix and note that the orthogonal upper left 3x3 is only affected by the other matrices' upper 3x3, and the bottom row stays 0 0 0 1.) That's handy because it means that we can pre-multiply a pile of affine transform matrices and still know that it's affine, and that all of our tricks apply.

Camera Transforms

Camera transforms are funny beasts: when we move the camera, we do the opposite transforms of moving the model, and we do them in the opposite order. So while we positioned our model by first translating, then rotating, we position our camera by first rotating, then translating, and we do everything in the opposite order. (Consider if we want to position the camera at 10,0,0 in world space, we really achieve this by moving the model to -10,0,0.)

This means we can't recover information about our camera location directly from the modelview matrix. But thanks to all of the properties above, we can make camera angle recovery cheap.

Our orthogonal matrix contains the location of the old coordinate system's axes (in terms of the new coordinate system) as columns. But since its transpose is its inverse, it also contains the location of the new coordinate system's axes (in terms of the old coordinate system) as rows. Our model view matrix transforms from world space to camera space. So we have the axes of the camera space (that is, the "X" axis of camera space/eye space is an axis going to the right on your monitor) in terms of world space, right there in the rows. Thus we can use the first, second and third row of the upper left 3x3 of an affine transform matrix to know the billboard vectors of an affine modelview matrix.

The location of the camera is slightly trickier. The right-most column isn't the negative of the position of the camera. Why not? Well, remember our "model positioning" transform was a translate-then-rotate matrix, with the translation in the right column. But camera transforms happen in the opposite order (negative-rotate, then negative-translate). So the location of the camera is already "pre-rotated". Doh.

Fortunately we have the inverse of the rotation - it's just the transpose of the orthogonal 3x3 upper left of our affine matrix. So to restore the camera location from a model-view matrix we just need to multiply the upper right 1x3 column (the translation) by the transpose of the upper left 3x3 (the rotation) and then negate.
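
Putting both tricks into code, this is roughly what the recovery looks like (a sketch: 'm' is a column-major OpenGL modelview matrix, indexed m[column*4 + row], and the upper left 3x3 is assumed orthonormal - no scaling or shearing):

void decode_modelview(const float m[16],
                      float right[3], float up[3], float cam_pos[3])
{
    // Rows of the upper left 3x3 are the eye-space axes expressed in world space.
    right[0] = m[0]; right[1] = m[4]; right[2] = m[8];   // screen-right (eye-space X)
    up[0]    = m[1]; up[1]    = m[5]; up[2]    = m[9];   // screen-up    (eye-space Y)

    // Camera position = -(transpose of the rotation) * translation column.
    for (int i = 0; i < 3; ++i)
        cam_pos[i] = -(m[i*4+0]*m[12] + m[i*4+1]*m[13] + m[i*4+2]*m[14]);
}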

Having the camera direction and location in terms of the modelview matrix is handy when we have code that applies a pile of nested transforms. If we have gone from world to aircraft to engine coordinates (and the engine might be canted), we're in pretty deep. What are the billboarding vectors? How far from the engine are we? Fortunately we can get them right off of the modelview matrix.

EDIT: one quick note: an affine transform is still affine even if the upper left matrix isn't orthogonal; it only needs to be linear. But if we do have an orthogonal matrix in the upper left of our affine matrix, multiplying two of these "affine-orthogonal" matrices together does preserve both the affine-ness of the whole matrix and the orthogonality of the upper left. I'm not sure if there is an official name for an affine matrix with an orthogonal upper left.

Friday, November 26, 2010

More STL Abstraction

A while ago I made the claim that the STL is not an abstraction because the specification it implements is so specific with regard to performance that it's really just an implementation. Response was swift and furious, and my contrast of the STL to something like an SQL query may have been lost.

Which begs the question: why is anyone reading this? What the heck? Chris and I have been terrified to find anything from this blog on the first page of a Google search or reposted somewhere. So if you're reading this: there will be no refunds at the end of this article if you feel you've lost 5 minutes of your life you'll never get back. You have been warned.

Anyway, I was looking at one case of the STL where you actually don't know what kind of performance you'll get unless you know things about your specific implementation. When you insert a range into a vector, the insert code has two choices:
  1. Iterate the range once, incrementally inserting. This may cause excess reallocation (since we don't know how much more memory we need until we're done), and that reallocation may in turn cause excess copy construction and destruction.
  2. Measure the distance of the range. This lets us do one allocation and a minimum number of copies, but if the iterator doesn't have random access differencing, it would mean the range is iterated twice.
The GCC 4 STL that we use will only pick choice 2 if the iterator is known to be random access. (I haven't scrubbed the traits system to see how good it is at determining this when iterators don't use appropriate tags.) Whether this is a win or not can't be known by the STL, as it doesn't know the relative cost of iteration vs. object copy vs. memory allocation.
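
If you do know your data, you can take the decision out of the library's hands; for example (a sketch, not a recommendation for every case):

#include <iterator>
#include <list>
#include <vector>

// "Choice 2" by hand: std::distance walks the list once (its iterators aren't
// random access), then insert walks it again - two traversals, but only one
// allocation and one round of copies.
void append_all(std::vector<int>& dst, const std::list<int>& src)
{
    dst.reserve(dst.size() + std::distance(src.begin(), src.end()));
    dst.insert(dst.end(), src.begin(), src.end());
}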

I have seen other cases where you can't quite guess what STL performance might be. For example, some implementations cache list size for O(1) list::size() while others do not cache and have to actually traverse the list. The SGI STL documentation does declare what the worst behavior is, so I have no right to complain if the list size isn't cached.

My argument isn't that the STL should always do the right thing by reading my mind. My argument is that because the STL is such a low level of abstraction, and because it serves such a low level purpose in code, the performance of the implementation matters. There may not be one right container for the job, and in trying to decide between a vector and list, whether I get single-allocate-insert on vector or constant-time size on the list might matter.

Fortunately in real development this turns out to be moot; if performance matters, we'll run an adaptive sampling profiler like Shark on the app, which will tell us whether for our particular usage and data the STL is under-performing. In a number of cases, our solution has been to toss out the STL entirely for something more efficient; as long as that's on the table we're going to have to profile, which will catch STL implementation differences too.

Tuesday, November 23, 2010

The Value Of Gamma Compression

A number of engineers all had the same response to the sRGB/Gamma thread on the Mac OpenGL list: life would be a lot easier if color were linear. Yes, it would be easier. But would it be beautiful?

The answer is: not at 8 bits, and definitely not with DXT compression.

The following images show a gray-scale bar, quantized to: 16, 8, 6, and 5 bits per channel. (16-bits per channel would be typical of a floating point, HDR, or art asset pipeline, 8-bits is what most apps will have to run on the GPU, and 5/6 bits simulate the banding in the key colors of DXT-compressed textures, which are 5-6-5.)





In the images labeled "srgb" (gamma is 1.0) the colors are quantized in sRGB (non-linear) space. Because sRGB is perceptually even, the banding appears to be even to a human - it's a good use of our limited bits. 8-bit color is pretty much smooth, and artifacts are minimized for 5 and 6 bits (although we can definitely see some banding here).

Now what happens if we quantize in linear space? You'd get this:





Note: the program generates these ramps in sRGB space (hence they are "evenly spaced"), converts to linear, quantizes, then converts back. So this is what your textures would look like if your art assets were converted to and stored linearly.
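
For the curious, the round trip is simple enough to sketch (this is my reconstruction of the experiment, not the original program, and it uses a pure 2.2 power curve as an approximation of the piecewise sRGB transfer function):

#include <cmath>

float srgb_to_linear(float c) { return std::pow(c, 2.2f); }
float linear_to_srgb(float c) { return std::pow(c, 1.0f / 2.2f); }

// Snap a [0..1] value to the nearest of 2^bits levels.
float quantize(float c, int bits) {
    float levels = float((1 << bits) - 1);
    return std::floor(c * levels + 0.5f) / levels;
}

// Quantize in linear space, then convert back for display...
float roundtrip_linear(float srgb, int bits) {
    return linear_to_srgb(quantize(srgb_to_linear(srgb), bits));
}
// ...versus quantizing directly in sRGB space.
float roundtrip_srgb(float srgb, int bits) {
    return quantize(srgb, bits);
}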

What can we see? Well, if we have 16-bits per channel we're still okay. But at 8-bits (the normal way to send an uncompressed texture to the GPU) we have visible banding in the darker regions. This is because linear isn't an efficient way to space out limited bits for our eyes.

The situation is really bad for the 6 and 5-bit compressed textures; we have so little bandwidth that the entire dark side of the spectrum is horribly quantized.

The moral of the story (if there is one): gamma is your friend - it's non-linear, which is annoying for lighting shaders, but when you have 8 bits or less, it puts the bits where you need them.

Monday, November 22, 2010

I hate C, part 492.

FMTT.

HIWindowCopyColorSpace_f = (CGColorSpaceRef (*)(WindowRef))

CFBundleGetFunctionPointerForName, tib, CFSTR("_HIWindowCopyColorSpace");

HIWindowSetColorSpace_f = (OSStatus (*)(WindowRef,CGColorSpaceRef))

CFBundleGetFunctionPointerForName, tib, CFSTR("_HIWindowSetColorSpace");

The set of legal C programs is just not that different from the set of ASCII strings.

Gamma and Lighting Part 3: Errata

A few other random notes for working with lighting in linear space...

Light Accumulation

In order to get correct lighting, lights need to be accumulated (that is, the contribution of each light to any given pixel) in linear space. There are three ways to do this safely:
  1. Accumulate all lights in a single pass. The sum happens in your shader (which must calculate each light's contribution and add them). This is typical of a traditional one-pass forward renderer, and is straightforward to convert to linear space. The lighting calculation is done linearly in the shader, converted to something with gamma, then written to an 8-bit framebuffer.

    Clamping and exposure control are up to you and can happen in-shader while you are in floating point.

  2. Multi-pass with sRGB_framebuffer. If you want to accumulate lights by blending (e.g. using blending to add more "light" into the framebuffer) and your framebuffer is 24-bit RGB, you'll need the sRGB_framebuffer extension. This will cause OpenGL to not only accept linear fragments from you, but the blending addition will happen in linear space too.

    In this case, exposure control is tricky; you don't want to saturate your framebuffer, but no one lighting pass knows the total exposure. You'll have to set your exposure so that the conversion to sRGB doesn't "clip".

  3. Multi-pass with an HDR framebuffer. The other option for accumulating light by blending is to use a floating point framebuffer. This configuration (like multi-pass with sRGB) might be typical of a deferred renderer (or a stencil-shadow-volume solution).

    Unlike sRGB into a 24-bit framebuffer, this case doesn't have a clipping problem, because you can write linear values into the floating point framebuffer. You do need to "mix down" the floating point framebuffer back to sRGB later.

What you cannot do is convert back to sRGB in your shader and then use regular blending. Doing so will add lights in sRGB space, which is incorrect and will lead to blown out or clipped lights wherever they overlap.

What Color Space Is The Mac Framebuffer?

On OS X 10.5 and older, your framebuffer will have the color profile of the device, at least as far as I can tell. This probably means that the effective gamma is 1.8 (since the OS's color management should adjust the LUT enough to achieve this net result).

On OS X 10.6 every window has its own color profile, and the window manager color-converts as needed to get correct output on multiple devices. Edit: By default you get the monitor's default color profile, but you can use HIWindowSetColorSpace to change what color profile you work in.

Gamma and Lighting Part 2: Working in Linear Space

In my previous post I tried to describe the process of maintaining color sync. Two important things to note:
  • Most color spaces that are reasonable for 24-bit framebuffers aren't linear. Twice the RGB means a lot more than twice the luminance.
  • This is good from a data-size standpoint, because 8 bits per channel isn't enough to be linear.
But there's a problem: this non-linear encoding is not good if we're going to perform 3-d calculations to create computer-generated images of light sources. Note that this is not an issue of monitor calibration or which gamma curve (Mac or PC) you use; no color space with any gamma is going to be even remotely close to linear luminance. So this is almost a 'wrong format' issue and not a 'wrong calibration' issue.

Consider: light is additive - if we add more photons, we get more light. This is at the heart of a computer graphics lighting model, where we sum the contribution of several lights to come up with a luminance for an RGB pixel. But remember the math from the previous post: doubling the RGB value more than doubles the luminance from your monitor.

In order to correctly create lighting effects, we need to:
  1. Convert from sRGB to linear color.
  2. Do the lighting accumulation in linear color space.
  3. Convert back to sRGB because that's the format the framebuffer needs.
Doing this makes a huge difference in the quality of lighting. When physical lighting calculations are done directly in sRGB space, intermediate light levels are too dark (cutting the sRGB value in half cuts the luminance by a factor of five!) and additive effects become super-bright in their center. I found that I can also set ambient lighting to be lower when using correct linear lighting because the intermediate colors aren't so dark. (With intermediate colors dark, you have to turn up ambience or the whole image will be dark.)
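
Here is a minimal sketch of steps 1-3 done by hand in a fragment shader (the texture and light names are placeholders, and the pure 2.2 power curve stands in for the piecewise sRGB transfer function); the next section shows how to let the GPU do the conversions instead:

uniform sampler2D albedo_tex;
uniform vec3 light0, light1;     // per-light contributions, computed in linear space
varying vec2 uv;

vec3 srgb_to_linear(vec3 c) { return pow(c, vec3(2.2)); }
vec3 linear_to_srgb(vec3 c) { return pow(c, vec3(1.0 / 2.2)); }

void main()
{
    vec3 albedo = srgb_to_linear(texture2D(albedo_tex, uv).rgb);  // step 1
    vec3 lit    = albedo * (light0 + light1);                     // step 2: accumulate lights linearly
    gl_FragColor = vec4(linear_to_srgb(lit), 1.0);                // step 3
}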

Let the GPU Do It

The OpenGL extensions GL_EXT_texture_sRGB and GL_ARB_framebuffer_sRGB basically do steps 1 and 3 for you; when you set a texture's internal type to sRGB, the GPU converts from sRGB to linear space during texel fetch. When framebuffer_sRGB is enabled, the GPU converts from linear back to sRGB before writing your fragment out to the framebuffer. Thus your shader runs in linear space (which is fine because it has floating point precision) while your textures and framebuffer are sRGB like they've always been.*

The advantage of using these extensions on DirectX 10 hardware is that the conversion happens before texture filtering and after framebuffer blending - two operations you couldn't "fix" manually in your shader. So you get linear blending too, which makes the blend of colors look correct.

Of course, your internal art asset format has to be sRGB in order for this to work, because it's the only color space the GL will convert from and back to.

* The question of whether your framebuffer is sRGB or linear is really more a question of naming convention. If you go back 10 years, you know a few things: writing RGB values into the framebuffer probably produces color on the monitor that is close to what you'd expect from sRGB, but the GL does all lighting math linearly. So it's really sRGB data being pushed through a linear pipeline, which is wrong and the source of lighting artifacts.

Gamma and Lighting Part 1: Color Sync

The topic of color spaces, gamma, and color correction gets complex and muddled quickly enough that I had to delete my post and start over. Fortunately a number of folks on the Mac OpenGL list set me straight. This first post will discuss how to control color when working with textures; the second one will address how to light those textures in 3-d.

This FAQ explains Gamma infinitely better than I possibly can. I suggest reading it about eight times.

24-bit Color Spaces Are Pretty Much Never Linear

A color space defines the meaning of RGB colors (that is, triplets of 8-bit or more codes). How red is red? That's what your color space tells you. And the first thing I must point out is: no color space you'd ever use on an 8-bit-per-channel (24-bit color) display is ever going to have a linear mapping between RGB values and luminance. Why not? This Wikipedia article on sRGB puts it best:
This nonlinear conversion means that sRGB is a reasonably efficient use of the values in an integer-based image file to display human-discernible light levels.
Here's what's going on:
  • Our perception of color is non-linear. It's roughly logarithmic. This means we're more discerning of light level changes at low light levels. That's good - we can see in the dark, sort of, or at least this improves our dynamic range of light perception.
  • 256 levels of gray isn't a lot, so it's important that the levels be spread out evenly in terms of what humans perceive. Otherwise we'd run out of distinct gray levels and see banding.
  • Therefore, to make 24-bit images look good, pretty much every color space including sRGB is radically different from linear luminance levels.
When we discuss images with different gamma below, let's just keep this in the back of our mind: pretty much any sane color space is going to have some gamma if it will be used for 24-bit images, and that's good because it lets us avoid banding and other artifacts. It will also become important when we get to lighting.

Part of the confusion over gamma correction is that CRTs have non-linear response by nature of their electronics. People assume that this is bad and that gamma correction curves make the RGB->luminance response linear. In fact, that response is actually useful, and that's why the gamma correction curves on a computer typically undo some of the non-linear response. For example, Macs used to bring the gamma curve down to 1.8 (from 2.5 for a CRT) but not all the way down to 1.0. As mentioned above, if we really had linear response between our RGB colors and luminance, we would need more than 8 bits per channel.

Consistent Color

If we are going to develop a simulator that renders using OpenGL onto a PC or Mac, we're going to work in a color space, probably with 24-bit color most of the time. (We might use higher bit depths in advanced rendering modes, but we can't afford to store our entire texture set at a higher bit depth. Heck, it's expensive to even turn off texture compression!)

So the important thing is that color is maintained consistently through the authoring pipeline to the end user. In practice that means:
  1. Artists need to have their monitors correctly calibrated. If the artist's monitor isn't showing enough red, the artist paints everything to be too red, and the final results look Mars-tacular. So the artist needs good "monitoring."

  2. The color space needs to be tracked through the entire production process. Whether this happens by simply declaring a global color space for the entire pipeline or by tagging assets, we need to make sure that red keeps meaning the same red, or that if we change it, we record that we've changed it and correctly adjust the image.

    When working with PNG, we try to record the gamma curve on the PNG file. We encourage our artists to work entirely in sRGB.

  3. The graphics engine needs to not trash the color on read-in. Similar to the pipeline, we need to either not mess up the colors, or track any change in color space. In X-Plane's case, X-Plane will try to run in approximately the color space of the end user's machine, so X-Plane's texture loader will change the color space of textures at load time.

  4. The graphics engine needs to cope with a user whose display is atypical (or not). Finally, if the user's display is not in the same color space as the production pipeline, the engine needs to adjust its output. (Or not adjust its output and expect the user to fix it locally.) In X-Plane's case we do this by using the user's machine's gamma curve and correcting images at load time.

    This is not the best solution for color accuracy; it would be better to keep the images in their saved color space and convert color in-shader (where we have floating point instead of 8-bits per channel). Historically X-Plane does the cheap conversion to save fill rate. We may some day change this.

(Note: pre-converting color is particularly a poor idea for DXT-compressed textures, because you can adjust the pair of 565 key colors in each block but you can't adjust the weightings, which are clamped at fixed mix-points based on the DXT specification. So the mid-tone colors for each block will be distorted by trying to color correct a compressed image.)
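For what it's worth, the load-time correction in step four boils down to re-encoding each channel from the authoring gamma to the user's gamma. A sketch of the general idea (this is not X-Plane's actual loader, and the gamma values are placeholders):

#include <math.h>

// Hypothetical load-time correction: re-encode one 8-bit channel so it produces
// about the same luminance on a display with 'user_gamma' as it did on a
// display with 'authored_gamma'.
unsigned char correct_channel(unsigned char v, float authored_gamma, float user_gamma)
{
    float x = v / 255.0f;
    float y = powf(x, authored_gamma / user_gamma);
    return (unsigned char)(y * 255.0f + 0.5f);
}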

If all of these steps are taken, the end user should see textures just as the art team authored them. Before we had step four in X-Plane, PC users used to report that X-Plane was "too dark" because their monitor's color response wasn't the same as our art team's (who were using Macs). Now they report that X-Plane is too dark because it's dark. :-)

In my next post I'll discuss the relationship between non-linear color spaces (that work well in 24-bit color) and 3-d lighting models.

Thursday, November 11, 2010

The Very Best Seventies Technology

I do not like Lisp. Actually, that's not quite correct. I despise Lisp the way Red Sox fans despise the Yankees and Glenn Beck despises fiat money. (BTW there has never been a better time to buy gold. Srsly.) Lisp is not a computer language; it is a cult that warps the minds of otherwise capable programmers and twists their very notion of what a program is.*

So it is in this context that I began a conscious campaign to reduce the number of parentheses in my C++. I know, some of you may be saying "say what you mean, understand what you say" and all of that feel-good mumbo jumbo. Heck, my own brother told me not to minimize parens. But I can't have my carefully crafted C++ looking like Lisp. It just won't do.

So slowly I started to pay attention to operator precedence, to see when I didn't actually need all of those "safety" parens. And here's what I found: 95% of the time, the C operator precedence makes the easy and obvious expression the default. I was actually surprised by this, because on the face of it the order looks pretty arbitrary.

If you squint though, you'll see a few useful groups:
  • Unary operators before binary operators.
  • Math before comparison.
  • Comparison before anything if-like (e.g. && which is more like control flow than an operator).
  • Assignment at the bottom.
There's just one rub: comparison is higher precedence than bit-wise binary operators, which is to say:
if( value & mask == flag)
doesn't do what you want. You have to write the more annoying:
if((value & mask) == flag)
So what went wrong? It turns out there's a reason!

Approximately 5,318 years ago when compilers were made out of yarn and a byte only had five bits, C was being built within the context of B and BCPL. If you thought C was cryptic and obtuse, you should see B and BCPL. B is like C if you removed anything that might tell you what the hell is actually going on, and BCPL looks like you took C, put it in a blender with about 3 or 4 other languages, and played "will it blend". (Since BCPL is "no longer in common use", apparently the answer is no. But I guess they couldn't have thought it looked like a C blend at the time, as C hadn't been invented.)

Anyway, in B and BCPL, & and | had a sort of magic property: inside an if statement they used lazy evaluation (like && and || in C/C++) - they wouldn't even bother with the second operand if the first one already decided the result. So you could write things like this: if(ptr & ptr->value) safely. But you could also write flag = ptr & 1; to extract the low bit.

In a rare moment of preferring sanity over voodoo, Dennis Ritchie chose to split & and | into two operators: | and & would be bit-wise and always evaluate both operands, while && and || would work logically and short-circuit. But since they already had piles of code using & as both, they had to keep the precedence of & the same as in B/BCPL (that is, low precedence like &&) or go back and add parens to all existing code.

So while & could be higher precedence, it's not for historical reasons. But have patience; we've only had to live with this for 38 years. I am sure that in another 40 or 50 years we'll clean things up a bit.

* A program is a huge mess of confusing punctuation. Something clean and elegant like this: for(;P("\n"),R=;P("|"))for(e=C;e=P("_"+(*u++/ 8)%2))P("|"+(*u/4)%2);

Wednesday, November 10, 2010

Finding Mom and Dad

I was looking to use generic programming to clean up some otherwise picky code today, but I fear I can't find a way to do it that doesn't cost me storage. (Well, that's not 100% true, but I digress.)

There may be cases where I have an object O with sub-parts S1, S2, etc. where the implementation of S1/S2 could be made more efficient if part of their data could be pushed out to O. An example would be an object containing several intrusive linked lists. The linked lists can be made faster by maintaining a stack of unused nodes, like this:
struct node {
    node * next;
    ...
};
struct O {
    node * S1_list_head;
    node * S2_list_head;
    node * free_stack_top;
};
Our object contains two lists, but only one common stack pool.

We can use templates to safely hide away the maintenance of this code, sort of, e.g.
template <typename node>
void pop_front(node *& head_list, node *& free_list)
{
    node * k = head_list;
    head_list = head_list->next;
    k->next = free_list;
    free_list = k;
}
There's just one problem: if we call pop_front and reverse the argument order for our head and free list, we are going to create a world of hurt. What we really want is something like S1_head_list.pop_front();

What I can't seem to find is a way to turn S1 and S2 into objects that have knowledge of their contained parts. In the above case, we would have to template S1 and S2 based on their byte offset from themselves to the location of their common section in the parent. That's something the compiler knows, but won't tell us, and we don't want to figure out by hand.

The best real alternative I suppose would be to wrap the templated list function in something inside O and then make the lists private, limiting the number of cases where a usage error of the template functions can occur.
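Something like this variant of O, reusing the node/pop_front definitions above (the member function names are just for illustration):

struct O {
    void pop_front_S1() { pop_front(S1_list_head, free_stack_top); }
    void pop_front_S2() { pop_front(S2_list_head, free_stack_top); }
private:
    node * S1_list_head;
    node * S2_list_head;
    node * free_stack_top;
};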

The traditional work-around for this would be to include a pointer to O inside S1 and S2. This is clean and fool-proof, but also increases the storage requirements of S1 and S2; if we are storage sensitive to O, this isn't acceptable.

MediaWiki and ModSecurity

We were seeing 500 Internal Server Errors:
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, webmaster@xsquawkbox.net and inform them of the time the error occurred, and anything you might have done that may have caused the error.
More information about this error may be available in the server error log.
My host finally figured out what was going wrong. This was in /www/logs/error_log:
Wed Nov 10 12:48:15 2010] [error] [client 71.248.161.106] ModSecurity: Access denied with code 500 (phase 2). Pattern match "(insert[[:space:]]+into.+values|select.*from.+[a-z|A-Z|0-9]|select.+from|bulk[[:space:]]+insert|union.+select|convert.+\\(.*from)" at ARGS:wpTextbox1. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "355"] [id "300016"] [rev "2"] [msg "Generic SQL injection protection"] [severity "CRITICAL"] [hostname "www.xsquawkbox.net"] [uri "/xpsdk/mediawiki/index.php"] [unique_id "TNra30PhuxAAAE8SUkMAAAAQ"]
Whoa. What is that? The server has ModSecurity installed, including a bunch of rules (as defined by regular expressions) designed to reject, um, bad stuff. The rule seems to come from here and MediaWiki isn't the only program that it can hose.

If you pull apart the regular expression, you can see how things go wrong. Loosely speaking the rule matches text in this form:
insert ___ into ___ values | select ___ from ___ | select ___ from | bulk ___ insert | union ___ select | convert ___ ( ___ from
where the blanks can be anything and the pipe separates the alternatives. So...insert, into, values, from, form, insert, convert. Those words appear in that sequence of comments in my OpenAL sample. And frankly, it's not a very remarkable sequence, hence the match.

So I thought the problem was long posts, but it wasn't length per se: the longer the post, the more likely that a particular sequence of words will show up.

From what I can tell, white-listing URLs from the rule is the "standard" fix.

Monday, November 08, 2010

OpenAL on Three Platforms

OpenAL is a cross-platform 3-d sound API. It is not my favorite sound API, but it is cross-platform, which is pretty handy if you work on a cross-platform game. Keeping client code cross-platform with OpenAL is trivial (as long as you don't depend on particular vendor extensions) but actually getting an OpenAL runtime is a little bit trickier. This post describes a way to get OpenAL on three platforms without too much user hassle.

OS X

On OS X things are pretty easy: OpenAL ships with OS X as a framework dating back to, well, long enough that you don't have to worry about it. Link against the framework and go home happy.

Well, maybe you do have to worry. OpenAL started shipping with OS X in 10.4. If you need to support 10.3.9, weak link against the framework and check an OpenAL symbol against null to handle a missing installation at run-time.

Linux

On Linux, OpenAL is typically in a shared object, e.g. libopenal.so. The problem is that the major version number of the .so changed from 0 to 1 when the reference implementation was replaced with OpenAL Soft. Since we were linking against libopenal.so.0, this broke X-Plane.

My first approach was to yell and complain in the general direction of the Linux community, but this didn't actually fix the problem. Instead, I wrote a wrapper around OpenAL, so that we could resolve function pointers from libraries opened with dlopen. Then I set X-Plane up to first try libopenal.so.1 and then libopenal.so.0.
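The wrapper itself is nothing fancy; here's a sketch of the load-and-resolve step (the pointer names are mine, and a real wrapper resolves every AL entry point it uses, not just one):

#include <dlfcn.h>

typedef void (*alSourcePlay_f)(unsigned int source);
static alSourcePlay_f alSourcePlay_ptr = 0;

bool init_openal()
{
    // Try the OpenAL Soft .so first, then fall back to the old reference .so.
    void * lib = dlopen("libopenal.so.1", RTLD_NOW | RTLD_LOCAL);
    if (!lib)
        lib = dlopen("libopenal.so.0", RTLD_NOW | RTLD_LOCAL);
    if (!lib)
        return false;

    // Resolve each entry point through the wrapper.
    alSourcePlay_ptr = (alSourcePlay_f) dlsym(lib, "alSourcePlay");
    return alSourcePlay_ptr != 0;
}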

(Why did the .so number change? The argument is that since the .0 version contained undocumented functions, technically the removal of those undocumented functions represents an ABI breakage. I don't quite buy this, as it punishes apps that follow the OpenAL spec to protect apps that didn't play by the rules. But again, my complaining has not changed the .so naming conventions.)

Windows

The Windows world is a bit more complicated because there are two separate things to consider:
  • The implementation of OpenAL (e.g. who provided openal32.dll).
  • The renderer (e.g. which code is actually producing audio).
Basically Creative Labs wanted to create an ecosystem like OpenGL where users would have drivers installed on their machine matching specialized hardware. So the Creative implementation of openal32.dll searches for one or more renderers and can pass through OpenAL commands to any one of them. The standard OpenAL "redistributable" that Creative provides contains both this wrapper and a software-only renderer on top of DirectSound (the "generic software" renderer).

OpenAL Soft makes things interesting: you can install OpenAL Soft into your system folder and it becomes yet another renderer. Or you can use it instead of any of the Creative components, in which case you get OpenAL Soft and no possible extra renderers.

Now there's one other issue: what if there is no OpenAL runtime on the user's machine? DirectSound is pretty widely available, but OpenAL is not.

Here we take advantage of our DLL wrapper from the Linux case above: we package OpenAL Soft with the app as a DLL (it's LGPL). We first try to open openal32.dll in the system folder (the official way), but if that fails, we fall back and open our own copy of OpenAL Soft. Now we have sound everywhere and hardware acceleration if it's available.
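The Windows side of the wrapper is the same idea with LoadLibrary; a sketch (the bundled DLL name here is hypothetical):

#include <windows.h>

typedef void (*alSourcePlay_f)(unsigned int source);
static alSourcePlay_f alSourcePlay_win_ptr = 0;

bool init_openal_win()
{
    // The official path: the router/driver installed in the system folder...
    HMODULE lib = LoadLibraryA("openal32.dll");
    // ...falling back to the OpenAL Soft DLL we ship with the app.
    if (!lib)
        lib = LoadLibraryA("openal_soft_bundled.dll");   // hypothetical name
    if (!lib)
        return false;

    alSourcePlay_win_ptr = (alSourcePlay_f) GetProcAddress(lib, "alSourcePlay");
    return alSourcePlay_win_ptr != 0;
}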

One final note: in order to safely support third party windows renderers like Rapture3D, we need to give the user a way to pick among multiple devices, rather than always opening the default device (which is standard practice on Mac/Linux). This can be done with some settings UI or some heuristic to pick renderers.

Friday, October 08, 2010

Why GPU Sliced Shadows Fail For Clouds

I have discovered through experimentation that NVidia's technique for self-shadowing particle volumes (found here) doesn't work well for a flight simulator cloud system. When reading a white paper, it can be hard to judge the appropriateness of an algorithm for a particular application; here's what went wrong in our case.

The Basic Algorithm

The basic algorithm is something like this:
  1. Sort the particles along a single axis that works for both the light source and the viewer (roughly the half-angle between the light and view directions). (This can require rendering front-to-back relative to the viewer at times.)
  2. Along this direction, slice the particles up. For each slice, plot first, then update our shadows.
  3. Composite the finished system to screen (necessary if we are going front-to-back).
The algorithm produces nice soft self-shadowing because the shadow texture is being incrementally updated as we move through the slices.
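In pseudo-C++ the per-slice loop looks something like this. Every call below is a hypothetical stand-in for the real sorting, render-target, and draw work; only the structure (steps 1-3 above) is the point:

struct ParticleSlice;                        // hypothetical
void sort_particles_along_slicing_axis();    // step 1
ParticleSlice * get_slice(int i);            // hypothetical
void bind_eye_buffer();
void bind_shadow_buffer();
void draw_slice_lit(ParticleSlice *);        // reads the shadow texture built so far
void draw_slice_shadow(ParticleSlice *);     // adds this slice's opacity to the shadow texture
void composite_eye_buffer_to_screen();       // step 3

void draw_sliced_cloud(int num_slices)
{
    sort_particles_along_slicing_axis();
    for (int i = 0; i < num_slices; ++i)     // step 2: plot, then update shadows
    {
        ParticleSlice * s = get_slice(i);
        bind_eye_buffer();     draw_slice_lit(s);
        bind_shadow_buffer();  draw_slice_shadow(s);
    }
    composite_eye_buffer_to_screen();
}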

The algorithm does work well; for a test case with a cloud built to meet the algorithm's requirements, the shadows were soft, real-time, and quite plausible.

Performance Bottlenecks

The algorithm has two basic performance bottlenecks:
  • Like all over-drawn particle system algorithms, it is fill rate limited if we overlap too many particles.
  • Slicing requires finishing rasterization to a texture and then using the texture, so the algorithm is bound by the number of slices. (The slicing can affect both time spent in the driver rebuilding the pipeline, including costs of changing the render target, and it can stall depending on how smart your driver is about requiring pending rasterization to complete.)
The paper points both of these out and notes that the number of slices may have to be traded for performance.

Overdraw and Alpha

The algorithm is a little bit mismatched to a flight simulator cloud system because a flight simulator cloud system typically uses a smaller number of more opaque cloud particles to avoid fill-rate issues. This causes problems because the algorithm doesn't naturally limit self-shadowing between nearby particles; the only thing keeping the shadow light when two particles are close together is that each individual particle contributes very little opacity.

So the first problem in general use is that the quality of the shadows fights with the optimization of relatively opaque particles. As soon as we make fewer, smaller, more opaque particles (which can be coped with via texturing) the quality of the shadows becomes quite poor.

Slicing and Bucketing

The second problem is that for a general large-scale particle field we need some kind of bucketing, and this fights with slicing. We want to break our particles into a bucket grid for two reasons:
  • It gives us a way to rapidly cull a lot of particles.
  • The bucket grid has a traversal order that is back to front, so we only need to Z-sort within a bucket, saving a lot of sorting time.
The problem is this: we don't know the relationship spatially between slices of different buckets, so we need to slice within a bucket, but do this for each bucket on screen. So if we have 12 buckets on screen, we have 12x the number of slices.

Slices are really quite expensive due to the GPU setup overhead, and even a small number of buckets means that we can't afford enough slices. NVidia recommends 32-128 slices, but with buckets, you'll be lucky to get 8 slices per bucket.

Low Slice Count = Ugly

It goes without saying that having a small number of slices is going to produce less correct shadows. But there is another, more serious problem: as you rotate the camera, the slicing plane changes. Nearby particles that are in the same plane will not shadow each other, but when/how this happens is a function of how wide the slicing plane is and which way it goes.

What this means is: as we rotate the camera, some particles will suddenly stop shadowing each other as the slicing planes rotate, causing noticeable popping artifacts.

The really bad artifact comes when we go from having the sun slightly facing to us to slightly facing away from us. At that point the algorithm will switch between back-to-front and front-to-back rendering, and the slicing plane will jump by 90 degrees almost instantly. This produces a huge number of artifacts when the number of slices is small.

Summary

The algorithm fails when:
  • We have mostly opaque particles and
  • We can't afford enough slices and
  • There are external constraints (like culling) artificially "wasting" slices.
Unfortunately, that is us...so...on to other techniques.

Thursday, October 07, 2010

Alpha Blending, Let's Try Again

A while ago I posted this convoluted mess of recipes for blending back to front and front to back. I've had some time to revisit the code, and the actual formulas are simpler than I realized and more consistent; they also don't require split blending functions for the back to front composited case, which is nice if you want to run on, well, dinosaur hardware (since pretty much anything you can find has split blending functions from the Radeon 8500 on).

Premultiplied Alpha

The goal is to composite several translucent textures together, and then composite them over our scene as if the whole scene had been drawn in order. In order to make this work, we want to use premultiplied alpha - that is, textures where the RGB color has already been made 'darker' if the alpha channel is not 1.0. In this scheme our blend function can be (1.0, 1.0 - SA) instead of the normal (SA, 1.0-SA) because the source pixel is already multiplied by SA. That would be the premultiplication.
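In GL terms that's just this (a minimal sketch):

#include <GL/gl.h>

void set_back_to_front_premultiplied_blending()
{
    // result = src * 1.0 + dst * (1.0 - src_alpha)
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
}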

Why is premultiplication a good idea? We have to solve the problem of "what is under translucent", and premultiplication does that. In a premultiplied texture, the RGB channel becomes more black as it becomes more transparent, and thus "nothing" has a valid color representation (black). In a traditional texture, there is color behind transparent, and that can cause sampling artifacts.

So our goal is to composite a premultiplied texture. That means that the "clear" will be 0,0,0,0 (black, transparent). Note that while the color is black (meaning nothing to add color-wise) we still need that alpha channel to be 0 (transparent) too to tell us that the background won't be occluded.

Fixing Back to Front

If you have ever blended together a bunch of geometry (back to front) and then composited the result on top of something else, you know that the alpha channel for that back-to-front geometry is going to be pretty screwed up. To see the problem, imagine blending a really light (10% alpha) screen over an already opaque scene. That light screen will (by a "strength" of 10%) move the alpha channel away from opacity and toward translucency. The problem is that the alpha blends itself, and we don't want that.

It turns out that pre-multiplied alpha can fix this. We set our blending equation to (1.0, 1.0-SA) and we pre-multiply our RGB. Our alpha will now be the old alpha (lightened by the amount the new alpha is "covering it") plus the new alpha, but not lightened.

To take the case of a 10% screen over an opaque scene, the alpha will be 0.1 * 1.0 + 1.0 * (1.0 - 0.1), which gives us...1.0, which is exactly right: blending over an opaque object doesn't make it translucent.

Front to Back

For the front to back case, we still want to use pre-multiplied alpha, but we set our blend factors to (1.0-DA, 1.0). With the back to front case in "pre-multiplied" form, this should look very symmetric. In fact, all we're doing is changing which one is the "master" (whose alpha cuts down the other) and which is not.

What effectively happens is:
  • The less alpha is in the buffer already, the more you get to draw (hence 1.0-DA as a factor).
  • The buffer is never reduced in color (which makes sense, since you can't darken something by drawing behind it).
  • The amount of alpha opacity you leave behind/add-in is also reduced by what is already there (you matter less if you are behind something translucent).
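Again in GL terms (a sketch; the off-screen buffer needs a destination alpha channel and, as above, should be cleared to 0,0,0,0):

void set_front_to_back_premultiplied_blending()
{
    // result = src * (1.0 - dst_alpha) + dst * 1.0
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE_MINUS_DST_ALPHA, GL_ONE);
}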

Wednesday, October 06, 2010

Premultiplication: Pros and Cons

I realized today that premultiplied alpha could fix a nasty artifact that we sometimes get in X-Plane: "tree ring".*

The bug is this: imagine you have two texels in your texture. The left one is transparent, and the right one is opaque green (a tree). What is the RGB "behind" the transparent one? Let's call it junk.

When this texture is sampled with linear filtering, the graphics card will do the wrong thing: it will blend the two texels by channel to come up with a texel sample that is a mix of green + junk in the RGB channel and a translucent alpha channel. Thus at the edges of our alpha-blended tree, we will see a 'ring' of junk leaking into the texture.

The traditional work-around (and the one we use for X-Plane) is to ensure that the RGB behind the transparent parts of the texture contains something valid that we wouldn't mind seeing, e.g. "more green". This is not an ideal work-around because Photoshop will put white in this space when alpha reaches 0%, so most artists will have to manually fix this problem over and over (and it's not an easy problem to see since the erroneous color is behind a 0% alpha pixel).

If we used pre-multiplied alpha, this would not be a problem. With premultiplied alpha, the RGB pixels are already multiplied by the alpha channel; thus the transparent pixel is by definition black (0% alpha * any RGB = 0,0,0 = black). Thus when we blend green and black we get "darker green", which is the appropriate pre-multiplied color for a linear sampling at the edge of our tree. Simply put, premultiplying puts the alpha multiply before linear interpolation, which is what we want.
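Premultiplying is a one-liner per texel at load time. A sketch for 8-bit RGBA data (the names are mine):

// Multiply RGB by alpha in place; 'pixels' is interleaved RGBA, 'count' is the pixel count.
void premultiply_rgba(unsigned char * pixels, unsigned long count)
{
    for (unsigned long i = 0; i < count; ++i, pixels += 4)
    {
        unsigned int a = pixels[3];
        pixels[0] = (unsigned char)((pixels[0] * a + 127) / 255);
        pixels[1] = (unsigned char)((pixels[1] * a + 127) / 255);
        pixels[2] = (unsigned char)((pixels[2] * a + 127) / 255);
    }
}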

Compression?

I can think of a possible reason to not use pre-multiplied alpha in production art assets: texture compression. If I have a solid green tree with an alpha channel, my texture compressor uses all of its "color bits" to get that green color right. But if I premultiply, those color bits are now storing both the color and the effect of alpha (the darkening). I may get some color distortion on my tree because the compressor is trying to get the pre-multiplied alpha right.

In other words, a non-premultiplied texture may compress better. Ideally I'd like my compressor to be alpha-aware, that is, optimize the color under the opaque part at the expense of what is under the transparent part.

The Rest Of the Story

Obviously we're not going to change X-Plane to premultiplication given so many art assets out there. But there is more to the story too.

The * up there is that there is a second, significantly worse cause of "rings" on trees: z-buffer artifacts. The z-buffer doesn't handle translucency very well (and by that I mean it doesn't handle it at all). If our trees contain translucent edges due to linear filtering, we get Z put down over the translucent parts, and that cuts out any 3-d building or additional trees behind them. The result is "blue rings" where the sky shows through what should be a forest.

The solution is the one we use in practice: we turn off blending entirely and simply test the texels - they are in or out. We still use linear filtering though, so that the alpha edge of our tree isn't square and jagged, so we would see a ring if we have bogus color underneath the transparent parts of the trees. Since in practice we almost always ship DXT compressed textures, the compression argument against pre-multiplication holds.

Monday, September 27, 2010

When Good Floating Point Goes Bad?

We have a handful of linear algebra/geometry routines in X-Plane that simplify writing geometric tests. At heart, they almost all turn into dot products. So: when can we not trust them?
  • For real floating point data, dot products may not go to zero; the zero dot product is at the heart of "is this a left or right turn" and "which side of a line am I on". Rounding errors may force points to fall to one side of a line or another.
  • Even more weirdly, there's no guarantee that points will consistently fall on one side of a line or another; the rounding errors need to be treated as effectively random. A point that is moving across a line may give 'jittery' results as the point slowly crosses the line.
  • Order matters. Given the same theoretical line, defining it by swapping the input points (e.g. line AB instead of BA) may have unpredictable effects on side-of tests for very small distances.
  • Finally, there is no function more hellish than intersection. The more parallel the lines, the more completely insane the intersection results become. Serious paranoia is advised when dealing with intersections, because the routines can give you a positive result ("yes there was an intersection") with lunatic results ("and the intersection is on Mars"). I usually cope with this by using a dot product test to case out the near-collinear case, which is then handled in a different algorithm that doesn't require clean intersections.