Friday, January 28, 2011

Is COLLADA a Win?

I'm always astounded to discover that anyone, like, reads this blog. But on the off chance that anyone with serious tool-chain/content-pipeline experience is reading this...

Is COLLADA a win?

Like most games, X-Plane has the problem of needing to get content from commercial 3-d modeling programs into our proprietary engine format, with X-Plane-specific data attached. We need to give our artists a way to attach such data (e.g. billboard properties, hard-surface attributes, etc.) natively in their 3-d program and have that data make it into X-Plane.

In fact, the problem is a bit worse for X-Plane because we are effectively an open platform for a whole range of third-party developers; thus the artists creating content are not all on any one 3-d program.

There are three ways I can see to solve this problem:
  1. Write a lot of export scripts. This is the path we're on now. We have full-featured scripts for AC3D and Blender, and there are a lot of other scripts out there for other modelers.

    The obvious problem with this approach is scalability. Every new modeling feature in X-Plane has to be separately built into every single exporter; the result is invariably inconsistent support for the "full" file format, due to high development costs. (I maintain one of the exporters, the AC3D one, myself, and even I am not up-to-date with my own modeling formats.)

  2. Create a common, simple, proprietary interchange format. One of the problems with writing X-Plane models is that you have to optimize them to maximize sim performance. The idea behind creating a simpler format that feeds into a processing tool would be to lower the cost of writing the modeling-program-specific export scripts. Exporters would simply dump out a stream of "stuff" and the post-processing tool would clean it up.

    We already do this with DSF, our scenery file format. DSF is a horribly complex bit-packed format, but a tool (DSF2Text) will convert a simple text stream to the final binary using LR's libraries to do the compression and encoding. While the DSF code itself is open source, a text file represents an easier API for a wide variety of languages, including scripting languages.

  3. Use an off-the-shelf interchange format, hence the question about COLLADA. In theory, the win would be that there would be existing export scripts for the interchange format, greatly reducing the time to implement support for a particular modeler. A common COLLADA -> OBJ converter would then do the final encode once for all programs.

    In practice, the devil would be in the details: COLLADA is a very general, rich format; do all modelers support exporting to all COLLADA idioms? Would there be appropriate 1:1 mappings from the 3-d program to X-Plane?

    My concern is that it's bad enough trying to find ways to represent X-Plane concepts in a 3-d program; in order to use off-the-shelf COLLADA exporters, those concepts would have to live in the 3-d program in a way that the existing export code can carry through.

Anyway, if you have experience (good or bad) with using COLLADA as an intermediate tool-chain step, I'd love to hear about it; it strikes me as an option for gaining leverage over tool-chain costs whose real value would be entirely determined by the details of implementation.

Thursday, January 20, 2011

Derivatives III: I Ran Out of Rez

One more note on derivatives in GLSL shaders: derivatives can run into precision problems that the underlying expressions don't have.

Recall that derivatives are typically calculated by taking the actual difference of two values in two nearby pixels. This means that the derivative is subject to the precision limits between two pixels.

Consider a UV map spread over a really huge distance, say, 5 km in-game. The texture is 1024 x 1024 and therefore each texel is about 5 m in size.

What happens if we zoom way the heck in so that one texel (5 m) is covering nearly all of the screen?

We need about 10 bits of precision just to select our texel; any remaining precision can be used to take an interpolated position between texels for filtering. If our monitor res is about 1024x768, giving each screen pixel its own spot within that one texel takes another 10 bits, so we're using 20 bits of precision in our UV map - and with the roughly 23-bit mantissa of a 32-bit float, we have only three bits left. If we zoom in any more, we may reach a point where our interpolated UV map doesn't have enough precision to provide a distinct UV position for each pixel.

(In other words, if we have less than 10 bits of precision left between the left and right side of the screen, then some adjacent pixels will have the same UV coordinates!)

Now this generally doesn't matter for texture sampling. We're sampling 1024 unique mixes between two adjacent texels, and we can only show 256 shades on the screen - in practice, if two pixels have the same UV coordinates, it doesn't matter, because the amount of RGB change per pixel isn't perceivable anyway.

But for our derivatives, it's a different story: some pairs of pixels will have a zero derivative, and some will have a non-zero derivative! Even if we don't run out of res, we're very low on res, and our derivatives may be 'chunky' or otherwise screwed up. If we need to reconstruct basis vectors from our derivatives, those basis vectors are going to be a train wreck.
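One quick way to see this on your own hardware is to visualize the derivative itself rather than using it. This is just a throwaway diagnostic, not X-Plane code; my_uv and the scale factor are placeholders for whatever interpolated coordinate you suspect is starving for precision.

vec2 d = abs(dFdx(my_uv)) * 100000.0;   // pick a scale factor that lands the values in a visible 0..1 range
gl_FragColor = vec4(d, 0.0, 1.0);       // smooth output = healthy derivative; flat or blocky bands = quantized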

The only solution I have found to this problem has been to replace the GLSL built-in derivatives with an algorithmically computed derivative. Fortunately, the only cases where we have ridiculous UV mapping are ones where the texture coordinates are generated by formula, and thus a similar formula can be used to create the derivatives.
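The details depend on the formula. As one hedged illustration (not X-Plane's actual shader interface - world_pos, eye_pos, and texture_size_meters are hypothetical varyings/uniforms), suppose the huge UV comes from a planar projection of world position. The derivative can then be taken from the same formula applied to a better-conditioned input: the eye-space position differs from the world-space position only by a per-frame constant, so its screen-space derivatives are identical, but its values are small near the camera and their differences keep plenty of precision.

vec2 uv = world_pos.xz / texture_size_meters;       // huge values, tiny per-pixel change
vec2 dx = dFdx(eye_pos.xz) / texture_size_meters;   // same derivative, computed from small,
vec2 dy = dFdy(eye_pos.xz) / texture_size_meters;   // well-conditioned values
gl_FragColor = texture2DGrad(my_sampler, uv, dx, dy);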

Derivatives II: Conditional Texture Fetches

In my previous post I described how OpenGL often calculates derivatives by differencing nearby pixels in a block. This can cause problems if our UV map has discontinuities.

Even weirder things happen if we use texture fetches inside an if statement. For example, this will produce some very weird results:
if(uv.x >= 0.0)
    gl_FragColor = texture2D(my_sampler, uv);
else
    gl_FragColor = vec4(0.0);
You might think that if you use a texture that is a ramp from black on the left to white on the right, you'd get a ramp of texture and then the black end of the texture would seamlessly transition into the hard-coded black from the else statement.

If your GPU and GLSL compiler are in a forgiving mood, this may work; if they are not, you may get a set of mid-gray artifact pixels at the transition point. The problem is this bit of fine print (from the GLSL 1.20.8 spec, section 8.8):
The method may assume that the function evaluated is continuous. Therefore derivatives within the body of a non-uniform conditional are undefined.
You can't take a derivative inside an if statement. (But since the results are undefined, the GPU can make your life more difficult by sometimes giving you useful results anyway.) Recall from my past post that a texture2D fetch is like a texture2DGrad call with the derivatives of the texture coordinate expression. Since the derivative functions are invalid inside if statements, the derivatives passed to texture2D may be junk. In other words, this is bad:
if(stuff)
    gl_FragColor = texture2D(tex, uv);  // implicitly texture2DGrad(tex, uv, dFdx(uv), dFdy(uv)) - inside the branch
but this is okay:
vec2 dx = dFdx(uv);
vec2 dy = dFdy(uv);
if(stuff)
    gl_FragColor = texture2DGrad(tex, uv, dx, dy);
In other words, you have to use texture2DGrad to move the derivative calculation out of the if statement.

Why Can't the GPU Get This Right (Except When It Does)

Artifacts due to incorrect derivative calculations inside incoherent texture fetches (that is, some pixels take the texture fetch, nearby ones don't, the derivative is hosed, and therefore our texture fetch is hosed) are definitely sensitive to the hardware, GLSL compiler, and driver; I ended up switching between my Radeon and GeForce about 30 times before I wrapped my head around this issue.

This doesn't surprise me. The spec allows undefined behavior. Recall that the derivative is based on differencing the value of an expression across a 2x2 pixel group. To understand why conditionals and derivatives don't mix, we have to understand how modern GPUs handle conditional rasterization.

(What follows is based on my reading some docs on R700 assembly; it is best to think of it as a model for how GPUs can work, more or less; I am sure there are lots of subtleties to the R700 that I don't understand.)

The GPU rasterizes pixels in 2x2 blocks, with the same shader executed on four execution units in lock-step. That is, each pixel has its own intermediate registers and state, but all four pixels run the same instructions.

When the shader hits an if statement, the hardware sets a mask for each pixel indicating which pixels are "in" the if statement and which are not. The body of the if statement is run on all four execution units, but the results for the pixels that are not "in" the if statement are thrown out because of the mask.

Only if all four pixels take the branch the same way can the GPU actually jump over the if statement and save work.
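Using the earlier uv.x >= 0.0 example, a rough mental model (conceptual GLSL, not what any compiler literally emits) looks like this:

// Both sides are executed for every pixel in the 2x2 quad, masked or not...
vec4 then_result = texture2D(my_sampler, uv);   // so the implicit derivatives have four values to difference
vec4 else_result = vec4(0.0);
// ...and a per-pixel "mask" keeps only the side the pixel actually took.
gl_FragColor = mix(else_result, then_result, float(uv.x >= 0.0));

When things flatten out this way you get lucky and the derivatives work; the spec simply doesn't promise that the masked-out pixels' intermediate values are meaningful.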

So what happens if the if statement is being evaluated for some pixels and not others and we take a derivative? The answer is: lord knows! The expression we are calculating may be only partly updated, incorrect, or totally unavailable for some of the pixels.

Branch Coherence

As a side note, the property of the GPU to run the entire shader on all pixels when only some of them are using the if statement is why the GPU manufacturers will tell you that a conditional is only a performance win if it is coherent - that is, if nearby pixels all branch in the same way. This is because when nearby pixels branch in different ways, the GPU must run all code and throw out some of the results.

Derivatives I: Discontinuities and Gradients

The short of it is this: if you see 2x2 pixel artifacts in your shader, you might need texture2DGrad. Now the long version.

How does OpenGL know what mipmap level to use when you sample a texture in your GLSL shader with texture2D? The answer is that this:
texture2D(my_texture,uv);
actually does something like this:
texture2DGrad(my_texture,uv,dFdx(uv),dFdy(uv));
In other words, texture2D takes the derivatives of your input texture coordinates and uses them to decide which mipmap level to access. The larger the derivatives, the lower-resolution the mipmap level. (The actual implementation is more complicated.)

Before continuing, a brief exercise in visualization. Imagine a cube with a single square face visible to us (parallel to the screen). The cube face is textured with a single 256x256 texture. If we zoom the camera so that the cube takes up 256x256 screen pixels, the derivative of the UV map between any two adjacent pixels on screen is about 1/256 in both directions, and we want the highest-resolution mipmap level. If we zoom out so that the cube takes up only 2x2 pixels, the derivative is about 0.5 in both directions - half the texture per pixel - and we want nearly the lowest-resolution mipmap level.
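As a very rough sketch of the selection math (the real hardware is more elaborate, with anisotropy, clamping, and per-vendor details), you can scale the derivatives up to texel units and take a log2:

vec2 dx = dFdx(uv) * 256.0;     // 256.0 = the texture size from the example above
vec2 dy = dFdy(uv) * 256.0;
float lod = log2(max(length(dx), length(dy)));  // about 0 for the full-screen cube, about 7 when it is 2x2 pixels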

Where Do Derivatives Come From?

The GLSL derivative functions are usually implemented by differencing - that is, the GPU takes a block of 2x2 pixels and differences the variable or expression passed to dFdx and dFdy to calculate an 'approximate' derivative. Many GPUs rasterize 2x2 clusters of pixels at a time, with the shader instructions for the four pixels run in lock-step, so the hardware can be set up to efficiently "cross" the four pixels to find our derivatives.

This means that if there is a discontinuity between those pixels, the derivative may be, well, surprising. For example, consider something like this:
vec2 uv = gl_TexCoord[0].st;
if(uv.x > 0.5) uv.y += 0.25;
gl_FragColor = texture2D(my_sampler, uv);
What happens if two of the pixels in our 2x2 block have uv.x > 0.5 and the other two don't? Well, the answer is that uv.y will be 0.25 bigger for some but not all of the pixels, and the derivative of uv.y will be very big! This in turn will cause texture2D to fetch from a much lower-resolution mipmap level than any of the neighboring 2x2 blocks that are "coherent". (Coherent here means all 4 pixels have the same boolean answer to the if conditional.)

One way to think of this is: since the derivatives are found by looking at actual pixels on screen, a discontinuity is seen by the derivative function as a really low-res UV map, and thus a low mipmap level is selected.

Fixing The Derivative

So what can we do? We can provide OpenGL with an expression whose derivative is about the same as our real texture coordinates, but without discontinuities. For example, we can rewrite our above example like this:
vec2 uv = gl_TexCoord[0].st;
if(uv.x > 0.5) uv.y += 0.25;
gl_FragColor = texture2DGrad(my_sampler, uv,dFdx(gl_TexCoord[0].st),dFdy(gl_TexCoord[0].st));
Our actual texture samples come from a discontinuous UV map, but our derivative comes from the original continuous function.

Breaking Continuity

I first ran across this while working on the 'tile' shader for X-Plane 10. The tile shader breaks a texture into a sub-grid of tiles and then randomly swizzles the tiles, like a sliding-number puzzle that someone has scrambled. This hides texture repetition, and because it runs in the shader, it doesn't require additional tessellation of the geometry, saving vertex count.

(Using fragment ops to save vertex count might seem strange, but in this case our base mesh is already heavily cut up based on other criteria; keeping the texture swizzle orthogonal to the mesh means we don't have to subdivide it even further.)

Without texture2DGrad, we would get sets of 2x2 dark pixels at the edges of the tiles. The tiles are induced via some math that includes a floor() function to separate the tile number from the location within the tile. The floor function can introduce discontinuities even without conditional logic, because floor is not a continuous function.
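For what it's worth, here is a minimal sketch of the tile idea - it is not X-Plane's actual shader, and TILE_COUNT and the little sin-based hash are stand-ins for the real swizzle logic:

const float TILE_COUNT = 4.0;                   // hypothetical tile grid size

// Stand-in swizzle: hash an integer tile index to some other tile index.
vec2 pick_tile(vec2 t)
{
    vec2 h = fract(sin(vec2(dot(t, vec2(12.9898, 78.233)),
                            dot(t, vec2(26.651,  37.441)))) * 43758.5453);
    return floor(h * TILE_COUNT);
}

...and in the fragment shader body:

vec2 st       = gl_TexCoord[0].st * TILE_COUNT;
vec2 tile     = floor(st);                      // tile index - discontinuous at tile edges
vec2 inside   = fract(st);                      // position within the tile - also discontinuous
vec2 swizzled = (pick_tile(tile) + inside) / TILE_COUNT;

// Sample with the gradients of the ORIGINAL continuous coordinate, not the swizzled one.
gl_FragColor  = texture2DGrad(my_sampler, swizzled,
                              dFdx(gl_TexCoord[0].st), dFdy(gl_TexCoord[0].st));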

Saturday, January 08, 2011

Stupid CVS Tricks

I finally figured out (thanks to this) how to get CVS to notify us somewhere other than the mail service of the server it's running on. See the link for instructions, but basically you can make a 'users' dictionary file that maps users to custom external email addresses...the file isn't in the default config (which is weird).

Also, note that the CVS watch command can do two things:
  1. It can subscribe you to events (edit, unedit and commit). When you 'cvs watch add' yourself for some or all of these events, you get email (with addresses looked up via the users file). Since watches go to specific subscribed users, the CVS notify file uses a wildcard to route mail to those specific users. (This is different from loginfo, which sends to a list of everyone who cares about any commit to a given module.)
  2. It can force a file to be checked out locked (thus forcing an edit/unedit workflow) using cvs watch on. We only use this to force our Xcode project file to be checked out locked (to prevent the accumulation of lots of trivial project changes).

Tuesday, January 04, 2011

CAS and Reference Counting Revisited

A while ago I suggested that we can't use atomic compare-and-swap (CAS) and reference counting together to update an arbitrary data structure, because we can't atomically dereference the pointer and increment our reference count at the same time. The problem is that there is an instant, while we are 'off the end' of the pointer, when we haven't yet incremented our reference count; an updater has no way to know that throwing out the copy of the data we are about to use is a poor idea.

Here's a possible work-around: use the low bit of the pointer we want to CAS as a "lock" bit and spin. The algorithm goes something like this:
read_begin:
    while(1) {                          // this spins if someone else is mid read_begin
        ret = ptr & ~1                  // the pointer without the lock bit
        if(CAS(&ptr, ret, ret | 1))     // fails (and retries) if someone else holds the bit
            break
    }
    atomic_inc(&ret->ref_count)
    CAS(&ptr, ret | 1, ret)             // clear the lock bit
    return ret

read_end(data):
    if(atomic_dec(&data->ref_count) == 0)
        delete data

update:
    while(1) {
        old = read_begin()              // take a ref count because we are copying from old
        new = copy of old
        if(CAS(&ptr, old, new)) {
            assert(atomic_dec(&old->ref_count) > 0)   // drop the 'baseline' count...
            read_end(old)               // ...and our own read count - two decs total
            break
        } else {
            read_end(old)               // we failed to swap - retry
            delete new
        }
    }
The idea here is that we can enforce spinning for a short time on other readers while we read the pointer to our block of data by always CASing in the low bit. (This assumes that memory is at least 2-byte aligned, which is an acceptable design assumption on pretty much all modern machines.) Thus we are holding a spin lock while we get our reference count registered.

This code assumes that the data we are protecting has a baseline reference count of one - thus once an updater has replaced it, the updater removes this 'baseline' count, and the last reader to stop counting it (and an updater is a reader) releases it.

One reason why I only come back to RCU-style algorithms every few months is that (at least for this one) there's no way to block to ensure that the old copy of the data has been fully released. Knowing that your update is "fully committed" to all thread contexts is an important property that this algorithm does not have.