Friday, May 26, 2006

Logging and Divide and Conquer for Performance

There are only two true debugging techniques: log intermediate results and divide and conquer the buggy code. They work well for performance tuning too.

For performance-tuning X-Plane our primary tool is profiling. Some of this is done by manually logging timing points (read off high-fidelity counters). But more useful is Shark, Apple’s adaptive sampling profiler.
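As a rough illustration of a manually logged timing point, here is a minimal sketch. The scoped_timer class is a hypothetical helper, not X-Plane code, and it assumes a C++11 compiler so that std::chrono::steady_clock can stand in for the platform's high-resolution counter.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical helper: measures the time between its construction and
// destruction and logs it with a label.
class scoped_timer {
public:
    explicit scoped_timer(const char * label)
        : label_(label), start_(std::chrono::steady_clock::now()) {}

    ~scoped_timer()
    {
        std::printf("%s: %.3f ms\n", label_, elapsed_ms());
    }

    // Milliseconds elapsed since the timer was constructed.
    double elapsed_ms() const
    {
        std::chrono::duration<double, std::milli> d =
            std::chrono::steady_clock::now() - start_;
        return d.count();
    }

private:
    const char *                          label_;
    std::chrono::steady_clock::time_point start_;
};
```

Dropping a scoped_timer at the top of a block logs how long that block took. Unlike a sampling profiler, every timing point adds a little overhead of its own, so this is best kept to coarse sections rather than tight loops.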

One of the trickier things about performance-tuning an OpenGL application is that its speed is affected by both CPU and GPU. The nice thing about Shark is that since it samples over time (and not per function call), our framerate doesn’t decrease when we use it. If our framerate decreased, the ratio of CPU to GPU work would change and the profile would be invalid. Also, Shark can sample within a function, which is crucial since we inline very heavily in our tight loops.

Good profiling is critical to performance; we can make almost anything fast but we don’t have time to make everything fast. And what’s slow is rarely what you would think is slow. For example, I just did a profile of 8000 cars to determine whether we can render the full 3-d headlights and taillights at a distance. Surprisingly, the cost of setting up the lights in 3-d is almost nil; ensuring that the car is not unnecessarily drawn turns out to be the performance-critical factor. (Given how many more lights there are than cars, since cars themselves are culled out when far away, the fact that the 3-d math isn’t a hot loop is surprising!)

In this picture you can see a Shark profile of X-Plane where we’re pushing back a lot of items onto a vector that hasn’t been pre-allocated. Thus OS vm functions and reserve() take up 59% of CPU usage and dominate the list of function calls. In a healthy X-Plane, plot_group_layers (our scene-graph iterator) and the various GLEngine calls would dominate.
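To make the cost of a vector that hasn't been pre-allocated concrete, here is a small sketch that counts how often the vector has to grow. The function name count_reallocations is made up for this illustration; the exact number of regrowths without reserve() depends on the STL implementation.

```cpp
#include <cstddef>
#include <vector>

// Pushes n items onto a vector and counts how many times its capacity
// changed, i.e. how many times it had to allocate new storage and copy.
std::size_t count_reallocations(std::size_t n, bool reserve_first)
{
    std::vector<int> items;
    if (reserve_first)
        items.reserve(n);   // one allocation up front

    std::size_t reallocations = 0;
    std::size_t last_capacity = items.capacity();
    for (std::size_t i = 0; i < n; ++i)
    {
        items.push_back(static_cast<int>(i));
        if (items.capacity() != last_capacity)
        {
            ++reallocations;
            last_capacity = items.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set, the loop never reallocates; without it, every regrowth is a trip to the allocator plus a copy of everything pushed so far.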

One problem with profiling is that if you can’t duplicate the exact rendering settings, you can’t safely compare techniques. To determine the cost of a feature, we divide and conquer. For example, to understand what really costs us, the car or the headlight, I can set X-Plane to only draw the cars when the mouse is in the top half of the screen and only draw the headlights when the mouse is on the right side of the screen. This kind of technique lets us see the instantaneous performance change for a feature, giving us a differential under the exact same conditions (same number of cars, same number of cars on screen, same distance away…). This is the ultimate confirmation that a feature costs or doesn’t cost us.
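The mouse-driven switch can be sketched roughly like this. The debug_toggles struct and toggles_from_mouse function are hypothetical names, and the sketch assumes screen coordinates with the origin at the top-left corner.

```cpp
// Hypothetical debug switches derived from the mouse position.
struct debug_toggles {
    bool draw_cars;
    bool draw_headlights;
};

// Assumes (0,0) is the top-left corner of the screen.
debug_toggles toggles_from_mouse(int mouse_x, int mouse_y,
                                 int screen_w, int screen_h)
{
    debug_toggles t;
    t.draw_cars       = mouse_y <  screen_h / 2;   // mouse in the top half
    t.draw_headlights = mouse_x >= screen_w / 2;   // mouse on the right side
    return t;
}
```

The draw loop would check these two flags each frame, so moving the mouse between quadrants switches each feature on and off while everything else in the scene stays identical.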

Tuesday, May 23, 2006

Fun with global constructors

(Note: for the purpose of this discussion, "global" objects means:

int a;
static int b;
class foo {
    static int c;
};
int foo::c;
void func()
{
    static int q;
}

For our discussion, a, b and c are "globals" but "q" is not. While all of these will have static storage allocated for them, a, b and c will be initialized during program startup; q will be initialized the first time func() runs - possibly never! I will have to rant on how the word static has 3 syntactical meanings and at least that many language meanings some other time.)
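A small sketch of the difference for q: the initializer of a function-local static runs the first time the function is called, not at startup. The names g_init_count and expensive_init are invented for this illustration.

```cpp
int g_init_count = 0;   // plain data, zeroed before any code runs

// Stand-in for any initializer that has to run real code.
int expensive_init()
{
    ++g_init_count;
    return 42;
}

int func()
{
    static int q = expensive_init();   // runs on the first call only
    return q;
}
```

Until func() is called, g_init_count stays at zero; after any number of calls it is exactly one.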

The rules for the construction of C++ global objects go something like this:

  • Plain old data (read: int = 0) get initialized before dynamic data (int = some_func(), map). Basically things that can be inited just by splatting their memory are initialized before any code is run.
  • Within a translation unit, dynamic initialization goes in order of the file.
  • Between translation units, dynamic initialization can happen in any order. (Essentially the compiler has no idea which file comes "first".)

That’s enough to make us miserable right there: because the order of static initialization is variable between files, it means that if we have an "API" in a translation unit that requires global data to function, we can’t use it before main() is called because our static initialization code might be running before the API’s. We have no way to control this.

But this is C++ - three rules can’t be everything when it comes to static initialization, right?

  • Dynamic initialization does not have to happen before "main". But it does have to happen before non-initialization code in that translation unit gets called. So going back to our "API" - if we have a global, it will be initialized before the API is used, but it may or may not be initialized before main.
  • Dynamic initialization can be replaced with static initialization (splatting memory) if:
    1. That initialization doesn’t have any side effects on other initialization and
    2. The compiler can figure out what values that dynamic initialization would have produced under some circumstances.
It is worth noting that in this case the compiler can initialize our object to the results of the dynamic initialization or some static value that would be legal too since this is happening before we are required to have an initialized object. Most compilers I have played with tend to fill such objects with zero, but it doesn’t look to me like the spec requires this.

Okay now we’ve got something confusing enough to really do some damage. Not only will C++ call our globals’ constructors in a basically random order between files, but: it may call them in a random order within files by deciding that what we thought was dynamic was really static (poof - that constructor goes to the front of the line), and this may or may not be happening before main is called.

(For what it’s worth, at least CodeWarrior always initializes everything before main - it’s easier for them to make a big linked list of globals and run through it, translation unit by translation unit. And frankly since our global will be built before the translation unit is called, initialization after main is the least of our problems in practice.)

It’s pretty easy to get yourself in trouble with these limitations:

//header
#include <set>
class foo {
public:
    foo();
    ~foo();
    void do_something(); // defined elsewhere
    static void debug_all_foo();
private:
    static std::set<foo *> all;
};
// CPP implementation
std::set<foo *> foo::all; // this is global
foo::foo()
{
    all.insert(this);
}
foo::~foo()
{
    all.erase(this);
}
void foo::debug_all_foo()
{
    for (std::set<foo *>::iterator i = all.begin(); i != all.end(); ++i)
        (*i)->do_something();
}
// Usage - in a separate CPP file
static foo my_obj; // also global

The idea here is very simple: foo objs maintain a global set of their own ptrs - so we can do something to all foo objects if we need. Set was chosen here because on many STL implementations it can’t be zero initialized without the entire world exploding, which is not true of vector. (Vector will however leak memory if you use it while zero-initialized, but I digress.)

The problem is this: what gets constructed first? my_obj or foo::all? The answer is: we cannot know. If my_obj is inited first, foo::all contains, well, I don’t know what, but certainly not the necessary dynamically constructed parts to make a set. Thus my_obj will cause a crash before main or do something else not like what we want. If my_obj is initialized second, we go home happy. Only your C++ compiler knows for sure.

(In the case of vector, the fail case is: the global vector gets zeroed if your compiler is into that thing, then the client code puts an object into the vector, since a zero vector is legitimate in a lot of STL implementations, then the real constructor zeros it out again, leaking memory and "losing" your object mysteriously.)

I just went through this fire drill with X-Plane when making a stats-counter class; the stat object tends to be static to a client’s code so that it is "just there and ready" and some internal book-keeping keeps a global map of them around so we can zero all counters by category. Since static initialization was important, my solution was: use an intrinsically linked list to chain the objects together for tracking. Because the head of the list is just a dumb pointer initialized to zero, it’s guaranteed to be correct before any code runs. Each constructor simply updates the head pointer and we end up with a linked list with no conflicts.
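Here is a minimal sketch of that linked-list idea. It is not the X-Plane class: stat_counter, zero_category and the counter names are invented for illustration, and the destructor that would unlink a counter is left out because these objects are assumed to live for the whole run.

```cpp
#include <cstring>

class stat_counter {
public:
    explicit stat_counter(const char * category)
        : category_(category), count_(0), next_(head_)
    {
        head_ = this;   // push ourselves onto the front of the list
    }

    void add(int n)    { count_ += n; }
    int  value() const { return count_; }

    // Walk the chain and zero every counter in the given category.
    static void zero_category(const char * category)
    {
        for (stat_counter * c = head_; c != 0; c = c->next_)
            if (std::strcmp(c->category_, category) == 0)
                c->count_ = 0;
    }

private:
    const char *   category_;
    int            count_;
    stat_counter * next_;
    static stat_counter * head_;
};

// A plain pointer with a constant initializer: it is zeroed before any
// constructor in any translation unit runs.
stat_counter * stat_counter::head_ = 0;

// Client code: counters that are "just there and ready".
static stat_counter g_cars_drawn("traffic");
static stat_counter g_lights_drawn("traffic");
static stat_counter g_tris_drawn("geometry");
```

The order in which the three counters are constructed only changes their order in the list, which doesn't matter for zeroing by category.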

Generally I can recommend a few techniques to avoid such constructor chaos, but no one technique will fit all:

  • If you have a translation unit that forms an "API", don’t use static objects to initialize your internal state if you depend on an external API. If you can’t avoid this (because for example you have global STL variables and some kind of real initialization) consider breaking the initialization up and doing the initialization later.
  • Dynamically allocate global API-related stuff using operator new, either in an explicit initialization function (called once main is running) or upon first use of the API.
  • If you can avoid using globals in an API implementation (and instead require some kind of "handle") you can push this problem off to client code.
  • Use explicit initialization of sub-systems. It’s simple, debuggable, and you never get into static-constructor trouble.

One comment on that last point: if you build up a table of static constructors to build object factories, you’re going to have to explicitly initialize that table anyway. That’ll have to be another blog entry too.
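As a sketch of the "allocate upon first use" technique from the list above, here is the foo example reworked so that the set is built by operator new the first time anyone asks for it. This is one common variant, not necessarily what X-Plane does; live_count() is added only so the behavior can be observed, and the set is deliberately never deleted.

```cpp
#include <cstddef>
#include <set>

class foo {
public:
    foo()  { all().insert(this); }
    ~foo() { all().erase(this); }

    static std::size_t live_count() { return all().size(); }

private:
    // The set is allocated the first time all() is called, so it exists
    // no matter which global foo happens to be constructed first.
    static std::set<foo *> & all()
    {
        static std::set<foo *> * s = new std::set<foo *>;
        return *s;
    }
};

// Usage - this could live in any CPP file
static foo my_obj;
```

Leaking the set on purpose sidesteps the mirror-image problem at shutdown, where a global foo might otherwise be destroyed after the set it is registered in.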

Thursday, May 18, 2006

Installing Panther over Tiger

I’m blogging this because I’ll never remember it otherwise. Here’s what I had to do to put Panther (OS X 10.3) back onto a disk that had been upgraded to Tiger (OS X 10.4). I did this to set up a coding environment to regress OS-specific problems; I don’t recommend this otherwise.

This was done on a system with two internal HDs, so I could be booted into 10.4 on the main HD while screwing around with the second HD. First I wiped out Tiger as best as I could using sudo rm -r in the terminal. I deleted everything that looked Unixy or OS-ish, including the root level dirs /System /var /tmp /etc /private /bin /sbin /usr and anything else that tempted me. Warning: this is a good way to instantly totally destroy an OS installation.

The trickiest part turned out to be that for some reason my old 10.3 install disk that shipped with the G5 doesn’t appear “blessed” under 10.4. Blessing is basically a note on the disk as to where the boot information lives for Macs. The best way to determine what’s going on is with the aptly named “bless” command utility; the man page explains what it does. bless --info will show you if a volume is not bootable; if it’s not then booting with the “c” key (or any other of the 10 ways to boot from CD-ROM) will fail.

So now we come to Open Firmware. Open Firmware is, well, I don’t know exactly what it is, but for our purposes it’s a command shell before anything is booted where we can monkey around. Before booting into Open Firmware, one thing to check: use pdisk (L command) to list the partitions of all drives - we’ll need to know the partition number of our CD-ROM’s main partition. Strangely it appears that all Mac partitions are really two partitions - a small header and then a real partition. So the CD-ROM partition number is “2”, which we’ll need later but not be able to find from Open Firmware.

To boot into Open Firmware, hold down the command, option, ‘o’ and ‘f’ keys all at once on boot. You should see some kind of command prompt. The money command is:

boot cd:2,\System\Library\CoreServices\BootX

This basically means boot from partition 2 from the device aliased to “cd” using BootX (that’s a unix file path but with weird slashes). BootX basically always lives on that path on modern OS X installations. BTW if that file doesn’t exist on your CD-ROM, it may not be bootable.

Other useful commands:

devalias - lists all the aliases to devices. Finding devices in the tree is harder if there aren’t aliases, but my G5 seems to have a bunch of nice ones.
dev [device] - change the current device in the tree, similar to cd.
pwd - print the current device.
ls - list devices within the current device.
dir - list files. Syntax is something like dir cd:2,\ to list the root dir of the second partition on the device aliased to cd.

Friday, May 12, 2006

It Had To Be That Way

I just found an extremely rare bug in X-Plane caused by an uninitialized variable in a constructor; this code functioned in such a way that a compiler cannot do the code analysis to find the bug, and unfortunately it happens rarely enough that it’s probably in the shipping sim.

To paraphrase the philosophy of C++:

  • A C++ programmer can do anything, no matter how stupid. After all, 0.001% of the time it might be necessary.
  • The fastest performance path must be accessible via C++.
  • Reducing development time by catching dumb mistakes isn’t even remotely a language goal.

Given this, it’s understandable what happened; the language has to allow me to leave junk in my data because it’s faster not to initialize it and sometimes I want to be lazy for speed. Unfortunately it means that catching errors is up to me, and I am human and fallible, especially when I’ve been drinking beer all night.

It got me thinking about whether there could be a language that provides the performance options of C++ but without the “dangerous environment” of C++. Java and C# are managed; I am definitely among the snotty bitflingers who think that for my app garbage collection and managed memory mean unacceptable performance loss. This probably isn’t true 99% of the time, but in the case of X-Plane, we’ve got a number of specialized allocators (hrm - future blog?) that give us better memory performance than we could get by just newing and deleting objects. (This is indeed the 0.001% that C++ caters to.)

As a straw-man, I’m imagining a language where you have to declare your intention to sin. Basically the rules of the language are restricted until you apply some kind of attribute, similar to static. So most classes would work the slow way, e.g. automatic initialization, perhaps managed memory, who knows, but then when you tag a class as low-level, you assume responsibility for all aspects of the environment.

My guess is that we’d have to apply such a tag to a very small number of classes in X-Plane, and thus we’d get better compiler support for most of our code.

Sunday, May 07, 2006

std::string + __FILE__ = malloc

So you start off like this:

#define CHECK_ERR(x) __CHECK_ERR(x,__FILE__,__LINE__)
void __CHECK_ERR(const char * msg, const char * file, int line)
{
    if (g_error != 0)
        printf("An error happened: %s (%s: %d.)\n", msg, file, line);
}

That’s a little goofy but we actually use something like this in X-Plane as a rapid debug-only way to spot OpenGL errors. The __FILE__ and __LINE__ macros give us pin-point messages about where the error was caught even on machines where we can’t easily attach a debugger.

Then later on you decide to embrace the STL string and do this:

void __CHECK_ERR(const string& msg, const string& file, int line);

Ouch. The danger (well, one of the dangers) of C++ is that it will change from a very low level to a very high level of abstraction, changing the underlying implementation from something fast to something slow, without ever telling you.

The problem is that despite using const string& for speed, your inputs (the message and __FILE__) are string literals. This code will thus create a new string object based on the const char * constructor every time this function is called, and clean up the object when done. In other words, __CHECK_ERR, which was a single if statement, now allocates and deallocates memory. Ouch! Pepper error checks liberally around the tight loops that emit OpenGL calls and you’ll really see performance suffer.

This is a fundamental problem in picking between char * and std::string. If your API is declared as char *s but your internal implementation and client code are STL strings, the conversion means a slow allocation of memory where you could have had reference counting. But if your client code and implementation use char *s and the interface uses STL strings, you allocate a string that isn’t needed.

Our solution with X-Plane is to know our client; virtually all routines use STL strings, except the cases where we know we’re going to be fed by a string literal, like a __FILE__ macro. In those few cases we revert to const char *s and convert to STL strings at the latest possible moment to avoid the memory allocation.
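Here is a sketch of "convert at the latest possible moment". The names check_err_impl and g_last_report are hypothetical (g_last_report just stands in for wherever the message would really go), and std::to_string assumes a C++11 compiler.

```cpp
#include <string>

int         g_error = 0;      // hypothetical error flag, set elsewhere
std::string g_last_report;    // hypothetical sink standing in for a log

// Takes const char * so that string literals and __FILE__ cost nothing
// on the common no-error path.
void check_err_impl(const char * msg, const char * file, int line)
{
    if (g_error == 0)
        return;               // no std::string is ever constructed here

    // Only now, when there is something to report, do we pay for strings.
    g_last_report = std::string(msg) + " (" + file + ": "
                  + std::to_string(line) + ".)";
}

#define CHECK_ERR(msg) check_err_impl((msg), __FILE__, __LINE__)
```

On the fast path this is back to a single if statement; the allocation only happens once an error has actually been detected, when speed no longer matters.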

Saturday, May 06, 2006

Cleanliness is next to…well, something

At a past company I used to debate the merits of various software engineering techniques with my coworkers. (When someone touched a header that was precompiled we had plenty of time to do this - our product could take hours to rebuild under Visual Studio, which I think spoke against at least certain practices, but that’s another post.) We were very focused on shipping product and helping the company’s business, so the question was: does this practice really make money or does it just make engineers happy?

One thing we’d debate was whether writing clean looking code was worth anything…certainly the compiler doesn’t care if your code looks like this:

void light_mgr::prep_lighting_state(light_type in_type, float in_coords[3])
{
    if (settings_mgr::use_slow_lights())
        setup_textured_lights  (in_type);
    else
        setup_untextured_lights(in_type);

    set_light_ref(in_coords);
}

or this

void light_mgr::PrepLightingState(light_type t,
//async_mgr* /*fMgr*/,
// JJ - removed 4/10/02 float coo[3])
{
#if USING_NEW_LIGHTS
if (settings_mgr::use_slow_lights() && USING_NEW_LIGHTS)
setup_textured_lights(t);
#else
setup_textured_lights(2);
#endif
else
/ * Setup_untextured_lights(10);*/
Setup_untextured_lights(t);

SetLightRef(NULL); // COORDS);
}

At the time I was at least partly convinced that we software engineers tried to make things cleaner than was needed for the bottom line of the company because it’s more pleasant to work on the top code than the bottom, which gives you that warm moist feeling of wading through a swamp. I still do think as I did then that programmers are able to work in less-than-optimal code environments.

I’ve been working with Laminar Research for a while now. One thing Austin insists on is scrubbing the code base regularly. He would never leave code like the lower example in the sim. And I think there is a real benefit: ergonomics.

As programmers we sit in front of the computer screen and concentrate on code all day. I can’t speak for other programmers, but for me at least I am often bounded by concentration - that is, how long can I keep focused on many small diverse details so that the code I write is correct the first time? The answer is…a pretty long time but not forever.

To this end I think there is a benefit from keeping the code clean: I’d rather be spending my mental energy on the work at hand than on the mental translation from the lower to upper code snippet every time I look at it. I find that working on X-Plane is less tiring than other code bases I’ve looked at, and I think that that’s partly because we sweep the cruft out regularly.

Wednesday, May 03, 2006

safety_vector

Austin is undertaking a major refactoring of X-Plane right now. First a little C++: we have a lot of code that's hard-coded to work only on the user's aircraft, which is item 0 of an array. Austin wanted to easily detect every case where we were hard-coding 0 as an index and edit the code to be generic for any aircraft in our array of aircraft. This is a typical case of a refactoring where due to the pure quantity of code and lack of a single overarching abstraction we run the risk of missing cases and leaving in bugs. In the case of our flight model, doing half of our work on one plane and half on another would lead to a category of bugs that's almost impossible to detect from using the product (e.g. systems mysteriously failing only under certain intermittent conditions). I'm not normally a big fan of C++ gymnastics but here's what we came up with:

  1. We invented a named enumeration for our index into our aircraft array. Because we have finite aircraft, making an enum for each aircraft was trivial.
  2. We overloaded operators ++ and -- so we could use the enumeration in for-loops. (C++ doesn't naturally provide math operators for named enumerations.)
  3. We moved our aircraft from a real C array to a vector subclass called "safety_vector". Safety vector is a vector with the regular [] operator declared private and a new [] operator declared that takes a named enumeration rather than an integer (okay, std::size_t) as its parameter.

What does this add up to? Code where this is okay:

for (PLANE_INDEX p = plane0; p < plane8; ++p) analyse_engines(p);

But this produces a compile error:

analyse_engines[0];

Perfect! One other comment on this operation: this kind of wide-scale refactoring of the X-Plane code happens all the time. And having watched this happen for a few versions (and worked on other products with conventional life-cycles) I would have to say that continual refactoring is not a source of bugs. (What is a source of bugs I think is code interdependency, which makes regression hard by causing bugs in areas that we wouldn't expect to be affected by feature work.) The benefit of the continual refactoring is that the code doesn't have any cruft. When we go to put new features in we have a clean workspace, which I think speeds implementation.
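For the curious, the three steps can be sketched roughly as follows. This is not the X-Plane source: the four-plane enumeration, the aircraft struct and check_all are invented for illustration, and the sketch uses private inheritance rather than redeclaring the regular [] operator as private, which has the same effect of leaving the enum-taking [] as the only one callers can reach.

```cpp
#include <cstddef>
#include <vector>

// Step 1: a named enumeration for the aircraft index (4 planes assumed here).
enum PLANE_INDEX { plane0 = 0, plane1, plane2, plane3, plane_count };

// Step 2: pre-increment, so the enumeration can drive a for-loop.
inline PLANE_INDEX & operator++(PLANE_INDEX & p)
{
    p = static_cast<PLANE_INDEX>(static_cast<int>(p) + 1);
    return p;
}

// Step 3: a vector whose only visible [] takes a PLANE_INDEX.
template <typename T>
class safety_vector : private std::vector<T> {
    typedef std::vector<T> base;
public:
    explicit safety_vector(std::size_t n) : base(n) {}
    using base::size;

    T & operator[](PLANE_INDEX p)
    {
        return base::operator[](static_cast<std::size_t>(p));
    }
    const T & operator[](PLANE_INDEX p) const
    {
        return base::operator[](static_cast<std::size_t>(p));
    }
};

// Hypothetical stand-in for the real aircraft data.
struct aircraft {
    int engines_checked;
    aircraft() : engines_checked(0) {}
};

// Visits every plane by enumeration and returns how many were visited.
int check_all(safety_vector<aircraft> & planes)
{
    int visited = 0;
    for (PLANE_INDEX p = plane0; p < plane_count; ++p)
    {
        planes[p].engines_checked = 1;
        ++visited;
    }
    // planes[0].engines_checked = 1;  // does not compile: an int will not
    //                                 // convert to PLANE_INDEX implicitly
    return visited;
}
```

The whole trick rests on C++ refusing to convert an int to an enumeration without a cast, so every leftover hard-coded 0 shows up as a compile error instead of a runtime surprise.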