Optimized Expressions

Developing hard-core mathematics code, such as physical simulations (constraint networks, cloth, rigid bodies, FEM), is already enough of a challenge without the added burden of having to write the code in an obscure way to avoid typical inefficiencies. Ideally, you want to devote your brain cycles to the high-level algorithms instead of the details of how you encode them into a .cpp file. Fortunately, efficient code doesn't have to be ugly. This document describes an approach called optimized expressions that makes it easier to work with mathematical code in C++ without sacrificing performance. More specifically, the burden of optimization is pushed down into the lower-level routines you care less about, while the important code that you are working on remains clean.

Magnitude Example

Consider the following line of code, which branches if the length of a vector is less than 3 units:

    if( magnitude(v) < 3.0f) ...

Many experienced programmers will immediately recognize that the expression takes a square root when it doesn't have to. Instead they would write it as:

    if( dot(v,v) < 9.0f) ...

Such programmers would typically type it this way the first time, even if the code is not in a critical section - premature and possibly unnecessary optimization. Essentially, the programmer's fingers are part of the compilation process. Through years of experience and pattern recognition, this line of code will likely be instantly recognized as a magnitude check by another game programmer with similar experience. However, a developer who is not wearing the same coke-bottle eyeglasses of automatic disassembly will have to spend more time reverse engineering the code to figure out what it is actually doing.

Clearly the initial implementation that uses magnitude() is "higher level" and corresponds most directly to the way anybody, including the hardcore types, would actually think about that line of code. Ideally it would be nice to keep that line of code exactly the way it is, yet have the program run as efficiently as if we had written it using the dot product and the square of the comparison operand. Well, even without a super-smart compiler, it is possible. Consider the following implementation of magnitude:


	class magnitude {
	 public:
	   float m2;
	   magnitude(const float3 &v) : m2(dot(v,v)) {}
	   operator float() const { return sqrtf(m2); }
	   bool operator < (float rhs) const { return m2 < rhs*rhs; }
	};

Even though magnitude is a class instead of a function, its usage remains the same: there are no changes to how you write code that uses magnitude(). The difference is how the code executes. Whenever we compare the magnitude of a vector with a float, the magnitude class squares the right-hand side of the comparison instead of taking a square root. Should we need the actual value, the cast operator completes the computation of the vector's length.
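For completeness, here is a self-contained sketch of the idea. The minimal float3 struct and dot() below are hypothetical stand-ins for whatever your math library provides:

```cpp
#include <math.h>

// Hypothetical minimal float3 and dot(), standing in for the math library.
struct float3 { float x, y, z; };
inline float dot(const float3 &a, const float3 &b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

class magnitude {
 public:
    float m2;                                    // squared length, computed up front
    magnitude(const float3 &v) : m2(dot(v,v)) {}
    operator float() const { return sqrtf(m2); }               // sqrt only when needed
    bool operator < (float rhs) const { return m2 < rhs*rhs; } // compare squares instead
};
```

A call site such as `if (magnitude(v) < 3.0f)` now compares 9.0f against the squared length and never calls sqrtf, while `float len = magnitude(v);` still yields the true length via the cast operator.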

I'm not suggesting that people have to apply this approach for such a trivial function as magnitude(). This example just illustrates the general idea of optimizing expressions without having to change them. Admittedly, the implementation of our magnitude function has been modified away from a typical implementation. However, once we know this implementation works, we don't have to worry about it again. The average game programmer will spend a typical day working on code that uses magnitude rather than the code that implements it. In general, optimized expressions pursue a form of code optimization that confines the necessary rearrangement and obfuscation to the core, well-understood basic math library routines, while the important high-level algorithms and game code remain as clean as possible.

Large Vectors and Matrices

Many applications make use of N-dimensional vectors and matrices, and many more could use generic N-dimensional vectors but were not initially written that way. For example, a physical system may be represented as an array of its parts, where each element in the array is a struct containing that part's properties, including position, orientation, velocity, angular momentum, forces, and so on. Initially this might seem more intuitive during development. However, in order to utilize certain simulation or integration techniques, it is better to think of the entire system as a whole and represent these properties as large vectors. For example, the positions of all the parts are concatenated into a single large state vector, the velocities into another big vector, and so on. Auxiliary and temporary vectors such as change in velocity (dv) are implemented as single big vectors as well. Integration can then be viewed as operations on these larger vectors, and the algorithms will often fit standard off-the-shelf linear methods. After this refactoring, the code for the physical system will be a structure of arrays instead of an array of structs. It's not just source code: many papers on physical modeling present their research using this type of representation.

I'd suggest an array of float3s encapsulated in a class called float3N or something. Then you can index into it nicely without multiplying by 3, and you can write operators and functions that work on these big vectors. Note that this is just a suggestion; coding style is an individual choice.
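A minimal sketch of such a class (the name float3N and its interface here are just the suggestion above, not a fixed API):

```cpp
#include <vector>

struct float3 { float x, y, z; };   // stand-in for the math library's float3

// Hypothetical float3N: a big vector stored as an array of float3s,
// indexed by part number rather than by raw float offset.
class float3N {
 public:
    explicit float3N(int n = 0) : element(n) {}
    int size() const { return (int)element.size(); }
    float3 &operator[](int i)             { return element[i]; }  // v[i], not v[i*3]
    const float3 &operator[](int i) const { return element[i]; }
 private:
    std::vector<float3> element;
};
```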

Optimization will be important since any physical simulation implementation is always judged by performance. A simple implementation of a big vector and sparse matrix library probably won't be efficient if you are using C++ overloaded operators (+-*/). There are likely to be lots of big vector objects constructed, memory allocated, and excessive copying. For example, look at this line of code:

 V = A*s + B*t;  // V,A,B are big vectors,  s,t are scalars 

Using typical operator overloading, the scalar multiply operations will each create temporary big vectors and likely allocate memory to do so. Then these two new vectors are added together into another newly allocated vector which then gets copied into the final destination 'V'.

All of this work isn't necessary and should be avoided. One option is to write inline for loops for any expression that iterates over the elements of the arrays. This turns every such line of code into many lines and makes the code harder to read. While computers are good at boring repetitive tasks, programmers are not, so this approach is likely to introduce bugs along the way, since every single mathematical expression has to be converted by hand. The internals of the big vector classes are exposed at every usage, which defeats the idea of encapsulation and abstraction.
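For illustration, here is what the hand-expanded loop for V = A*s + B*t looks like over plain float arrays (wrapped in a function only so the sketch is self-contained; the point is that every such statement in the program would carry its own copy of this loop):

```cpp
// Hand-written loop for V = A*s + B*t over raw float arrays of length n;
// every big-vector expression in the program needs its own loop like this.
void vec_sum_of_muls(float *V, const float *A, float s,
                     const float *B, float t, int n) {
    for (int i = 0; i < n; i++)
        V[i] = A[i]*s + B[i]*t;
}
```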

Another option is to go back to C-style programming. Referring to our vector addition example we would provide a single function to do the job:

 VecSumOfMuls(V,A,s,B,t);

At least now we won't need a for loop for each statement that happens to do something with variables of our big vector type; the same function can be used for any similar line of code. Unfortunately, the meaning isn't obvious without looking up the function in a header file. Furthermore, different math library programmers will likely pick alternative argument orderings and different function names, since there is no obvious single correct standard. The clearest way of writing this line of code would simply be the universal standard:

 V = A*s + B*t; 

That's what we'd really like to see in our code. Fortunately, efficient execution of such code can be achieved using an expression optimization system. Recall that how this expression gets processed depends on the implementation of the operators. As in our magnitude example, the idea is that the operators do not cause any processing to occur; instead they simply return expression objects that can be combined with other expression objects. Only the final assignment triggers execution of the entire expression, which is implemented in a single function. You will need one such class for each expression form you use in your code (and the subexpressions leading up to it). Even the longer mathematical lines of code consist of only a handful of terms, so in practice you probably won't need too many expression classes. See the comments and code in the vec3n.h header file for more details and the implementation. By using this library, the standard conjugate gradient algorithm can be cleanly implemented: conjgrad.cpp.
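A minimal sketch of the mechanism, using a hypothetical bigvec class holding plain floats (the real vec3n.h works on arrays of float3s, but the idea is the same):

```cpp
#include <vector>

struct SumOfScaled;   // forward declaration: the A*s + B*t expression form

// Hypothetical big vector; assignment from an expression runs one fused loop.
class bigvec {
 public:
    explicit bigvec(int n = 0) : data(n) {}
    int size() const { return (int)data.size(); }
    float &operator[](int i)             { return data[i]; }
    const float &operator[](int i) const { return data[i]; }
    bigvec &operator=(const SumOfScaled &e);   // defined below
 private:
    std::vector<float> data;
};

struct Scaled      { const bigvec &a; float s; };   // represents A*s
struct SumOfScaled { Scaled l, r; };                // represents A*s + B*t

// The operators do no math; they just build lightweight expression objects.
inline Scaled operator*(const bigvec &a, float s) { Scaled e = {a, s}; return e; }
inline SumOfScaled operator+(const Scaled &l, const Scaled &r) {
    SumOfScaled e = {l, r}; return e;
}

// Only the assignment triggers execution: one loop, no big temporaries.
inline bigvec &bigvec::operator=(const SumOfScaled &e) {
    data.resize(e.l.a.size());
    for (int i = 0; i < size(); i++)
        data[i] = e.l.a[i]*e.l.s + e.r.a[i]*e.r.s;
    return *this;
}
```

With this in place, `V = A*s + B*t;` compiles into exactly one loop with no heap-allocated temporaries; the Scaled and SumOfScaled objects are tiny stack objects the optimizer can dissolve.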

What About Template MetaProgramming

One approach is to write a template metaprogramming implementation. It's like the expression optimization approach described in the previous section, except that template compiler tricks are used to automatically generate the required classes instead of programming them by hand.

Feel free to look at the vec3ntm.h header file for an example of how you might implement a large vector/matrix library using template metaprogramming. You could probably find better samples out on the web.

While it's certainly sexy, metaprogramming isn't always that great for practical use. It requires that your compiler properly recurse and produce minimal code. You only get the optimization in 'release' mode, and you have to inspect the resulting assembly code to verify that it produced the optimized result you were looking for. There is less control over the result since you didn't implement it directly. Stepping through the code in debug mode can be very confusing; only by implementing template metaprogramming code yourself would you have much hope of understanding it.
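To make the comparison concrete, here is a compact expression-template sketch in the classic CRTP style; this is a generic illustration, not the contents of vec3ntm.h. The node classes the previous section wrote by hand are now generated by the templates:

```cpp
#include <vector>

// Base class identifying any expression node (CRTP).
template <class E>
struct Expr {
    const E &self() const { return static_cast<const E &>(*this); }
};

// The big vector; assignment from any expression runs one fused loop.
class Vec : public Expr<Vec> {
 public:
    explicit Vec(int n = 0) : data(n) {}
    int size() const { return (int)data.size(); }
    float  operator[](int i) const { return data[i]; }
    float &operator[](int i)       { return data[i]; }
    template <class E> Vec &operator=(const Expr<E> &e) {
        const E &x = e.self();
        data.resize(x.size());
        for (int i = 0; i < size(); i++) data[i] = x[i];  // one pass, no temporaries
        return *this;
    }
 private:
    std::vector<float> data;
};

// Node types the compiler instantiates for each expression form.
template <class L, class R>
struct Add : public Expr< Add<L, R> > {
    Add(const L &l, const R &r) : l(l), r(r) {}
    int size() const { return l.size(); }
    float operator[](int i) const { return l[i] + r[i]; }
    const L &l; const R &r;
};

template <class E>
struct Scale : public Expr< Scale<E> > {
    Scale(const E &e, float s) : e(e), s(s) {}
    int size() const { return e.size(); }
    float operator[](int i) const { return e[i] * s; }
    const E &e; float s;
};

template <class L, class R>
Add<L, R> operator+(const Expr<L> &l, const Expr<R> &r) {
    return Add<L, R>(l.self(), r.self());
}
template <class E>
Scale<E> operator*(const Expr<E> &e, float s) {
    return Scale<E>(e.self(), s);
}
```

Here `V = A*s + B*t;` builds an `Add<Scale<Vec>, Scale<Vec>>` node on the stack, and the templated assignment collapses it into a single loop; whether the compiler actually flattens it all away is exactly the release-mode concern described above.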

Further Optimization

Optimization shouldn't stop with simple code rearrangement. In addition to optimized expressions, you may wish to explore 128-bit SIMD acceleration. Unlike metaprogramming, it is fairly easy to do this in conjunction with the more straightforward optimized expression system. For big vector libraries, there are two approaches you can take. One option is to keep the vectors tightly packed and treat each 128-bit SIMD register as simply processing 4 consecutive floats at a time: x0y0z0x1, then y1z1x2y2, and so on. This works for some operations (but not all) where it is possible to think of them as happening on the 'big' vectors. Alternatively, one can pad the float3s so they are actually float4s. This guarantees memory alignment when doing the sparse block matrix multiply. The downside is the increased memory cost, which implies a performance cost. I don't know which is the better strategy here; you'd really have to test both yourself.
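As a sketch of the padded-float4 option, here is what the fused loop for V = A*s + B*t might look like with SSE intrinsics. The function name is hypothetical, and unaligned loads are used so the sketch runs regardless of allocator alignment; with guaranteed 16-byte alignment you would switch to _mm_load_ps/_mm_store_ps:

```cpp
#include <xmmintrin.h>   // SSE intrinsics; assumes an x86 target

// V = A*s + B*t where each element is a padded float4 (one float3 plus an
// unused w component), so each SSE operation handles one whole part.
void vec_sum_of_muls_simd(float *V, const float *A, float s,
                          const float *B, float t, int nfloat4) {
    __m128 ss = _mm_set1_ps(s), tt = _mm_set1_ps(t);
    for (int i = 0; i < nfloat4; i++) {
        __m128 a = _mm_loadu_ps(A + i*4);
        __m128 b = _mm_loadu_ps(B + i*4);
        _mm_storeu_ps(V + i*4,
            _mm_add_ps(_mm_mul_ps(a, ss), _mm_mul_ps(b, tt)));
    }
}
```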

Concluding Remarks

There is a reason why developers use high-level languages - it makes them more productive. The value of the source code does not end once the initial binary executables have been compiled. Source code can be modified or reused in other applications. However, source code typically gets chopped up and rearranged due to optimizations, feature hacks, and poorly implemented bug fixes. After this happens, the now unreadable code has less potential for future reuse. So to maximize the value of the ASCII text in a .cpp file, it's best to keep it true to the algorithm it implements. After all, that's why it's called the "source" - something that is effectively lost when a "human compilation step" is added to the development process.


S Melax 2006