Monthly Archives: April 2010

Floating-point makes my brain melt

Sometimes it’s better to do the obvious than try to be “correct”.

How would you check whether a floating-point variable x is zero? Common sense says x == 0.0 should work, right? But the compiler gets cranky about floating-point compares, even though zero is certainly a valid sentinel value for floating-point. So I found the fpclassify() function. How much slower could that be, I thought; surely something like that is a macro or an inline function.

I made an assumption. Oops.

Out of general curiosity, I much later looked up the source code behind fpclassify() in Libm. Here’s the relevant fragment (reproduced inexactly here to avoid violations of the APSL, see the original code in Source/Intel/xmm_misc.c in the Libm-315 project at http://opensource.apple.com if curious):

if (__builtin_expect(fabs(d) == 0.0, 0))
    return FP_ZERO;

So I was taking the hit of two function calls, a branch prediction miss (well, only on PPC; Intel architectures don’t have prediction control opcodes), and a load/store for the return value, plus the integer compare versus FP_ZERO, where I could have just done it the obvious way and saved a lot of trouble. Yes, that’s assuming I don’t have to worry about -0, but even if I did, what’s faster: taking the hit of the fabs() function call, or taking a second branch to compare against negative zero too?

For reference, fabs() on a double, implemented in copysign.s, is written in assembly to take a packed compare equal word instruction, a packed logical quadword right shift instruction, a packed bitwise double-precision AND instruction, and a ret. Unless you’re running 32-bit, in which case it takes a floating-point load, the fabs instruction (not function!), and the ret. I tend to assume this means the SSE instructions are faster on 64-bit due to parallelization somehow, but 32-bit definitely loses out on that stack load where the 64-bit code works purely on register operands. I would also assume they use the x87 instructions on 32-bit because only on 64-bit can they be sure the 128-bit SSE instructions are present.

Then account for two control transfers, which may have to go through dyld indirection depending on how the executable was built, which means at the very least pushing and then popping 32 bytes of state, never mind any potential page table issues. It’s a damn silly hit to take if I never have to worry about negative zero! I can safely guess, without even running a benchmark, that x == 0.0 is a whole heck of a lot faster than fpclassify(x) == FP_ZERO.

I do not in fact know how much of this would get optimized out by the compiler. GCC and LLVM are both pretty good with that kinda thing. But there’s no __builtin_fpclassify() in GCC 4.2! It doesn’t exist until at least 4.3, possibly 4.4. I can’t find it in Apple’s official version of Clang either! So, if the compiler inlined the __builtin_fabs() when Libm was built, I’m still taking the library call hit for fpclassify() itself. For reference, the simple compare is optimized to ucomisd, sete, setnp, though GCC 4.2 and LLVM/Clang use different register allocations to do it.

Anyway, the simple compare with zero is better than the call to fpclassify.

Missions of the Reliant: Something to see at last!

There is now a user-visible change to Missions from all the background work I’ve been doing!

  1. The warp drive physics have been corrected to match the original game: the drive no longer ignores decreases in warp speed or turns the ship the wrong way around.

It’s something, at least. You don’t want to know how much math went into that.

Speaking of which, sqrt(x*x + y*y) turned out to be faster than the polar/Cartesian conversion, partly because an efficient function, hypot(), already exists to perform exactly that calculation, and the compiler was converting my sqrt() call to __builtin_hypot() for me. The compiler’s smarter than me.

Stay tuned for news that I’ve finally managed to implement the fighter AI. It is coming, I promise!

Missions of the Reliant: I’m haunted by coordinate systems!

As if all the mucking about with coordinates before wasn’t bad enough, next I had to deal with unit vectors, polar/Cartesian coordinate conversion, sign adjustment vs. trigonometric functions… you get the idea.

In this case, my problem wasn’t caused by needing to update the algorithms Mike used at all, but rather by my need to replace the old MacToolbox FixRatio() and AngleFromSlope() functions with modern trigonometrics. Now, I’d already done all this, or else the impulse and warp drives would never have worked for this long, but in poking about in the implementation for mobile enemies, I realized I’d have to generalize the code, or else end up repeating it in about a dozen places, a well-known recipe for disaster.

In literal code, warp speed goes like this:

double diffx = warpCoordX - playerCoordX, diffy = warpCoordY - playerCoordY,
       theta_raw = atan2(-diffy, diffx), theta = round(fma(theta_raw, RAD_TO_DEG_FACTOR, 360 * signbit(theta_raw))),
       // make use of theta in degrees here to calculate a turn factor
       maxSpeed = warpSpeed,
       newDeltaX = cos(theta * DEG_TO_RAD_FACTOR), newDeltaY = -sin(theta * DEG_TO_RAD_FACTOR),
       finalDeltaX = playerDeltaX + newDeltaX, finalDeltaY = playerDeltaY + newDeltaY;
if (fabs(finalDeltaX) >= fabs(maxSpeed) * newDeltaX) finalDeltaX = maxSpeed + newDeltaX;
if (fabs(finalDeltaY) >= fabs(maxSpeed) * newDeltaY) finalDeltaY = maxSpeed + newDeltaY;

Conceptually, this reads:

  1. Calculate the difference between the player’s current position and the destination in Cartesian coordinates.
  2. Take the arctangent of the Cartesian coordinates, adjusted for inverted Y, converted to degrees, and adjusted to the range [0,360) (atan2() returns values in [-π,π]).
  3. Convert the polar coordinates (using the implied magnitude of 1) back to Cartesian coordinates.
  4. Calculate the movement delta with a speed limit.

Why, one wonders, do I do a Cartesian->polar conversion, only to immediately convert back to Cartesian again? Answers: 1) I need the angle to calculate which way to turn the ship towards its destination. 2) The distance between current position and destination is a vector of (usually) rather high magnitude; I need to normalize that vector to get a delta. And the formula for normalizing a Cartesian vector is x/sqrt(x*x+y*y), y/sqrt(x*x+y*y). Two multiplies, an add, a square root, and two divides, all floating point. Without benchmarking I still intuitively think that’s slower (and I’m SURE it’s conceptually more confusing) than cos(atan2(-y, x)), -sin(atan2(-y, x)), two negations, an arctangent, a sine, and a cosine. Maybe I’m crazy.

Of course, typing all this out made me realize that I can, in fact, eliminate the degree/radian conversion entirely, as well as the range adjustment, by changing the conditional in the turn calculation. Once again I fell for the trap of not thinking my way through the code I was porting. At least you weren’t as bad at geometry as me, Mike :-).

Then I had to go and get really curious and benchmark it:

#include <math.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct timeval cS, cE, pS, pE;
    // Volatile prevents compiler from reading the loops as invariant and only running them once.
    volatile double x1 = 1.0, x2 = 2.5, y1 = 0.4, y2 = 3.2, dx = 0.0, dy = 0.0, inter = 0.0;

    gettimeofday(&cS, NULL);
    for (int i = 0; i < 100000000; ++i)
    {
        dx = x2 - x1;
        dy = y2 - y1;
        inter = sqrt(dx*dx + dy*dy);
        dx /= inter;
        dy /= inter;
    }
    gettimeofday(&cE, NULL);

    gettimeofday(&pS, NULL);
    for (int i = 0; i < 100000000; ++i)
    {
        inter = atan2(y2 - y1, x2 - x1);
        dx = cos(inter);
        dy = sin(inter);
    }
    gettimeofday(&pE, NULL);

    struct timeval cD, pD;
    timersub(&cE, &cS, &cD);
    timersub(&pE, &pS, &pD);
    printf("Cartesian diff = %ld.%06d\n", (long)cD.tv_sec, (int)cD.tv_usec);
    printf("    Polar diff = %ld.%06d\n", (long)pD.tv_sec, (int)pD.tv_usec);
    return 0;
}

Foot in mouth. cos|sin(atan2()) is consistently 3x slower than x|y/sqrt(x^2+y^2) at all optimization levels. Somehow I just can’t see this as being an artifact of the brutal abuse of volatile for the benchmark.

Mike got around the whole issue, in the end. Knowing that he only ever had to calculate cos() and sin() for the 36 angles 0 through 350 in steps of 10 degrees, he just precalculated them in a lookup table. And cutting the cosine and sine calls out of the benchmark reduces the difference between the methods to about 1.6x, making his way a win over a four-way compare/branch for the turn calculation.

Live and learn.

Oh, and changes to Missions: Again, nothing you can see in play yet. But at least now I have the right building blocks to make the enemies from.

Missions of the Reliant: Math is fun, or why I wish I hadn’t flunked geometry

At last, an update!

  1. Absolutely nothing visible to the user has changed whatsoever.
  2. The internal structure of the code has been significantly reorganized.

As with the lament of all programmers faced with the demands of the technologically disinclined, I’ve accomplished a great deal, but since it can’t be seen, it might as well be nothing at all. Wasted time, the hypothetical slave driver- I mean, boss- would say. But it isn’t, I swear to all two of you who read this blog!

And now, another math rant.

Once again, as with so many things, the way Mike did things in the original code was correct and logical for the time he did it, but doesn’t fit into the object-oriented model I’m cramming his code into, despite its pitiful cries for mercy from such rigid structure. There are days I wish we were living in times when code could be so freeform as his was and still be comprehensible, but you can’t do that in Cocoa. Oh sure, I could port all the Pascal functions 1-to-1, but the Toolbox calls would be sticky at best. Anyway, in this particular case, I was trying to wrestle with the radar range calculation.

The original code reads something vaguely like:

screenPos = Planetabs - (Playerabs - Playerscreen)
inRadarRange = n <= screenPos / 16 <= m

Translating, this means that whether or not a given entity (a planet in this case) is within radar range of the player depends on the player’s position in screen coordinates, as well as in the game’s absolute coordinate system.

In the old days, this design made a certain amount of sense. He already had the screen coordinates immediately handy, so why take the hit of indirecting through A5 to touch a global for the absolute position? However, my design makes the screen coordinates a bit dodgy to use. So I had to recalibrate n and m to represent distances in game coordinates.

Algebra to the rescue. The code above, reduced and replacing the inequalities, becomes the algebraic equation (x - (y - z)) / 16 = a, where a is the radar range coordinate. The only screen coordinate term in this equation is z, so solve to eliminate z:

(x - (y - z)) / 16 = a
x - (y - z) = 16a       - multiply both sides by 16
x - y + z = 16a         - distribute the subtraction over the parenthetical expression
x - y = 16a - z         - subtract z from both sides

But, because both a (the radar range) and z (the player’s position on screen) are actually constants, all I had to do was take Mike’s original numbers (let’s use 64 for a and 268 for z) and calculate 16*64 - 268 = 756. Then, retranslating, the equation becomes the inequality inRadarRange = (myPosition - playerPosition) <= 756;. Repeat for the lower and upper bounds of x and y coordinates, and boom, no screen coordinates at all and I can calculate whether or not an object's in radar range based on nothing but its offset from the player.

To be clear, what I did up there was to eliminate a term from the inequalities so that they could be evaluated based on the position of the given entity in game space, rather than on the position of the entity's sprite on the screen.

I can't believe it took me a week to doodle out that bit of math.

Subtle pitfalls of doing things in -dealloc

Let me describe the setup for this issue.

Take a global (a set property on a singleton object) list of objects which holds only weak references (in this case, if it held strong references, the objects would never lose their last retain and could never be deallocated; yes, I know GC avoids this):

@implementation Singleton
- (id)init
{
    if ((self = [super init]))
    {
        CFSetCallBacks cb = { 0, NULL, NULL, CFCopyDescription, CFEqual, CFHash };
        container = (NSMutableSet *)CFSetCreateMutable(kCFAllocatorDefault, 0, &cb);
    }
    return self;
}

- (NSMutableSet *)container { return container; }
- (void)setContainer:(NSSet *)value { if (value != container) { [container setSet:value]; } }
// The other KVC-compliant methods for an unordered to-many relationship

// Singleton pattern stuff here
@end

Now consider a class which, as part of its initialization and deallocation, adds itself to and removes itself from this singleton’s container:

@implementation ListedObject
- (id)init
{
    if ((self = [super init]))
    {
        [[Singleton singleton] addContainerObject:self];
    }
    return self;
}

- (void)dealloc
{
    [[Singleton singleton] removeContainerObject:self];
    [super dealloc];
}
@end

This pattern is useful when you need to track every instance of a given class – in this case, my actual code maintains the set property as a property on the class object itself rather than a separate singleton, but the result is the same and I thought I’d use something more familiar for this example.

The first time you release an instance of ListedObject to the point of deallocation, you’ll crash during your next event loop.

Next event loop? Experienced Cocoa readers will immediately guess “autorelease pool”. And they’d be right. To debug this, I added backtrace_symbols_fd() calls to -retain and -release of the ListedObject class. This may seem strange versus using a GUI tool like Instruments’ “Object Allocation” template, but I’m old-fashioned and this was simple. The object was indeed being overreleased, with the extra release coming from the main event loop’s autorelease pool.

The cause is rather intricate. At the time of the deallocation, I had a registered key-value observation on the container property. So, when the object removed itself from the list, KVO called [NSSet setWithObject:removedObject] in order to pass it as part of the change dictionary to the observer callback. This naturally went on the autorelease pool. But oops, we were already in -dealloc of the removed object, so the retain the autoreleased set tried to add was a no-op. Finally, next time through the event loop, that set was freed by the autorelease pool, and tried to call -release on the removed object, but that object had already been fully deallocated. Crash!

Now, a purist will say “stop trying to do things in -dealloc!” Others would point out that GC would bypass this entire problem. Either way, I wanted a simple solution. The problem was an autorelease pool, so just add another one!

- (void)dealloc
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    [[Singleton singleton] removeContainerObject:self];
    [pool release];
    [super dealloc];
}

KVO’s set is no longer deferred-release, and all is well.

As it happens, this uncovered an underlying issue in my code, a design flaw which essentially makes the entire debacle unnecessary because the list management needed to be in other methods rather than -init and -dealloc, but I thought it was an interesting note nonetheless.

Missions of the Reliant: Cleaning up the wreckage of the train crash

I’m back, and I didn’t give up on Missions! I’m sure there must be exactly one person out there who cares :-).

But seriously. I don’t have any new features to show at the moment, unfortunately. When I went to implement the laser cannon for the player, I realized I’d never be able to test it without something to fire at. I also realized the cannon itself would be useless without the target scanner since it has to lock onto a target. The scanner is also useless without something to scan. So, it was time to implement the base code for mobile enemies. Probably should’ve done that long ago, and here’s why…

As we all know, I’m using Objective-C to write this code. That means, among other things, that my code is object-oriented in nature. Up until this point, things like planets, starbases, and the player had all been entirely separate implementations. This is what Mike did with the original code. As always, what he did then was only sensible for the time and environment, but I can avoid a hell of a lot of code duplication by giving everything that exists in space a common superclass: a “Presence”. (Presences are themselves subclasses of the even more general “Responder”, which is used for everything that needs to process game happenings in any way, but that’s only a side note). As one can imagine, since I didn’t have the foresight to design the code this way to begin with, implementing it now required some significant refactoring.

Another issue cropped up halfway through the refactoring: The severe limitations of Apple’s built-in Key-Value Observing, which I use extensively throughout the code to avoid having to call “update this” and “update that” manually for every single affected object whenever something changes. For example, KVO doesn’t let you use blocks for callbacks, and if a superclass and a subclass both register for the same notification, there’s no way to manage the two independently. Fortunately, Michael Ash noticed these problems some time back, and created a replacement, his MAKVONotificationCenter. Unfortunately, even the updated version published by Jerry Krinock didn’t do everything I needed, at least not in a way that I found usable with blocks added to the equation. Managing observations by tracking the resulting observation objects means having lots of instance variables to hold the observations, and since I’m building for Leopard, I can’t use the new associated objects for the purpose.

“Wait a minute,” you’re saying! “Leopard? Then why are you talking about using blocks?” Answer: I’m using PLBlocks.

So, armed with PLBlocks on one side, and Michael Ash’s typically brilliant code on the other, I dove in and pretty much rewrote the entire MAKVONotificationCenter to do three things it didn’t before:

  1. Block callbacks.
  2. Tagging observations with a simple integer value.
  3. Several alternative ways of specifying groups of observations to remove, based on observer, target, key path, selector, tag, or most combinations thereof.

With that done (and unit tested, and Doxygen-documented), I’m now integrating them into my revised class hierarchy for Missions itself. With any luck, I’ll have at least a screenshot of a fighter flying around before the week is out. Stay tuned, those of you who are crazy enough to stick around for all this :-).

Footnote: I was finally able to find a way to access the original model files for the game’s graphics; with some luck and a bit of help from Mike (I’m clueless when it comes to this stuff), there may be higher-quality graphics to be seen in the screenshots soon.

Missions of the Reliant: Status

I’m sure some have been wondering why there’ve been no Missions updates lately. The answer is simple: I haven’t been working on it. I caught an awful cold last week; I’m only just now managing to get rid of it and I haven’t been able to write a line of code.

My apologies to all who await eagerly; I hope to be back on track soon.