Tag Archives: optimization

Missions of the Reliant: Engine room, flight recorder visual, fifty-one point nine one zero

A QTKit-based video recorder is now integrated into the code. I tried about twenty ways to get it to record audio too, but between CoreAudio’s failings and QTKit’s limitations, nothing both sounded correct and remained correctly synchronized.

  1. Capture the sound output of the game and add it as a sound track to the video. Failure reason: CoreAudio provides insufficient API to do this when using OpenAL.
  2. Pipe the sound output through SoundFlower and add it as a sound track to the video. Because OpenAL is broken on OS X, this necessitated changing the *system* default audio output device to use SoundFlower. Failure reason: Because video was recorded one frame at a time, with the accompanying delays necessary to compress each frame, while the audio was recorded in realtime, synchronization was impossible.
  3. Pipe the output through SoundFlower and manipulate the audio data to solve the synchronization issues. Failure reason: QTKit, unlike the original QuickTime API, provides no API whatsoever for manipulating raw audio data in a movie.
  4. Add the sounds used by the game as tracks to the video. Failure reason: QTKit’s API again proved unequal to the task, even in the Snow Leopard version and using SPI, an approach quickly abandoned.
  5. Record each sound event, construct a sound track from those events, and add that track to the video. Failure reason: QTKit’s API once again.
  6. Forgo QTKit entirely and use FFmpeg to do the media authoring. Failure reason: The documented -itsoffset flag is not implemented by the FFmpeg commandline driver, nor correctly supported by the supporting libraries.
  7. Manually manipulate every input sound file to have the necessary time of silence at the beginning, then pipe through FFmpeg or QTKit. Failure reason: The entire effort was becoming ridiculous, and I felt my time would be better spent working on the actual game and worrying about something like that much later, especially since there was no need for it at all.

In every case, QTKit either had no API to accomplish the task or provided APIs that didn’t work correctly, and FFmpeg fared no better. I wasn’t able to drop back to the old QuickTime API because it isn’t supported in 64-bit code, and I intended this game to be forward-compatible.

There was one interesting side note to all this. In the process of recording video frames, I naturally ran into the issue that OpenGL and QuickTime have flipped coordinate systems relative to each other. Rather than play around with matrices, I wrote a quick in-place pixel flipping routine:

[objc]
- (void)addFrameFromOpenGLAreaOrig:(NSRect)rect
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSUInteger w = rect.size.width, h = rect.size.height,
               rowBytes = w * sizeof(uint32_t), i = 0, j = 0, rowQs = rowBytes >> 3;
    void *bytes = [[NSMutableData dataWithLength:h * rowBytes] mutableBytes];
    uint64_t *p = (uint64_t *)bytes, *r = NULL, *s = NULL;
    NSImage *image = [[[NSImage alloc] init] autorelease];

    glReadPixels(rect.origin.x, rect.origin.y, w, h, GL_RGBA, GL_UNSIGNED_BYTE, bytes);
    for (i = 0; i < h >> 1; ++i)
        for (j = 0, r = p + (i * rowQs), s = p + ((h - 1 - i) * rowQs); j < rowQs; ++j, ++r, ++s)
            *r ^= *s, *s ^= *r, *r ^= *s;
    [image addRepresentation:[[[NSBitmapImageRep alloc]
        initWithBitmapDataPlanes:(unsigned char **)&bytes
                      pixelsWide:w
                      pixelsHigh:h
                   bitsPerSample:8
                 samplesPerPixel:4
                        hasAlpha:YES
                        isPlanar:NO
                  colorSpaceName:NSDeviceRGBColorSpace
                    bitmapFormat:0
                     bytesPerRow:rowBytes
                    bitsPerPixel:32] autorelease]];
    [self addFrame:image];
    [pool drain];
}
[/objc]

No doubt the more skilled among you can see the ridiculous inefficiency of that approach. Through staring at the code a great deal, I was able to reduce it to:

[objc]
- (void)addFrameFromOpenGLArea:(NSRect)rect
{
    // All of this code assumes at least 16-byte aligned width and height.
    // Start r at the top row. Start s at the bottom row.
    // For each row, swap rowBytes bytes (in 8-byte chunks) of r and s, incrementing both.
    // w is the number of 8-byte chunks in two rows:
    //   rb = w * 4, rq = rb / 8, times two rows = ((w * 4) / 8) * 2 = (w / 2) * 2 = w
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSUInteger w = rect.size.width, h = rect.size.height, i;
    uint64_t *p = malloc(h * w << 2), *r = p, *s = p + ((h - 1) * (w >> 1));
    NSImage *image = [[[NSImage alloc] init] autorelease];

    glReadPixels(rect.origin.x, rect.origin.y, w, h, GL_RGBA, GL_UNSIGNED_BYTE, p);
    for (; s > r; s -= w)
        for (i = 0; i < w; i += 2)
            *r ^= *s, *s ^= *r, *r++ ^= *s++;
    [image addRepresentation:[[[NSBitmapImageRep alloc]
        initWithBitmapDataPlanes:(unsigned char **)&p
                      pixelsWide:w
                      pixelsHigh:h
                   bitsPerSample:8
                 samplesPerPixel:4
                        hasAlpha:YES
                        isPlanar:NO
                  colorSpaceName:NSDeviceRGBColorSpace
                    bitmapFormat:0
                     bytesPerRow:w << 2
                    bitsPerPixel:32] autorelease]];
    [self addFrame:image];
    free(p);
    [pool drain];
}
[/objc]

Much better, but still pretty inefficient when the size of every single frame is the same. Why keep redoing all those width/height calculations and buffer allocation and defeat loop unrolling? So I wrote a specialized version for 640×480 frames, with all the numbers precalculated.

[objc]
- (void)addFrameFromOpenGL640480
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    if (frameBuffer == NULL)
        frameBuffer = malloc(1228800);
    glReadPixels(0, 0, 640, 480, GL_RGBA, GL_UNSIGNED_BYTE, frameBuffer);

    register uint64_t i, *r = frameBuffer, *s = r + 153280;
    for (; s > r; s -= 640)
        for (i = 0; i < 40; ++i)
        {
            *r ^= *s, *s ^= *r, *r++ ^= *s++;
            *r ^= *s, *s ^= *r, *r++ ^= *s++;
            *r ^= *s, *s ^= *r, *r++ ^= *s++;
            *r ^= *s, *s ^= *r, *r++ ^= *s++;
            *r ^= *s, *s ^= *r, *r++ ^= *s++;
            *r ^= *s, *s ^= *r, *r++ ^= *s++;
            *r ^= *s, *s ^= *r, *r++ ^= *s++;
            *r ^= *s, *s ^= *r, *r++ ^= *s++;
        }

    NSImage *image = [[[NSImage alloc] init] autorelease];
    [image addRepresentation:[[[NSBitmapImageRep alloc]
        initWithBitmapDataPlanes:(unsigned char **)&frameBuffer
                      pixelsWide:640
                      pixelsHigh:480
                   bitsPerSample:8
                 samplesPerPixel:4
                        hasAlpha:YES
                        isPlanar:NO
                  colorSpaceName:NSDeviceRGBColorSpace
                    bitmapFormat:0
                     bytesPerRow:2560
                    bitsPerPixel:32] autorelease]];
    [self addFrame:image];
    [pool drain];
}
[/objc]

I took a look at the code the compiler produces at -O2, and I’m fairly sure that its assembly will run parallelized, though not actually vectorized.

Yes, I’m fully aware that glReadPixels() is slower than creating a texture. I was testing my optimization skill on the low-level C stuff, not the entire routine. I only regret I didn’t have the patience to try doing it in raw SSE3 assembly, because I recognize an algorithm like this as being ideally suited to vector operations.

Floating-point makes my brain melt

Sometimes it’s better to do the obvious than try to be “correct”.

How would you check if a floating-point variable x was zero? Common sense says x == 0.0 should work, right? But the compiler gets cranky about floating-point compares, yet zero is certainly a valid sentinel value even for floating-point. So I found the fpclassify() function. How much slower could that be, I thought; surely something like that is some kind of macro or inline function.

I made an assumption. Oops.

Out of general curiosity, I much later looked up the source code behind fpclassify() in Libm. Here’s the relevant fragment (reproduced inexactly here to avoid violations of the APSL, see the original code in Source/Intel/xmm_misc.c in the Libm-315 project at http://opensource.apple.com if curious):

if (__builtin_expect(fabs(d) == 0.0, 0))
    return FP_ZERO;

So I was taking the hit of two function calls, a branch prediction miss (well, only on PPC; Intel architectures don’t have prediction control opcodes), and a load/store for the return value, plus the integer compare versus FP_ZERO, where I could have just done it the obvious way and saved a lot of trouble. Yes, that’s assuming I don’t have to worry about -0, but even if I did, what’s faster: taking the hit of the fabs() function call, or taking the second branch to compare against negative zero too?

For reference, fabs() on a double, implemented in copysign.s, is written in assembly as a packed compare equal word instruction, a packed logical quadword right shift, a packed bitwise double-precision AND, and a ret. Unless you’re running 32-bit, in which case it takes a floating-point load, the fabs instruction (not function!), and the ret. I tend to assume this means the SSE instructions are faster on 64-bit due to parallelization somehow, but 32-bit definitely loses out on that stack load where the 64-bit code works purely on register operands. I would also assume they use the x87 instructions on 32-bit because only on 64-bit can they be sure the 128-bit SSE instructions are present.

Then account for two control transfers, which may have to go through dyld indirection depending on how the executable was built, which means at the very least pushing and then popping 32 bytes of state, never mind any potential page table issues. It’s a damn silly hit to take if I never have to worry about negative zero! I can safely guess, without even running a benchmark, that x == 0.0 is a whole heck of a lot faster than fpclassify(x) == FP_ZERO.

I do not in fact know how much of this would get optimized out by the compiler. GCC and LLVM are both pretty good with that kinda thing. But there’s no __builtin_fpclassify() in GCC 4.2! It doesn’t exist until at least 4.3, possibly 4.4. I can’t find it in Apple’s official version of Clang either! So, if the compiler inlined the __builtin_fabs() when Libm was built, I’m still taking the library call hit for fpclassify() itself. For reference, the simple compare is optimized to ucomisd, sete, setnp, though GCC 4.2 and LLVM/Clang use different register allocations to do it.

Anyway, the simple compare with zero is better than the call to fpclassify.

Missions of the Reliant: I’m haunted by coordinate systems!

As if all the mucking about with coordinates before wasn’t bad enough, next I had to deal with unit vectors, polar/Cartesian coordinate conversion, sign adjustment vs. trigonometric functions… you get the idea.

In this case, my problem wasn’t caused by needing to update the algorithms Mike used at all, but rather by my need to replace the old MacToolbox FixRatio() and AngleFromSlope() functions with modern trigonometrics. Now, I’d already done all this, or else the impulse and warp drives would never have worked for this long, but in poking about in the implementation for mobile enemies, I realized I’d have to generalize the code, or else end up repeating it in about a dozen places, a well-known recipe for disaster.

In literal code, warp speed goes like this:

[c]
double diffx = warpCoordX - playerCoordX, diffy = warpCoordY - playerCoordY,
       theta_raw = atan2(-diffy, diffx),
       theta = round(fma(theta_raw, RAD_TO_DEG_FACTOR, 360 * signbit(theta_raw))),
       // make use of theta in degrees here to calculate a turn factor
       maxSpeed = warpSpeed,
       newDeltaX = cos(theta * DEG_TO_RAD_FACTOR), newDeltaY = -sin(theta * DEG_TO_RAD_FACTOR),
       finalDeltaX = playerDeltaX + newDeltaX, finalDeltaY = playerDeltaY + newDeltaY;
if (fabs(finalDeltaX) >= fabs(maxSpeed) * newDeltaX) finalDeltaX = maxSpeed + newDeltaX;
if (fabs(finalDeltaY) >= fabs(maxSpeed) * newDeltaY) finalDeltaY = maxSpeed + newDeltaY;
[/c]

Conceptually, this reads:

  1. Calculate the difference between the player’s current position and the destination in Cartesian coordinates.
  2. Take the arctangent of the Cartesian coordinates, adjusted for inverted Y, converted to degrees and adjusted to the range [0,360] (atan2() returns [-π,π]).
  3. Convert the polar coordinates (using the implied magnitude of 1) back to Cartesian coordinates.
  4. Calculate the movement delta with a speed limit.

Why, one wonders, do I do a Cartesian->polar conversion, only to immediately convert back to Cartesian again? Answers: 1) I need the angle to calculate which way to turn the ship towards its destination. 2) The distance between current position and destination is a vector of (usually) rather high magnitude; I need to normalize that vector to get a delta. And the formula for normalizing a Cartesian vector is x/sqrt(x*x+y*y), y/sqrt(x*x+y*y). Two multiplies, an add, a square root, and two divides, all floating point. Without benchmarking I still intuitively think that’s slower (and I’m SURE it’s conceptually more confusing) than cos(atan2(-y, x)), -sin(atan2(-y, x)), two negations, an arctangent, a sine, and a cosine. Maybe I’m crazy.

Of course, typing all this out made me realize that I can, in fact, eliminate the degree/radian conversion entirely, as well as the range adjustment, by changing the conditional in the turn calculation. Once again I fell for the trap of not thinking my way through the code I was porting. At least you weren’t as bad at geometry as me, Mike :-).

Then I had to go and get really curious and benchmark it:

[c]
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct timeval cS, cE, pS, pE;
    // Volatile prevents compiler from reading the loops as invariant and only running them once.
    volatile double x1 = 1.0, x2 = 2.5, y1 = 0.4, y2 = 3.2,
                    dx = 0.0, dy = 0.0, inter = 0.0;

    gettimeofday(&cS, NULL);
    for (int i = 0; i < 100000000; ++i)
    {
        dx = x2 - x1;
        dy = y2 - y1;
        inter = sqrt(dx*dx + dy*dy);
        dx /= inter;
        dy /= inter;
    }
    gettimeofday(&cE, NULL);

    gettimeofday(&pS, NULL);
    for (int i = 0; i < 100000000; ++i)
    {
        inter = atan2(y2 - y1, x2 - x1);
        dx = cos(inter);
        dy = sin(inter);
    }
    gettimeofday(&pE, NULL);

    struct timeval cD, pD;
    timersub(&cE, &cS, &cD);
    timersub(&pE, &pS, &pD);
    printf("Cartesian diff = %lu.%06u\n", cD.tv_sec, cD.tv_usec);
    printf("    Polar diff = %lu.%06u\n", pD.tv_sec, pD.tv_usec);
    return 0;
}
[/c]

Foot in mouth. cos|sin(atan2()) is consistently 3x slower than x|y/sqrt(x^2+y^2) at all optimization levels. Somehow I just can’t see this as being an artifact of the brutal abuse of volatile for the benchmark.

Mike got around the whole issue, in the end. Knowing that he only ever had to calculate cos() and sin() for the 36 headings 0 through 350 in ten-degree steps, he just precalculated them in a lookup table. And cutting the cosine and sine calls out of the benchmark reduces the difference between the methods to about 1.6x, making his way a win over a four-way compare/branch for the turn calculation.

Live and learn.

Oh, and changes to Missions: Again, nothing you can see in play yet. But at least now I have the right building blocks to make the enemies from.

Pointless optimization

So I was looking at the macro used to calculate 16-bit parity in pure C without branching:

[c]
#define parity(v) ({ \
    uint16_t pv = (v); \
    pv ^= (uint16_t)(pv << 8); \
    pv ^= (uint16_t)(pv << 4); \
    pv ^= (uint16_t)(pv << 2); \
    pv ^= (uint16_t)(pv << 1); \
    (uint16_t)((pv >> 15) & 0x0001); \
})
[/c]

It uses GCC’s handy compound statement syntax, but otherwise it’s plain old C. Let’s look at the 64-bit ASM this compiles to at -Os.

SHA-256 in MySQL!

I eventually got sick of having to patch MySQL with the OpenSSL SHA-2 functions every time I updated it. Never mind that there was a SHA2() function in MySQL 6.0, that 6.0 got completely yanked, and that no one ever backported it to 5.1 or 5.4… Anyway, it’s very slow (150 seconds to hash a 1MB data block), but it’s not too complex. I hereby place it in the public domain.

This code is intended to be saved to a file which can be fed to the mysql command. It has been tested using MySQL 5.1 and the three reference hashes given by http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf, appendix B. It has also been compared to PHP’s implementation and shown to produce the same message schedules and per-pass a-f values.

Update: Now using bitwise AND instead of MOD for the 32-bit math, which I should’ve done all along. It sped things up slightly, but the real CPU usage is still in the enormous number of SUBSTRING calls.

Update 2: Combined the two loops and saved a bit of unneeded data storage. Shaved another few seconds off the 2-minute runtime.

Update 3: Reformatted slightly to better fit my new blog theme. The indentations are less logically sensible this way, sadly.
