A QTKit-based video recorder is now integrated into the code. I tried about twenty ways to get it to record audio too, but between CoreAudio’s failings and QTKit’s limitations, nothing both sounded correct and remained correctly synchronized.
- Capture the sound output of the game and add it as a sound track to the video. Failure reason: CoreAudio provides insufficient API to do this when using OpenAL.
- Pipe the sound output through SoundFlower and add it as a sound track to the video. Because OpenAL is broken on OS X, this necessitated changing the *system* default audio output device to use SoundFlower. Failure reason: Because video was recorded one frame at a time, with the accompanying delays necessary to compress each frame, while the audio was recorded in realtime, synchronization was impossible.
- Pipe the output through SoundFlower and manipulate the audio data to solve the synchronization issues. Failure reason: QTKit, unlike the original QuickTime API, provides no API whatsoever for manipulating raw audio data in a movie.
- Add the sounds used by the game as tracks to the video. Failure reason: QTKit’s API again proved unequal to the task, even in the Snow Leopard version and using SPI, an approach quickly abandoned.
- Record each sound event, construct a sound track from those events, and add that track to the video. Failure reason: QTKit’s API once again.
- Forgo QTKit entirely and use FFmpeg to do the media authoring. Failure reason: The documented
-itsoffsetflag is not implemented by the FFmpeg commandline driver, nor correctly supported by the supporting libraries. - Manually manipulate every input sound file to have the necessary time of silence at the beginning, then pipe through FFmpeg or QTKit. Failure reason: The entire effort was becoming ridiculous, and I felt my time would be better spent working on the actual game and worrying about something like that much later, especially since there was no need for it at all.
In every case, QTKit either had no API to accomplish the task, or its provided APIs didn’t work correctly, as with FFmpeg. I wasn’t able to drop back to the old QuickTime API because it isn’t supported in 64-bit code and I intended this game to be forward-compatible.
There was one interesting side note to all this. In the process of recording video frames, I naturally ran into the issue that OpenGL and QuickTime have flipped coordinate systems relative to each other. Rather than play around with matrices, I wrote a quick in-place pixel flipping routine:
- (void)addFrameFromOpenGLAreaOrig:(NSRect)rect { NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init]; NSUInteger w = rect.size.width, h = rect.size.height, rowBytes = w * sizeof(uint32_t), i = 0, j = 0, rowQs = rowBytes >> 3; void *bytes = [[NSMutableData dataWithLength:h * rowBytes] mutableBytes]; uint64_t *p = (uint64 *)bytes, *r = NULL, *s = NULL; NSImage *image = [[[NSImage alloc] init] autorelease]; glReadPixels(rect.origin.x, rect.origin.y, w, h, GL_RGBA, GL_UNSIGNED_BYTE, bytes); for (i = 0; i < h >> 1; ++i) for (j = 0, r = p + (i * rowQs), s = p + ((h - i) * rowQs); j < rowQs; ++j, ++r, ++s) *r ^= *s, *s ^= *r, *r ^= *s; [image addRepresentation:[[[NSBitmapImageRep alloc] initWithBitmapDataPlanes:(unsigned char **)&bytes pixelsWide:w pixelsHigh:h bitsPerSample:8 samplesPerPixel:4 hasAlpha:YES isPlanar:NO colorSpaceName:NSDeviceRGBColorSpace bitmapFormat:0 bytesPerRow:rowBytes bitsPerPixel:32] autorelease]]; [self addFrame:image]; [pool drain]; }
No doubt the more skilled among you can see the ridiculous inefficiency of that approach. Through staring at the code a great deal, I was able to reduce it to:
- (void)addFrameFromOpenGLArea:(NSRect)rect { // All of this code assumes at least 16-byte aligned width and height // Start r at top row. Start s at bottom row. // For each row, swap rowBytes bytes (in 8-byte chunks) of r and s, incrementing r and s. // Width = the number of 8-byte chunks in two rows (rb = w * 4, rq = rb / 8, times two rows = ((w*4)/8)*2 = (w/2)*2 = w NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init]; NSUInteger w = rect.size.width, h = rect.size.height, i; uint64_t *p = malloc(h * w << 2), *r = p, *s = p + (h * (w >> 1)); NSImage *image = [[[NSImage alloc] init] autorelease]; glReadPixels(rect.origin.x, rect.origin.y, w, h, GL_RGBA, GL_UNSIGNED_BYTE, p); for (; s > r; s -= w) for (i = 0; i < w; i += 2) *r ^= *s, *s ^= *r, *r++ ^= *s++; [image addRepresentation:[[[NSBitmapImageRep alloc] initWithBitmapDataPlanes:(unsigned char **)&p pixelsWide:w pixelsHigh:h bitsPerSample:8 samplesPerPixel:4 hasAlpha:YES isPlanar:NO colorSpaceName:NSDeviceRGBColorSpace bitmapFormat:0 bytesPerRow:w << 2 bitsPerPixel:32] autorelease]]; [self addFrame:image]; free(p); [pool drain]; }
Much better, but still pretty inefficient when the size of every single frame is the same. Why keep redoing all those width/height calculations and buffer allocation and defeat loop unrolling? So I wrote a specialized version for 640×480 frames, with all the numbers precalculated.
- (void)addFrameFromOpenGL640480 { NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init]; if (frameBuffer == NULL) frameBuffer = malloc(1228800); glReadPixels(0, 0, 640, 480, GL_RGBA, GL_UNSIGNED_BYTE, frameBuffer); register uint64_t i, *r = frameBuffer, *s = r + 153280; for (; s > r; s -= 640) for (i = 0; i < 40; ++i) { *r ^= *s, *s ^= *r, *r++ ^= *s++; *r ^= *s, *s ^= *r, *r++ ^= *s++; *r ^= *s, *s ^= *r, *r++ ^= *s++; *r ^= *s, *s ^= *r, *r++ ^= *s++; *r ^= *s, *s ^= *r, *r++ ^= *s++; *r ^= *s, *s ^= *r, *r++ ^= *s++; *r ^= *s, *s ^= *r, *r++ ^= *s++; *r ^= *s, *s ^= *r, *r++ ^= *s++; } NSImage *image = [[[NSImage alloc] init] autorelease]; [image addRepresentation:[[[NSBitmapImageRep alloc] initWithBitmapDataPlanes:(unsigned char **)&frameBuffer pixelsWide:640 pixelsHigh:480 bitsPerSample:8 samplesPerPixel:4 hasAlpha:YES isPlanar:NO colorSpaceName:NSDeviceRGBColorSpace bitmapFormat:0 bytesPerRow:2560 bitsPerPixel:32] autorelease]]; [self addFrame:image]; [pool drain]; }
I took a look at the code the compiler produces at -O2, and I’m fairly sure that its assembly will run parallelized, though not actually vectorized.
Yes, I’m fully aware that glReadPixels() is slower than creating a texture. I was testing my optimization skill on the low-level C stuff, not the entire routine. I only regret I didn’t have the patience to try doing it in raw SSE3 assembly, because I recognize an algorithm like this as being ideally suited to vector operations.