Sometimes it’s better to do the obvious than try to be “correct”.
How would you check if a floating-point variable x is zero? Common sense says x == 0.0 should work, right? But the compiler gets cranky about floating-point compares, even though zero is certainly a valid sentinel value for floating-point. So I found the fpclassify() function. How much slower could that be, I thought; surely something like that is a macro or an inline function.
I made an assumption. Oops.
Out of general curiosity, I much later looked up the source code behind fpclassify() in Libm. Here's the relevant fragment (reproduced inexactly here to avoid violating the APSL; see the original code in Source/Intel/xmm_misc.c in the Libm-315 project at http://opensource.apple.com if you're curious):
if (__builtin_expect(fabs(d) == 0.0, 0)) return FP_ZERO;
So I was taking the hit of two function calls, a branch prediction miss (well, only on PPC; Intel architectures don't have prediction-control opcodes, so there __builtin_expect only affects code layout), and a load/store for the return value, plus the integer compare against FP_ZERO, when I could have just done it the obvious way and saved a lot of trouble. Yes, that's assuming I don't have to worry about -0, but even if I did, which is faster: taking the hit of the fabs() function call, or taking a second branch to check for negative zero too? For reference,
fabs() on a double, implemented in copysign.s, is written in assembly as a packed compare-equal-words instruction, a packed logical quadword right shift, a packed bitwise double-precision AND, and a ret. Unless you're running 32-bit, in which case it's a floating-point load, the fabs instruction (not function!), and the ret. I tend to assume this means the SSE instructions are faster on 64-bit, and 32-bit definitely loses out on that stack load where the 64-bit code works purely on register operands. I'd also assume they use the x87 instruction on 32-bit because only on 64-bit can they be sure the 128-bit SSE instructions are present. Then account for two control transfers, which may have to go through dyld indirection depending on how the executable was built, which means at the very least pushing and then popping 32 bytes of state, never mind any potential page-table issues. It's a damn silly hit to take if I never have to worry about negative zero! I can safely guess, without even running a benchmark, that
x == 0.0 is a whole heck of a lot faster than fpclassify(x) == FP_ZERO.
I don't know, in fact, how much of this the compiler would optimize out; GCC and LLVM are both pretty good at that kind of thing. But there's no __builtin_fpclassify() in GCC 4.2! It doesn't show up until at least 4.3, possibly 4.4. I can't find it in Apple's official build of Clang either! So even if the compiler inlined __builtin_fabs() when Libm was built, I'm still taking the library-call hit for fpclassify() itself. For reference, the simple compare is optimized down to a setnp, though GCC 4.2 and LLVM/Clang use different register allocations to do it.
Anyway: the simple compare against zero beats the call to fpclassify().