Gwynne's Blog

Sa souvraya niende misain ye

Highlighting source code in WordPress with Pygments »« Clang: one savvy code analyzer!

Pointless optimization 2 Comments

So I was looking at the macro used to calculate 16-bit parity in pure C without branching:

#define parity(v)   ({ \
        uint16_t pv = (v); \
        pv ^= (uint16_t)(pv << 8); \
        pv ^= (uint16_t)(pv << 4); \
        pv ^= (uint16_t)(pv << 2); \
        pv ^= (uint16_t)(pv << 1); \
        (uint16_t)(pv & 0x0001); \
    })

It uses GCC’s handy compound statement syntax, but otherwise it’s plain old C. Let’s look at the 64-bit ASM this compiles to at -Os:

movw    0xfc(%rbp),%ax
movzbl  %ah,%ecx
xorw    %ax,%cx
movw    %cx,%ax
shrw    $0x04,%ax
xorw    %cx,%ax
movw    %ax,%cx
shrw    $0x02,%cx
xorw    %ax,%cx
movw    %cx,%ax
shrw    %ax
xorw    %cx,%ax
andb    $0x01,%al
movb    %al,0xff(%rbp)
movzbl  0xff(%rbp),%esi

Whew! That’s fifteen instructions (optimized, mind you!) to find out something that x86 processors have calculated internally for most operations all the way back to the venerable 8086: The parity flag (PF) in the processor status register.

Optimizing this is the most silly of pointless microptimizations, as I call the parity() macro once per loop in an Objective-C method dispatched through the NSObject machinery which involves floating-point division. But it nagged me. So I dug up the AMD64 instruction set manual, the x86_64 ABI specification, and GCC's docs on embedded assembly and set to work getting at that darn parity flag.

Those of you familiar with the x86 family, ancient or modern, will know that getting at a particular bit in the RFLAGS register, especially the original 16 FLAGS bits, can be a real pain. Your options, in 64-bit land, are CMOV*, LAHF, J*, SET*, and PUSHF.

Unfortunately, at the start of this I was only familiar with the 8086 modes, which meant no CMOV and no SET. Fortunately - maybe - I discovered CMOV almost immediately. This was my first successful attempt at parity calculation:

#define parity(val) ({ \
        register uint16_t v = (val); \
        register uint32_t pe = 0, pe2 = 0, one = 1; \
        asm ( "xorw $0, %[v]\n\t" \
              "cmovpel %[one], %[pe]\n\t" \
              "shrw $8, %[v]\n\t" \
              "cmovpel %[one], %[pe2]\n\t" \
              "andl %[pe2], %[pe]" : \
              [v] "+r" (v), [pe] "+r" (pe), [pe2] "+r" (pe2) : \
              [one] "r" (one) : \
              "cc" \
        ); \
        !pe; \
    });

Crying yet? Sorry, I don't do a lot of work in assembly these days. Not that GCC's inline assembly constraint syntax is in any way easy to make sense of. Keep in mind that I was inverting the sense of pe to match the semantics of GCC's __builtin_parity(), which sadly compiled down to even more instructions than the original pure C macro. Anyway, I eventually reduced that horror quite a bit, ending up with this:

#define parity(val) ({ \
        register uint16_t v asm("dx") = (val); \
        register uint32_t pe = 0, one = 1; \
        asm ( "xorb %%dh, %%dl\n\t" \
              "cmovnpl %[one], %[pe]\n" : \
              [pe] "+r" (pe) : \
              [one] "r" (one) : \
              "dx", "cc" \
        ); \
        pe; \
    });

Anyone reading this who is familiar with GCC inline asm constraints is definitely crying by now. Those extra two registers I was using to store "pe" and "one" were annoying the heck out of me, especially since "one" was a constant and the only reason I needed it is that CMOV has no immediate operand form. It finally occurred to me to keep looking at the AMD64 instruction set, and there I found the holy grail, so to speak: the SET instruction. This was the result:

#define parity(val) ({ \
        register uint16_t v = (val); \
        register uint8_t pe; \
        asm ("xorb %%dh, %%dl\n\tsetnp %0" : "=r" (pe) : "d" (v) : "cc"); \
        pe; \
    });

Notes:

  • I chose to use DX because 1, I needed a register that could be addressed as two bytes at once, which means one of the four legacy registers under AMD64, and 2, it's the least likely of those four to force the compiler to have to reallocate registers around it.

I find a deep elegance in the fact that a 16-bit word is the most efficient possible width to calculate parity for under the x86 architecture. It has the unique property of requiring only a single XOR operation to compress into a byte, which is essential since the Parity Flag is only calcuated for the lowest 8 bits of the result of an instruction that affects it. That one operation both compresses the word and calculates the parity simultaneously. To get the parity of an arbitrary byte would require the effective no-op XOR $0, %dl to set the Parity Flag. It certainly made me happy that I was able to reduce the gobbletygook of GCC asm constraints to three simple operands: writes to the "pe" variable, reads from "v" which must be placed in DX, and clobbers the condition code register.

I have no doubt that it will be quickly explained to me by someone that the mess of mask words GCC's __builtin_parity() uses is somehow more cache-efficient, or more streamlined for a SMP architecture, etc. etc., but as I think is obvious by now, I'm a programmer from the old school of assembly, where every instruction counts. Winding 15 largely redundant instructions down to 2 makes me feel good inside.

, , , , , ,
January 25, 2010 at 12:45 pm
2 comments »
  • June 10, 2010 at 4:50 pmMaloeran

    I’m pretty sure __builtin_parity() uses the parity flag as you did.

    The difference would be that __builtin_parity() doesn’t know that you only care about the 16 lower bits… so it’s going to do some extra shifting and xoring. If you only care about the lower 16 bits, your solution is certainly more efficient!

    In any case, interesting post :).

  • June 12, 2010 at 4:31 pmGwynne Raskind

    Yep, that was what it did when I tested with GCC. I was too lazy to try LLVM’s version :).

Leave a Reply

Powered by WP Hashcash