Code alignment on x86

Tags: Lisp, Blog

2009-03-09 -- The SBCL boinkmarks results have always wiggled a lot. It's easy to chalk this up to system load, but the same can be observed when running the cl-bench benchmarks under more controlled circumstances. Part of the reason is that some tests run too few iterations: measurement accuracy is really bad when the run time is below 0.2s, and it is abysmal when there is other activity on the system, which is easy to tell even in retrospect by comparing the real and user time columns.
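The two sanity checks above (run time long enough, real time close to user time) are easy to automate. Here is a minimal sketch in Python. The function name and the 5% slack are made up for illustration; only the 0.2s floor comes from the text. It also ignores system time and multi-threaded runs, where user time can exceed real time.

```python
def measurement_is_trustworthy(real, user, min_run_time=0.2, max_slack=0.05):
    """Return True if a single timing (in seconds) looks usable.

    min_run_time: the 0.2s floor mentioned in the text.
    max_slack: assumed tolerance, the fraction of real time that may be
               unaccounted for by user time before the run is called noisy.
    """
    if user < min_run_time:
        # Too short: timer resolution and startup costs dominate.
        return False
    # A large gap between real and user time suggests other activity
    # on the machine during the run.
    return (real - user) <= max_slack * real
```

With these assumed thresholds, a run with real=2.10 and user=2.08 passes, while real=3.0 and user=2.0 is rejected as disturbed.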

But that's not the end of the story. Take FPRINT/PRETTY, for instance: it runs for more than two seconds, yet its results often swing by up to 7% after seemingly unrelated changes. People have fingered alignment as a likely cause.

Recently this issue has become more pressing as I've been trying to reduce the overhead of x86's pseudo-atomic. Unfortunately, the effect is smallish, which makes measurement difficult, so I tried to cut down on the noise by aligning loops on 16-byte boundaries. This being x86, that meant aligning code similarly first (it's 8-byte aligned currently).
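To see why the code itself has to be 16-byte aligned before loop alignment means anything, here is a small arithmetic sketch in Python. The helper names and the addresses are hypothetical and have nothing to do with SBCL's internals; the point is only that a loop at a 16-byte-aligned offset inside a block of code that starts on an 8-byte boundary can still land in the middle of a 16-byte block.

```python
def align_up(addr, alignment=16):
    """Round addr up to the next multiple of alignment (a power of two)."""
    assert alignment > 0 and alignment & (alignment - 1) == 0
    return (addr + alignment - 1) & ~(alignment - 1)

def padding_for(addr, alignment=16):
    """Number of filler bytes needed to bring addr up to the alignment."""
    return align_up(addr, alignment) - addr

def loop_address(code_base, loop_offset):
    """Absolute address of a loop that sits loop_offset bytes into the code."""
    return code_base + loop_offset

# Code that is only 8-byte aligned: the loop offset is a multiple of 16,
# but the loop's absolute address is not.
misaligned = loop_address(0x1008, 0x20) % 16

# Once the code starts on a 16-byte boundary, the offset alignment carries over.
aligned = loop_address(align_up(0x1008), 0x20) % 16
```

Here misaligned comes out as 8 and aligned as 0, and padding_for(0x1008) is the 8 bytes it costs to get there.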

The change itself (from the allocate-code-object branch of my git tree) is rather trivial; its effects are not. Depending on the microarchitecture, some x86 CPUs like alignment, while the rest should really not care much. In practice, other factors come into play, and we can only guess at them. It certainly seems that the Core Duo (and likely the Pentium M) is so deeply unnerved by a jump instruction near the end of a 16-byte block that it cannot execute the loop at its normal speed.

This led to an experiment in which the compiler was modified to pad innermost loops with a few preceding NOPs, so that their ends either stay at least 3 bytes from the end of the block or spill over it by at least one byte. However, a quick and dirty implementation of this shows no discernible improvement. It may be that in practice even the tightest loops are already longer than 16 bytes ...
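The padding rule can be sketched as follows, in Python rather than in the compiler. This is one possible reading of the rule, not the code from the branch: the loop's end address is taken to be the address one past its last byte, and the padding is assumed not to change the loop's own length (for example through a different jump encoding). All names are hypothetical.

```python
BLOCK = 16   # size of the block discussed in the text, in bytes
MARGIN = 3   # minimum distance the loop end keeps from the block end

def loop_end_is_safe(end_addr):
    """True if the loop end is at least MARGIN bytes short of the block end,
    or has spilled at least one byte into the next block."""
    offset = end_addr % BLOCK
    # offset == 0: the loop ends exactly on the block boundary.
    # offset > BLOCK - MARGIN: fewer than MARGIN bytes remain in the block.
    return 1 <= offset <= BLOCK - MARGIN

def nops_before_loop(end_addr):
    """Smallest number of one-byte NOPs to put in front of the loop so that
    its shifted end address becomes safe."""
    for n in range(BLOCK):
        if loop_end_is_safe(end_addr + n):
            return n
    raise AssertionError("unreachable: some shift below BLOCK is always safe")
```

Under this reading only end offsets 14, 15 and 0 within a block are bad, so at most three NOPs are ever needed, which matches the "few preceding NOPs" above.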

For now, here are the cl-bench results from a 32-bit binary on an Opteron, a PIII, a P4, and a Core Duo system.

There may be a slight improvement, but its magnitude is pretty small compared to the noise. I'm declaring the evidence inconclusive and letting the commit stay out of the official SBCL tree.