Calling Convention Hacks
2009-04-19 -- SBCL's calling convention is rather peculiar. Frames are allocated and mostly set up by the caller. The new frame starts with a pointer to the old frame, then comes the return address, an empty slot and the stack arguments (the first three are passed in registers on x86).
Software archeology aside, the only reason I can see for this scheme is that stack arguments are easier to manipulate when they come after the return address and old frame pointer part. In particular, tail calls with any number of arguments can be made without re[al]locating the frame.
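To make the layout concrete, here is a toy Python model of the frame as described above. It is only a sketch: the slot names and the helpers `build_frame` and `tail_call` are made up for illustration and are not SBCL identifiers. It shows why a tail call can pass a different number of stack arguments while leaving the old fp and return address slots alone.

```python
# Toy model of the caller-built frame described in the post.
# Slot names and helper functions are hypothetical, not SBCL identifiers.

REGISTER_ARG_COUNT = 3  # per the post: the first three arguments go in registers on x86
HEADER = ["old-fp", "return-address", "empty"]  # slot order from the start of the frame


def build_frame(args):
    """Return (register_args, frame); frame lists the stack slots in order."""
    args = list(args)
    register_args = args[:REGISTER_ARG_COUNT]
    stack_args = args[REGISTER_ARG_COUNT:]
    return register_args, HEADER + stack_args


def tail_call(frame, new_args):
    """Reuse the frame for a tail call: keep the header, rewrite only the arguments."""
    new_args = list(new_args)
    register_args = new_args[:REGISTER_ARG_COUNT]
    stack_args = new_args[REGISTER_ARG_COUNT:]
    header = frame[:len(HEADER)]  # old fp and return address stay where they are
    return register_args, header + stack_args


regs, frame = build_frame(["a", "b", "c", "d"])
regs2, frame2 = tail_call(frame, ["p", "q", "r", "s", "t", "u"])
```

Because the arguments sit at the tail end of the frame, growing from one stack argument to three only touches the slots after the header.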
The first step towards callee allocated frames is swapping the return address and old fp slots. Asking an innocent question on #lisp accomplished most of the work, as Alastair Bridgewater had a patch for x86 against a 0.9ish version that does exactly this.
Forward porting it to current SBCL was a breeze. Relatively speaking, of course, because debugging cold init failures is never pleasant. He also had another patch that biases the frame pointer to point to the old fp slot instead of just before the return address. This has the benefit of making Lisp and foreign frame layouts the same, which makes backtraces more reliable and allows external debugging tools to recognize all frames.
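A rough sketch of why the shared layout helps a debugger: in the usual x86 C frame the frame pointer points at the saved frame pointer and the return address sits one word above it. Once Lisp frames look the same, a single loop can walk every frame. The Python below models memory as a dictionary; `push_frame` and `backtrace` are hypothetical helpers for this illustration, not part of SBCL or any debugger.

```python
# Toy frame walker. Assumes 32-bit x86, a downward-growing stack, and the
# layout [fp] = old fp, [fp + WORD] = return address.
# push_frame and backtrace are hypothetical names used only for this sketch.

WORD = 4  # bytes per stack slot on 32-bit x86


def push_frame(memory, sp, fp, return_address):
    """Push the return address, then the old fp; return (new_sp, new_fp)."""
    sp -= WORD
    memory[sp] = return_address
    sp -= WORD
    memory[sp] = fp  # old fp slot
    return sp, sp  # the frame pointer now points at the old fp slot


def backtrace(memory, fp):
    """Follow the old fp chain and collect return addresses, innermost first."""
    return_addresses = []
    while fp != 0:  # 0 marks the outermost frame in this model
        return_addresses.append(memory[fp + WORD])
        fp = memory[fp]
    return return_addresses


memory = {}
sp, fp = 0x1000, 0
sp, fp = push_frame(memory, sp, fp, 0xA0)  # could be a foreign frame
sp, fp = push_frame(memory, sp, fp, 0xB0)  # could be a Lisp frame
sp -= 2 * WORD                             # room for some locals
sp, fp = push_frame(memory, sp, fp, 0xC0)
trace = backtrace(memory, fp)
```

The walker never needs to know which frames belong to Lisp code and which to foreign code, which is the point of biasing the frame pointer.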
Callee allocated frames are still quite some way off, but while in the area I sought a bit of optimization fun. With the return address before old fp it is now possible to return with the idiomatic POP EBP, RET sequence. Well, most of the time: when more multiple values are returned than there are argument passing registers, they are placed on the stack exactly where the arguments normally reside. Obviously, in this case the frame cannot be dismantled.
Strangely, turning these JMPs into RETs in multiple value return has no measurable effect on performance even though it results in more paired CALLs. What about the other way, addressing unpaired RETs by turning JMPs to local call'ed functions into CALLs? I tried a quick hack that CALLs a small trampoline that sets up the return pc slot and JMPs to the target. With this small change a number of benchmarks in the cl-bench suite benefit greatly: TAK, FIB, FIB-RATIO, DERIV, DIV2-TEST-2, TRAVERSE, TRIANGLE gain about 25-50%. See results for P4 and 64 bit Opteron.
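One plausible reading of why pairing matters (this explanation is an assumption, the measurements above do not prove it): x86 processors typically predict the target of a RET from a small hardware stack that is pushed by CALL and popped by RET. A RET that was reached through a JMP has no matching entry and is likely to be mispredicted. The Python sketch below simulates such a predictor on a recursive FIB; the class, its depth of 16 and the call site numbers are all invented for illustration and do not model any specific CPU.

```python
# Toy return-address predictor: CALL pushes, RET pops and compares.
# ReturnPredictor, its depth and the call-site ids are assumptions of this sketch.


class ReturnPredictor:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []
        self.hits = 0
        self.misses = 0

    def call(self, return_address):
        """A CALL instruction records where the matching RET should go."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # the oldest entry falls off a full buffer
        self.stack.append(return_address)

    def ret(self, actual_address):
        """A RET instruction pops a prediction and checks it against reality."""
        predicted = self.stack.pop() if self.stack else None
        if predicted == actual_address:
            self.hits += 1
        else:
            self.misses += 1


def fib(n, predictor, use_call, return_address=0):
    """Recursive FIB; each invocation is entered by CALL or by JMP and left by RET."""
    if use_call:
        predictor.call(return_address)  # CALL: the predictor sees the return address
    # With a JMP, the return pc is stored by hand and the predictor sees nothing.
    result = n if n < 2 else (fib(n - 1, predictor, use_call, 1)
                              + fib(n - 2, predictor, use_call, 2))
    predictor.ret(return_address)
    return result


paired = ReturnPredictor()
unpaired = ReturnPredictor()
fib(10, paired, use_call=True)
fib(10, unpaired, use_call=False)
```

In this model every RET in the paired run is predicted and every RET in the unpaired run is not. Real hardware is more forgiving and more complicated, so treat this as intuition for the direction of the effect, not its size.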
This should take another chunk off the proposed and already partly done Summer of Code project for improving the x86 and x86-64 calling convention, although a nicer solution may be possible. As to the future, it is unclear to me how callee allocated frames would pan out. Code for the current batch of changes is here.