Calling Convention Hacks
Tags: lisp, Date: 2009-04-19
SBCL's calling convention is rather peculiar. Frames are allocated and mostly set up by the caller. The new frame starts with a pointer to the old frame, then comes the return address, an empty slot and the stack arguments (the first three are passed in registers on x86).
Software archeology aside, the only reason I can see for this scheme is that stack arguments are easier to manipulate when they come after the old frame pointer and the return address. In particular, tail calls with any number of arguments can be made without re[al]locating the frame.
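To make the layout concrete, here is a toy Python model of the frame as described above. It is only a sketch: the names build_frame, tail_call_in_place, HEADER_SLOTS and N_REGISTER_ARGS are made up for illustration, a Python list stands in for stack memory, and nothing here is taken from the SBCL sources.

```python
N_REGISTER_ARGS = 3  # per the text: the first three arguments go in registers on x86
HEADER_SLOTS = 3     # old frame pointer, return address, empty slot


def build_frame(old_fp, return_pc, args):
    """Caller-side setup: return (register_args, frame_slots).

    frame_slots follows the order given in the text:
    old fp, return address, an empty slot, then the stack arguments.
    """
    register_args = list(args[:N_REGISTER_ARGS])
    stack_args = list(args[N_REGISTER_ARGS:])
    frame_slots = [old_fp, return_pc, None] + stack_args
    return register_args, frame_slots


def tail_call_in_place(frame_slots, new_args):
    """Reuse an existing frame for a tail call with any number of arguments.

    Only the slots after the header are rewritten; the old fp and the
    return address stay where they are.
    """
    register_args = list(new_args[:N_REGISTER_ARGS])
    frame_slots[HEADER_SLOTS:] = list(new_args[N_REGISTER_ARGS:])
    return register_args


regs, frame = build_frame(old_fp=0x1000, return_pc=0x4242, args=[1, 2, 3, 4, 5])
tail_regs = tail_call_in_place(frame, [10, 20, 30, 40, 50, 60, 70])
```

Because the arguments sit at the end of the frame, the tail call can grow or shrink that part freely while the two header words keep their place.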
The first step towards callee allocated frames is swapping the return address and old fp slots. Asking an innocent question on #lisp accomplished most of the work as Alastair Bridgewater had a patch for x86 against a 0.9ish version that does exactly this.
Forward porting it to current SBCL was a breeze. Relatively speaking, of course, because debugging cold init failures is never pleasant. He also had another patch that biases the frame pointer to point to the old fp slot instead of just before the return address. This has the benefit of making Lisp and foreign frame layouts the same, which makes backtraces more reliable and allows external debugging tools to recognize all frames.
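The payoff of a uniform layout is that one simple walker can follow every frame. Below is a minimal sketch of such a walker in Python, assuming the conventional 32-bit x86 arrangement where the word at the frame pointer holds the caller's frame pointer and the next word up holds the return address. The dictionary standing in for memory, the addresses, and the name walk_frames are all invented for the example.

```python
WORD = 4  # bytes per word on 32-bit x86


def walk_frames(memory, fp, max_frames=64):
    """Collect return addresses by following the chain of saved frame pointers.

    Assumes memory[fp] is the caller's frame pointer and memory[fp + WORD]
    is the return address, for Lisp and foreign frames alike.
    A saved frame pointer of 0 marks the outermost frame.
    """
    return_addresses = []
    while fp != 0 and len(return_addresses) < max_frames:
        return_addresses.append(memory[fp + WORD])
        fp = memory[fp]
    return return_addresses


# Three nested frames; the stack grows downward, so the innermost frame
# has the lowest address.
memory = {
    0x0E00: 0x0F00, 0x0E04: 0xAAA,  # innermost frame
    0x0F00: 0x1000, 0x0F04: 0xBBB,
    0x1000: 0x0000, 0x1004: 0xCCC,  # outermost frame
}
backtrace = walk_frames(memory, 0x0E00)
```

A tool that only knows this one rule never has to ask whether a given frame belongs to Lisp or to foreign code.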
Callee allocated frames are still quite some way off, but while in the area, I sought a bit of optimization fun. With the return address before old fp, it is now possible to return with the idiomatic POP EBP, RET sequence. Well, most of the time: when more multiple values are returned than there are argument passing registers, they are placed on the stack exactly where the arguments normally reside. Obviously, in this case the frame cannot be dismantled.
Strangely, turning these JMPs into RETs in multiple value return has no measurable effect on performance even though it results in more paired CALLs. What about the other way: addressing unpaired RETs by turning JMPs to local call'ed functions into CALLs? I tried a quick hack that CALLs a small trampoline that sets up the return pc slot and JMPs to the target. With this small change a number of benchmarks in the cl-bench suite benefit greatly: TAK, FIB, FIB-RATIO, DERIV, DIV2-TEST-2, TRAVERSE, TRIANGLE gain about 25-50%. See results for P4 and 64 bit Opteron.
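Why would pairing matter? A likely explanation, stated here as an assumption rather than something measured above, is that x86 processors predict the target of a RET from a small internal stack of return addresses that CALL pushes and RET pops. A function entered with JMP but left with RET finds nothing useful there. The toy Python simulation below illustrates the effect on a naive recursive fib; the predictor depth of 16 and all names are hypothetical, and it counts predictions only, not cycles.

```python
class ReturnStackPredictor:
    """Toy model of a hardware return-address stack."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []
        self.hits = 0
        self.misses = 0

    def call(self, return_addr):
        # CALL pushes its return address; the oldest entry is dropped when full.
        if len(self.stack) == self.depth:
            self.stack.pop(0)
        self.stack.append(return_addr)

    def ret(self, actual_target):
        # RET pops a prediction and compares it with where it really goes.
        predicted = self.stack.pop() if self.stack else None
        if predicted == actual_target:
            self.hits += 1
        else:
            self.misses += 1


def trace_fib(n, enter_with_call):
    """Replay the calls and returns of naive recursive fib through the predictor.

    enter_with_call=True models entering functions with CALL (for instance
    through a trampoline); False models entering them with JMP.
    Every function returns with RET in both cases.
    """
    predictor = ReturnStackPredictor()

    def fib(k, return_addr):
        if enter_with_call:
            predictor.call(return_addr)
        if k >= 2:
            fib(k - 1, "after-first-call")
            fib(k - 2, "after-second-call")
        predictor.ret(return_addr)

    fib(n, "toplevel")
    return predictor.hits, predictor.misses


paired_hits, paired_misses = trace_fib(10, enter_with_call=True)
unpaired_hits, unpaired_misses = trace_fib(10, enter_with_call=False)
print("CALL entry:", paired_hits, "hits,", paired_misses, "misses")
print("JMP entry: ", unpaired_hits, "hits,", unpaired_misses, "misses")
```

In this model every return is predicted correctly with CALL entry and none with JMP entry, which would fit call-heavy benchmarks such as TAK and FIB benefiting the most.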
This should take another chunk off the proposed and already partly done Summer of Code project for improving the x86 and x86-64 calling convention, although a nicer solution may be possible. As to the future, it is unclear to me how callee allocated frames would pan out. Code for the current batch of changes is here.