Sunday, April 19 2009 @ 00:00 +0200
SBCL's calling convention is rather peculiar. Frames are allocated and mostly set up by the caller. The new frame starts with a pointer to the old frame, then comes the return address, an empty slot and the stack arguments (the first three are passed in registers on x86).
Software archeology aside, the only reason I can see for this scheme is that stack arguments are easier to manipulate when they come after the return address and old frame pointer. In particular, tail calls with any number of arguments can be made without re[al]locating the frame.
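To make the layout and the tail call argument concrete, here is a minimal Python sketch. It models a frame as a list of slots in the order described above. The slot names and the helpers `make_frame` and `tail_call` are invented for illustration and are not SBCL identifiers; real frames are of course raw stack memory, not lists.

```python
REGISTER_ARGS = 3  # on x86 the first three arguments are passed in registers
HEADER = ["old-fp", "return-address", "empty"]  # fixed part at the start of a frame

def make_frame(args):
    """Model a caller-allocated frame: header first, then the stack arguments."""
    return HEADER + list(args[REGISTER_ARGS:])

def tail_call(frame, args):
    """Reuse a frame for a tail call: only the slots after the header change."""
    return frame[:len(HEADER)] + list(args[REGISTER_ARGS:])

frame = make_frame(["a", "b", "c", "d"])                    # one stack argument
reused = tail_call(frame, ["p", "q", "r", "s", "t", "u"])   # three stack arguments
```

Because the header sits in front of the arguments, the old fp and return address keep their positions no matter how many stack arguments the tail-called function takes.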
The first step towards callee allocated frames is swapping the return address and old fp slots. Asking an innocent question on #lisp accomplished most of the work as Alastair Bridgewater had a patch for x86 against a 0.9ish version that does exactly this.
Forward porting it to current SBCL was a breeze. Relatively speaking, of course, because debugging cold init failures is never pleasant. He also had another patch that biases the frame pointer to point to the old fp slot instead of just before the return address. This has the benefit of making Lisp and foreign frame layouts the same, which makes backtraces more reliable and allows external debugging tools to recognize all frames.
Callee allocated frames are still quite some way off, but while in the
area I sought a bit of optimization fun. With the return address
before old fp it is now possible to return with the idiomatic
RET sequence. Well, most of the time: when more values are returned
than there are argument passing registers, they are placed on the
stack exactly where the arguments normally reside. Obviously, in this
case the frame cannot be dismantled.
Strangely, turning these JMPs into RETs in multiple value return has
no measurable effect on performance even though it results in more
paired CALLs. What about the other way, addressing unpaired RETs by
turning JMPs to locally called functions into CALLs? I tried a quick
hack that CALLs a small trampoline that sets up the return pc slot and
JMPs to the target. With this small change a number of benchmarks in
the cl-bench suite benefit greatly:
TRIANGLE gains about 25-50%. See results for P4 and 64-bit Opteron.
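Why pairing matters is not spelled out here. The usual explanation, which I am assuming applies, is that these CPUs predict the target of a RET from a small hardware stack that only CALL pushes onto. The Python toy model below replays a trace of events and counts RETs whose predicted target differs from the real one. The function names, the event encoding and the depth of 16 are made up for illustration, and a real CPU has other predictors to fall back on.

```python
from collections import deque

def mispredicted_returns(events, depth=16):
    """Count RETs whose predicted target differs from the real return address."""
    actual = []                      # real return addresses, as stored in frames
    predictor = deque(maxlen=depth)  # hardware return stack; oldest entries fall off
    misses = 0
    for kind, addr in events:
        if kind == "call":           # CALL: pushes on both the real and predictor stack
            actual.append(addr)
            predictor.append(addr)
        elif kind == "jmp":          # return pc stored by hand, then JMP: predictor unaware
            actual.append(addr)
        elif kind == "ret":
            target = actual.pop()
            predicted = predictor.pop() if predictor else None
            if predicted != target:
                misses += 1
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return misses

def nested_calls(n, inner_kind):
    """One outer CALL, then n nested calls made with inner_kind, then all the returns."""
    events = [("call", 0)]
    events += [(inner_kind, i) for i in range(1, n + 1)]
    events += [("ret", None)] * (n + 1)
    return events

paired = mispredicted_returns(nested_calls(4, "call"))
unpaired = mispredicted_returns(nested_calls(4, "jmp"))
```

In this model every return is predicted when the calls are CALLs. With JMPs, not only do the inner returns miss, they also consume the predictor entry that belonged to the outer CALL, so its RET misses too.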
This should take another chunk off the proposed and already partly done Summer of Code project for improving the x86 and x86-64 calling convention, although a nicer solution may be possible. As to the future, it is unclear to me how callee allocated frames would pan out. Code for the current batch of changes is here.