Preface: I’m back (again) [shades of Men In Black 2, sort of]. If you don’t understand what that means, don’t worry about it.
In the Mac OS X ABI Function Call Guide there is an innocent little sentence: "The stack is 16-byte aligned at the point of function calls." We’ve not been able to find out why this is required for the IA-32 environment, but they really mean it, and there are deep implications.
OS X is, under the covers, your basic *nix system, making heavy use of shared libraries. When your code references a function in a shared library, a little call stub is constructed by the linker, and the loader fixes up this stub to point to a loader helper function which will perform a lazy bind to the function. Take a look at the first instructions of that helper function:
0x8fe18be0 <__dyld_fast_stub_binding_helper_interface+0>: push 0x0
0x8fe18be2 <__dyld_stub_binding_helper_interface+0>: sub esp,0x64
0x8fe18be5 <__dyld_stub_binding_helper_interface+3>: mov DWORD PTR [esp+0x54],eax
0x8fe18be9 <__dyld_stub_binding_helper_interface+7>: mov eax,DWORD PTR [esp+0x68]
0x8fe18bed <__dyld_stub_binding_helper_interface+11>: mov DWORD PTR [esp+0x60],eax
0x8fe18bf1 <__dyld_stub_binding_helper_interface+15>: mov DWORD PTR [esp+0x68],ebp
0x8fe18bf5 <__dyld_stub_binding_helper_interface+19>: mov ebp,esp
0x8fe18bf7 <__dyld_stub_binding_helper_interface+21>: add ebp,0x68
0x8fe18bfa <__dyld_stub_binding_helper_interface+24>: mov DWORD PTR [esp+0x58],ecx
0x8fe18bfe <__dyld_stub_binding_helper_interface+28>: mov DWORD PTR [esp+0x5c],edx
0x8fe18c02 <__dyld_misaligned_stack_error+0>: movdqa XMMWORD PTR [esp+0x10],xmm0
0x8fe18c08 <__dyld_misaligned_stack_error+6>: movdqa XMMWORD PTR [esp+0x20],xmm1
0x8fe18c0e <__dyld_misaligned_stack_error+12>: movdqa XMMWORD PTR [esp+0x30],xmm2
0x8fe18c14 <__dyld_misaligned_stack_error+18>: movdqa XMMWORD PTR [esp+0x40],xmm3
Note the last four instructions. If you backtrack to the first couple of instructions, you’ll see that ESP gets tweaked by a total of 0x68 bytes. Since 0x68 is 8 more than a multiple of 16, ESP has to sit exactly 8 bytes past a 16-byte boundary (ESP mod 16 = 8) on entry to this helper. If it doesn’t, the four MOVDQA instructions above, which fault on operands that aren’t 16-byte aligned, will definitely kill you. The symbolic name that GDB reports for these instructions makes me infer that this is a gate-keeper function intended solely to ensure that the ABI stack alignment requirement is met. If you wonder why the offset on entry has to be 8, recall that we’ve stepped through a linker-constructed thunk. Back off the 4 bytes the thunk put on the stack, and what remains is the 4-byte return address into the user code. Back off that return address, and you are down to the 16-byte alignment that the ABI requires at the point of the function call.
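The arithmetic above is easy to get wrong by 4, so here is a minimal C sketch of it. It follows the accounting used in this post (one 4-byte return address from the user's call, 4 more bytes from the thunk, then the helper's 0x68-byte adjustment). The function names and the sample ESP values are made up for illustration; this is a model of the numbers, not dyld's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    HELPER_ADJUST  = 0x68, /* push 0x0 (4 bytes) + sub esp,0x64 */
    FIRST_XMM_SLOT = 0x10  /* offset used by the first movdqa   */
};

/* ESP on entry to the helper, given ESP at the call instruction in
   user code. Assumption: the call pushes a 4-byte return address and
   the thunk accounts for another 4 bytes, as described in the text. */
static uint32_t helper_entry_esp(uint32_t esp_at_call)
{
    assert(esp_at_call >= 8u); /* guard against wrapping below zero */
    return esp_at_call - 4u /* return address */ - 4u /* thunk */;
}

/* True if the helper's movdqa stores would land on 16-byte-aligned
   addresses, i.e. the gate-keeper would let the call through. */
static bool movdqa_is_aligned(uint32_t esp_at_call)
{
    uint32_t esp = helper_entry_esp(esp_at_call) - (uint32_t)HELPER_ADJUST;
    return ((esp + (uint32_t)FIRST_XMM_SLOT) & 0xFu) == 0u;
}
```

With a hypothetical ESP of 0xBFFFF000 at the call, the helper is entered at 0xBFFFEFF8 and the stores are aligned; start from 0xBFFFF004 or 0xBFFFF008 instead and they are not.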
This little gate-keeper is draconian since from a practical standpoint it means we have to maintain the 16-byte stack alignment in pretty much all of our code. The only time you can avoid keeping the stack aligned is if you can guarantee that the call tree you are dealing with will never escape your local unit (in the case of Pascal code). Why? Because at compile time, you cannot guarantee that any particular unit you are making a reference to will not be in a package, and hence reached via the dynamic loader.
For those of you thinking about Mac OS X, this means that you should do some planning. If you have assembly code which is not leaf code, you should inspect it very carefully to see that none of the call trees can escape the unit in which the code lives. Remember that the compiler will use helper functions in native Pascal code for a lot of things, like reference counting interfaces, copying strings, etc. Those helpers live in the System unit, and if you are linking against packages, you’ll go through the dynamic loader to get to them. Even if you don’t link against packages, some helpers might drop to the O/S for memory allocation, which will go through the dynamic loader. So if your assembly code calls what looks like a leaf function in the same unit, where that function is implemented in Pascal, you have to make sure that no helpers are used, if you don’t maintain stack alignment in your assembly code. Your alternative is to go through the assembly code, and manually align the stack prior to each call. I can tell you that I did that for a bunch of code, and it made my teeth ache.
Another option to consider is to just keep Pascal versions of your assembly coded routines, and use those, at least initially, on OS X. There are plenty of good reasons to keep around high level versions of these routines anyway. I’m very fond of this option, personally.
It’s somewhat interesting to look at what gcc does for stack frame construction on OS X. They always build a full EBP frame, and adjust the stack by the largest amount of local storage required up front in the prolog. The stack is aligned once at that point, and from then on, no PUSH instructions are used. I believe this is more efficient in the long haul, but it requires very different management of function return results when building parameter lists where there are nested function calls in the parameter lists.
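To make the "align once in the prolog" idea concrete, here is a rough C sketch of how the size of that single stack adjustment could be chosen. It assumes the only thing pushed before the adjustment is EBP, and it ignores any callee-saved registers a real compiler might also push. The names `frame_reserve` and `esp_after_prolog` are hypothetical, and this is not claimed to be gcc's actual algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest N >= (locals + outgoing argument area) such that
   "push ebp; mov ebp,esp; sub esp,N" leaves ESP 16-byte aligned.
   On entry ESP mod 16 is 12; after push ebp it is 8; so N mod 16
   must be 8. */
static uint32_t frame_reserve(uint32_t local_bytes,
                              uint32_t max_outgoing_arg_bytes)
{
    uint32_t needed = local_bytes + max_outgoing_arg_bytes;
    return ((needed + 7u) & ~0xFu) + 8u;
}

/* ESP once the prolog has run, for a given ESP at function entry. */
static uint32_t esp_after_prolog(uint32_t entry_esp,
                                 uint32_t local_bytes,
                                 uint32_t max_outgoing_arg_bytes)
{
    assert((entry_esp & 0xFu) == 0xCu); /* ABI: aligned minus return address */
    uint32_t esp = entry_esp - 4u;      /* push ebp */
    return esp - frame_reserve(local_bytes, max_outgoing_arg_bytes);
}
```

For example, 4 bytes of locals plus room for three dword arguments needs 16 bytes, which this sketch rounds up to a reservation of 24 (0x18). Because ESP never moves again after the prolog, every call in the function body then sees an aligned stack.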
So to recapitulate, on entry to your functions, you will find the stack aligned to 16 bytes, minus the 4-byte return address that the CALL pushed. In other words, ESP will always be 0xnnnnnnnC. If you want to call a function in another unit, you have to ensure the stack is aligned at the point of the call instruction. Here are some examples:
procedure myAsmFunc;
asm
  // ESP will be 0xnnnnnnnC

  // call procedure A(a: integer; b: integer; c: integer); cdecl;
  push 0
  push 1
  push 2
  call A // OK, because ESP is now 0xnnnnnnnC - 12
  add esp, 12

  // call procedure B(a: integer; b: integer); cdecl;
  push 0
  push 1
  call B // _NOT_ OK, because ESP is now 0xnnnnnnnC - 8
  add esp, 8

  // call procedure B(a: integer; b: integer); cdecl;
  push ecx // add a dword to make the alignment come out
  push 0
  push 1
  call B // that's better, because ESP is now 0xnnnnnnnC - 12
  add esp, 12
end;
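The padding in the third example generalizes. Here is a small C sketch of the rule, assuming the function was entered with ESP at 0xnnnnnnnC and that the only things on the stack since entry are dwords you pushed yourself. The names `pad_dwords` and `esp_at_call` are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Extra dwords to push so that ESP is 16-byte aligned at the call.
   pushed_dwords counts every dword pushed since function entry,
   arguments included. Starting from ...C, the total pushed must be
   3 mod 4 dwords (12 mod 16 bytes). */
static unsigned pad_dwords(unsigned pushed_dwords)
{
    return 3u - (pushed_dwords % 4u);
}

/* ESP at the call instruction, for a given ESP at function entry. */
static uint32_t esp_at_call(uint32_t entry_esp,
                            unsigned arg_dwords, unsigned pad)
{
    assert((entry_esp & 0xFu) == 0xCu); /* aligned minus return address */
    return entry_esp - 4u * (arg_dwords + pad);
}
```

Three arguments need no padding, two arguments need one extra dword (the `push ecx` above), and a call with no arguments at all needs three. Remember to add the padding to the `add esp` cleanup after the call, as the third example does.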
If you write pure Pascal code, then the compiler will take care of all the alignment work for you. Once you’ve done the hand alignment tweaking on existing, beautifully tuned IA-32 assembly code you may find your opinions shifting around on you with respect to how to approach the problem. Best of luck to you.
Oh, one more thing: you have to be careful with your testing of the manual alignment solution. Just because it runs doesn’t mean you got it right. Once you make the first call to the dynamic loader function, the OS will back-patch the thunk with the actual function address that you wanted to bind to. So if some other bit of code calls that thunk before your assembly code calls it, the gate-keeper code may no longer be on duty. It’s best if you single step the code, and verify that (ESP & 0xF) == 0 at all call instructions.
Posted by Eli Boling on May 20th, 2009 under C_Builder, Delphi, OS X