Mac OS X shared library initialization
I was part of the team that was passed through the Kylix threshing machine originally, so I decided to do a little research into shared library initialization and termination early on. Some very difficult to debug things can happen to you if you get surprised by library load/unload sequencing on a platform.
When we did the original linux work, we started out with the assumption that the loader worked the same as Windows with respect to initialization order. We assumed dependencies were initialized first, in a depth first ordering. I don’t remember anymore what the default ordering was, but it wasn’t right. We found out that the tools are responsible for supplying the shared object initialization order in a special section in the executable. I didn’t want to have a similar experience late in the development cycle for the Mac.
The first thing to do was to find out how shared objects (e.g. .dylib files) specify initialization procedures. If you look at the Mach-O spec, or in mach-o/loader.h, you’ll find the LC_ROUTINES load command, which seems just right for the job. Comments in the header say this is used for C++ static constructors. Excellent! A little further reading shows the ld option -init allows you to specify the startup routine manually, and directs you to the man page for ld. OK, off we go, and there we see only one sentence that includes a clause stating that this is used rarely. Red flag. And there the documentation trail dies. Red flag. OK, let’s go look at a C++ static constructor example built with g++. Hmm, no LC_ROUTINES. Red flag. OK, let’s look for LC_ROUTINES in some of the .dylibs in the shipping OS. Hmm, no LC_ROUTINES. Red flag. So, how do C++ static constructors get called? Looking around a bit (see otool -lv), I see sections like __mod_init_func, with the S_MOD_INIT_FUNC_POINTERS flag set. Good luck finding documentation on the S_MOD_INIT_FUNC_POINTERS flag (yeah, ok, red flag).
So, here’s where we are saved a ton of time by the fact that the OS is open source. I have the dyld source handy, and a little grepping and reading gives a pretty clear picture of how things really work. LC_ROUTINES is only suitable for calling an initialization routine. LC_ROUTINES does not support a termination notification. So now LC_ROUTINES is a porcupine of red flags, and I wouldn’t touch it with a ten foot pole. S_MOD_INIT_FUNC_POINTERS and S_MOD_TERM_FUNC_POINTERS are the way to go. These indicate sections that are arrays of pointers to functions that the OS calls for both the executable and for shared objects, if it finds them. Each function in the S_MOD_INIT_FUNC_POINTERS section is called with a mess of parameters, which I’ll describe later. Each function in the S_MOD_TERM_FUNC_POINTERS section is called with no parameters at all. The functions in the S_MOD_TERM_FUNC_POINTERS array are called in reverse order. So for init, it’s init[0], init[1], …, init[n], and for term it’s term[n], term[n-1], …, term[0]. Without source, this would have been a long slow learning experience.
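To make the calling order concrete, here is a minimal sketch in plain C. It is a model of the ordering only, not the loader's code: the names run_initializers, run_terminators, and demo_call_order are made up for illustration, and the hooks take no parameters to keep things short.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*hook_fn)(void);

/* Initializers run in ascending order: init[0], init[1], ..., init[n-1]. */
void run_initializers(const hook_fn *init, size_t count)
{
    assert(init != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        init[i]();
}

/* Terminators run in descending order: term[n-1], ..., term[1], term[0]. */
void run_terminators(const hook_fn *term, size_t count)
{
    assert(term != NULL || count == 0);
    for (size_t i = count; i > 0; i--)
        term[i - 1]();
}

/* Hypothetical hooks that record the order in which they are called. */
static char call_trace[8];
static size_t call_count;

static void record(char tag)
{
    if (call_count + 1 < sizeof call_trace)
        call_trace[call_count++] = tag;
}

static void init0(void) { record('a'); }
static void init1(void) { record('b'); }
static void term0(void) { record('A'); }
static void term1(void) { record('B'); }

/* Runs both arrays and returns the recorded trace. */
const char *demo_call_order(void)
{
    static const hook_fn init[] = { init0, init1 };
    static const hook_fn term[] = { term0, term1 };

    call_count = 0;
    run_initializers(init, 2);
    run_terminators(term, 2);
    call_trace[call_count] = '\0';
    return call_trace;
}
```

Calling demo_call_order() yields the trace "abBA": the two initializers in array order, then the two terminators in reverse array order.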
The loader source code also indicates a recursive descent initialization of the shared objects, but the devil is in the details, and rather than try to fully grok the source code, and possibly miss something critical, I decided to switch back to the empirical, and write a bunch of tests. Armed with the details of how the loader wants us to set up the init and term functions, I descended on our C++ linker, and brought it to the point where it could build a Mac .dylib (mostly). Many of the results are mundane, but a few are sort of interesting, and I’ll try to present them here compactly.
First some notation. In all the examples below, X refers to the executable. Other letters refer to shared objects. Solid directed lines indicate a static dependency between objects. For example, X -> A, means the executable depends on the shared library A, and we’d expect A to be initialized before X runs. Also we’d expect A to be terminated after X runs. Dotted lines in the examples mean we dynamically load a shared object with dlopen. That’s where a lot of the really horrible things happened in Kylix, because of all the package load/unload operations, so I invested some tests there. To show the actual initialization results, I’m going to use a very simple notation. I use the letter of the object for initialization, and the same letter, preceded by ~ to indicate termination. So for the example X -> A, I expect the following sequence: A X ~X ~A.
So, simple tests first.
X -> A
result: A X ~X ~A
    A
  / ^
X   |
  \ |
    B
result: A B X ~X ~B ~A
And a diamond:
       B
      / \
X -> D   A
      \ /
       C
result: A B C D X ~X ~D ~C ~B ~A
OK, moving on to a complicated static linking example. This one is a ladder, and here we should do a little explanation.
    D -> C -> B -> A
  / |    ^    |    ^
X   |    |    |    |
  \ v    |    v    |
    H -> G -> F -> E
In the example above, we have two straight legs of dependencies, with cross dependencies running back and forth between the legs. You can’t do a naive recursive depth first initialization of this, because you could end up running some initializers before their dependencies are initialized. You have to do a topological ordering of the dependencies, and initialize in that order. That can be done with a recursive operation, and it is done that way by the Mac loader. The easiest way to observe it is to follow all the dependency sequences in the ladder above. Here they are:
DCBA DCBFEA DHGFEA DHGCBA DHGCBFEA
The last one is the longest chain. In all the chains, D is at the top. The longest one is the initialization chain we need, since it guarantees that each library’s dependencies are initialized before the library itself. It is roughly what we expect out of the loader, barring ties in initialization order due to libraries at the same level in the dependency graph.
Here’s what we got when we ran the example: A E F B C G H D X ~X ~D ~H ~G ~C ~B ~F ~E ~A
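For readers who want to see the recursive topological ordering spelled out, here is a small sketch: a depth first walk that appends each library only after everything it depends on has been appended. This is my own illustration, not dyld's code. The dep_graph type, the single-letter library names, and the order in which each library lists its dependencies are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LIBS 26

/* deps[i] lists, as a string of letters, what library 'A' + i depends on. */
struct dep_graph {
    const char *deps[MAX_LIBS];
};

static void visit(const struct dep_graph *g, char lib, int *done,
                  char *order, size_t *n)
{
    assert(lib >= 'A' && lib <= 'Z');
    int idx = lib - 'A';
    if (done[idx])
        return;
    done[idx] = 1;                       /* mark first so a cycle cannot recurse forever */
    for (const char *d = g->deps[idx]; d != NULL && *d != '\0'; d++)
        visit(g, *d, done, order, n);    /* dependencies go first */
    order[(*n)++] = lib;                 /* then the library itself */
}

/* Writes an initialization order for root into order (MAX_LIBS + 1 bytes). */
void init_order(const struct dep_graph *g, char root, char *order)
{
    int done[MAX_LIBS] = { 0 };
    size_t n = 0;

    visit(g, root, done, order, &n);
    order[n] = '\0';
}

/* Builds the ladder example and returns its order, starting from X. */
const char *ladder_order(void)
{
    static char order[MAX_LIBS + 1];
    struct dep_graph g = { { 0 } };

    g.deps['X' - 'A'] = "DH";
    g.deps['D' - 'A'] = "CH";
    g.deps['C' - 'A'] = "B";
    g.deps['B' - 'A'] = "AF";
    g.deps['H' - 'A'] = "G";
    g.deps['G' - 'A'] = "FC";
    g.deps['F' - 'A'] = "E";
    g.deps['E' - 'A'] = "A";
    g.deps['A' - 'A'] = "";

    init_order(&g, 'X', order);
    return order;
}
```

With the dependency lists written in this order, ladder_order() returns "AEFBCGHDX", which lines up with the sequence observed above. A different listing order would give a different but equally valid ordering, so the match says nothing definite about how the loader breaks ties.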
OK, so things are looking up. I am getting what I expect out of the loader, and I’m not terribly surprised, because otherwise the world would have fallen apart long ago for the Mac, right? That’s what we thought on Kylix, and it’s why I looked so closely here. Anyway, now on to the dynamic loading cases. This is where we mix static loading with dynamic loading, and make sure the semantics are reasonable, and don’t require us to jump through too many hoops in the linker and RTL.
X -> A . . B -> C
In this example, X loads B with dlopen, and then unloads it with dlclose before exiting. B depends on C, but not A. Here’s what we got: A X [load B] C B [unload B] ~B ~C ~X ~A.
Now, just for fun, let’s try the same case, but let’s not call dlclose in X. Meaning we orphan the B shared library. Here’s where it gets harder to guess what would happen. Here’s what we got: A X [load B] C B ~B ~C ~X ~A. Yep, same as the case where we called dlclose! I certainly wouldn’t rely on this feature, but it’s nice to know about.
Now, a more complicated case where we try out load and unload sequences.
X --> A
.     ^
.     |
. . . B
.     ^
.     |
C ----
In the case above, there’s a special runtime sequence. X loads B, then loads C, then unloads B and unloads C. The interleaved order is purposeful - to be mean. There are no cyclical dependencies involved, but there are reference count issues. C depends on B, so when we manually unload B in X, we hope the OS doesn’t call the terminate routine in B, because C still needs B. And here’s what we got: A X [load B] B [load C] C [unload B] [unload C] ~C ~B ~X ~A. That’s what we wanted.
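The reference counting at work here can be modeled in a few lines. This is a sketch under assumptions, not a description of the loader's internals: the lib struct, lib_open, and lib_close are hypothetical names, each library has at most one dependency, and the model runs a terminator at the moment the count reaches zero, which the observed trace is consistent with but does not prove.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a loaded library. */
struct lib {
    char name;
    int refs;           /* references from explicit loads and from dependents */
    struct lib *dep;    /* at most one dependency, to keep the sketch small */
};

static char events[16];
static size_t event_count;

static void note(char c)
{
    if (event_count + 1 < sizeof events)
        events[event_count++] = c;
}

/* Models a load: the first reference initializes dependencies, then the library. */
void lib_open(struct lib *l)
{
    assert(l != NULL);
    if (l->refs++ > 0)
        return;                 /* already loaded, just count the reference */
    if (l->dep != NULL)
        lib_open(l->dep);
    note(l->name);              /* initializer runs */
}

/* Models an unload: only the last reference terminates the library. */
void lib_close(struct lib *l)
{
    assert(l != NULL && l->refs > 0);
    if (--l->refs > 0)
        return;                 /* someone still needs it */
    note('~');
    note(l->name);              /* terminator runs */
    if (l->dep != NULL)
        lib_close(l->dep);
}

/* Load B, load C, unload B, unload C, where C depends on B. */
const char *demo_interleaved_unload(void)
{
    struct lib b = { 'B', 0, NULL };
    struct lib c = { 'C', 0, &b };

    event_count = 0;
    lib_open(&b);
    lib_open(&c);
    lib_close(&b);
    lib_close(&c);
    events[event_count] = '\0';
    return events;
}
```

In this model the explicit unload of B only drops its count from 2 to 1, so nothing is terminated until C goes away, and the recorded trace is "BC~C~B".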
And for our last example, we’ll include a cyclical case. This is a case where the user manually, via dlopen, causes a cyclical dependency to appear in the shared object dependencies. This is bad, and there is no good way for the OS to resolve it. Still, we’d like to know what happens. For example, one outcome could have been for the OS to panic and kill the task. The way we create the cycle is to have a shared object use dlopen in its initialization routine to load another shared object that has a static dependency on something further up the initialization chain that hasn’t been initialized yet.
X -> D -> C -> B -> A
     ^         .
     |         .
     |         .
     -------- E
And what we got was this: A B E C D X ~X ~D ~C ~E ~B ~A
So what happened was that the OS stopped the initialization of the library loaded with dlopen at the point where it hit pending dependencies in an existing initialization chain. That meant the loaded library E had its initializer run with some dependencies uninitialized (C and D). This protected the main executable’s dependencies, sort of, but either way, the user is probably hosed. As a basic rule of thumb, people should avoid calling dlopen in a library initializer anyway unless they really know what they are doing.
So there’s a basic way of summarizing the results here:
- Each shared library has its init/term routine(s) called once and only once per process.
- The list of shared libraries for a task is initialized in one order, and terminated in the reverse order.
- All dependencies are initialized before their dependents.
This is a good result. It’s similar to Windows, and it appears to meet our basic requirements for package initialization code in the RTL without us having to add more magic to the tools over and above what we’ve done in the past to preserve initialization order.
One last thing. I promised to list the parameters to the initialization functions in the S_MOD_INIT_FUNC_POINTERS sections.
void func(int argc, const char **argv, const char **envp, const char **apple, struct ProgramVars *pvars);
And ProgramVars is this:
struct ProgramVars {
    const void* mh;
    int* NXArgcPtr;
    const char*** NXArgvPtr;
    const char*** environPtr;
    const char** __prognamePtr;
};
The interesting item here, really, is the mh field. That one is actually the Mach header for the image, which is really nice, because it means on OS X, a shared library has a reliable way of finding useful things about itself, like the precise start of sections in memory. I don’t know why there is the redundancy of information in the parameters passed to the init function, and the fields of the struct of the last parameter.
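As a rough sketch of how an image might capture its own Mach header from such an initializer, the following assumes a GCC- or Clang-style compiler that supports __attribute__((constructor)); on Mac toolchains of this era, as far as I know, that attribute is what places a function in a section flagged S_MOD_INIT_FUNC_POINTERS. The names init_calls, own_mach_header, and record_image_header are made up for the example. The extra parameters are only touched when building for OS X, since other platforms make no promise about them.

```c
#include <assert.h>
#include <stddef.h>

struct ProgramVars {
    const void   *mh;
    int          *NXArgcPtr;
    const char ***NXArgvPtr;
    const char ***environPtr;
    const char  **__prognamePtr;
};

int init_calls = 0;                    /* how many times the initializer ran */
const void *own_mach_header = NULL;    /* filled in on OS X only */

/* Runs before main(), courtesy of the compiler's constructor attribute. */
__attribute__((constructor))
static void record_image_header(int argc, const char **argv, const char **envp,
                                const char **apple, struct ProgramVars *pvars)
{
    (void)argc; (void)argv; (void)envp; (void)apple;
    init_calls++;
#ifdef __APPLE__
    /* Assumption: the loader passes the five parameters listed above. */
    assert(pvars != NULL);
    own_mach_header = pvars->mh;
#else
    (void)pvars;                       /* not meaningful off OS X; do not touch */
#endif
}
```

By the time main() starts, init_calls is 1, and on OS X own_mach_header should point at the image's Mach header.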
Ideally, none of you out there will ever have to worry about this at all, but if you do, there it is.
Posted by Eli Boling on January 29th, 2010 under C_Builder, Delphi
Mac OS X Exception Handling
Purists and pragmatists will argue over the feature of converting hardware level exceptions into RTL exceptions. This is the feature where you catch, for example, a memory access violation and keep running your application. Purists will say you should never do that, because you could wind up doing serious damage down the line to users’ data, if you don’t know what caused the bad pointer in the first place. Pragmatists will observe that some very large percentage of these come from things like null pointers resulting from some missing fence test, often triggered by some odd combination of user input, and why would you force the user to lose data by letting the app crash? I’m with the pragmatists here, although there are times when I wish I weren’t. Such as when I have to implement the support in the RTL on our various platforms. I’m going to think about a new metric for measuring the effort involved, based on a unit of length of fingernails chewed over the implementation period, which I think better represents the overall effort than a pure time measure.
On Win32, OS/hardware exceptions and language exceptions are all dispatched through the same mechanism: stack based exception registration records. Handling these and even resuming them is pretty straightforward.
On Linux, OS/hardware exceptions are quite a bit trickier to deal with. Exception handling is typically done using a PC mapped scheme that is vendor dependent. There is no standard for language based exceptions, and all the hardware exceptions are dispatched to signal handlers, which you can install with the POSIX signal handling APIs. There are also some strict protocols that you need to follow with respect to handling Ctrl-C and friends if you want to play by the rules for shell scripts. The POSIX signal handling APIs provide portable interfaces for getting the RTL in the game, but the devil is in the details. For example, when your signal handlers are invoked, you get a signal context sent to you that represents the machine state (processor registers, etc). That’s an opaque pointer, because POSIX really can only go so far as to define/restrict what you can and can’t do with the machine state for hardware exceptions. Thus once you land in your handler, everything becomes platform specific with respect to that context, both in its format, and what you can do with it.
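For reference, this is roughly what the portable part looks like. It is a minimal sketch using only the POSIX sigaction interface with SA_SIGINFO; the helper names (install_handler, last_signal_seen, context_was_delivered) are made up for the example. Note that the handler does nothing with the context beyond checking that it arrived, because interpreting it is exactly where portability ends.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static volatile sig_atomic_t last_signal = 0;
static volatile sig_atomic_t got_context = 0;

/* Three-argument handler form, selected by SA_SIGINFO. */
static void on_signal(int signo, siginfo_t *info, void *context)
{
    (void)info;
    last_signal = signo;
    /* context is the opaque machine state; its layout is platform specific. */
    got_context = (context != NULL);
}

/* Installs on_signal for signo. Returns 0 on success, -1 on failure. */
int install_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

int last_signal_seen(void)      { return (int)last_signal; }
int context_was_delivered(void) { return (int)got_context; }
```

A quick way to exercise it is install_handler(SIGUSR1) followed by raise(SIGUSR1), then checking the two accessors. A real RTL would hook SIGSEGV and friends and would have to pick the context apart per platform.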
In porting to OS X, I’ve tried to use POSIX as the common point between the Linux and OS X support as much as possible. So I did the same thing with exception handling, reviving the Kylix PC mapped exception handling support. The first thing that I did was to try to discover what I could do with the machine context that was delivered to my signal handlers.
Here’s the POSIX spec for your key signal handling setup API: http://www.opengroup.org/onlinepubs/9699919799/functions/sigaction.html. This API has been updated a bit with respect to its formal documentation in the past several years, but the fundamentals of it haven’t changed since I did the exception handling work in Kylix.
I started on OS X writing a simple C program so that I could investigate the behavioral characteristics of the machine context that is delivered to a signal handler. Specifically, I wanted to make sure that I could resume execution on the client thread/stack in the face of memory access violations (e.g. SIGSEGV), and keep the thread sane while delivering the critical data to the RTL for stack unwinding purposes. My simple test case ran fine, so I moved to the next step, which was to debug the signal handler, and confirm the validity and utility of the machine context. There was an immediate hull breach. I don’t remember all the terrible things that happened, but the nut of it is this: your signal handler will never be invoked if you run the app under GDB. To understand the reasons for this, you have to recall the fundamental structure of OS X.
The OS X kernel combines BSD components with the Mach kernel originally developed at CMU. The various POSIX APIs are layered on top of a Mach layer, which is the real process control layer that matters to low level implementors and tools, such as GDB. GDB uses the Mach exception handling APIs for injecting exception handlers into the process to watch for faults in the app, and the support in GDB doesn’t allow for chaining the exceptions on to signal handlers. Here’s a link to one discussion on the matter: http://archives.devshed.com/forums/bsd-93/catching-floating-point-exception-on-os-x-intel-2293939.html. Scroll down to the bottom, and you will see words like ‘forever’, and ‘long standing’ with respect to describing the behavior in GDB.
So that pretty much crosses off using signal handlers for hardware exceptions on OS X - I can’t very well write off GDB. I had looked around a bit more before and after I ran across that post, and found that folks who did VMs (Java, LISP) ran into the same issues. All the posts said pretty much the same thing: you have to use the Mach exception handling APIs if you want reliable support.
The Mach exception APIs work very differently for processing events. The basic model is that you allocate a port over which you will receive messages describing exceptions. The port can be configured to receive various types of exceptions (e.g. memory access, invalid instruction, and math error). Then you spin up a thread on which you wait for messages on that port. If another thread in your application faults, then the OS will stop that thread and dispatch a message to your exception port. The exception handling thread then processes the message. In our case, that means decoding the exception information, pre-chewing it a bit, and then fiddling with the faulting thread’s context to set the thread to continue in an RTL handler with that pre-chewed information sitting on the stack. Then we tell the OS that we’ve processed the exception message successfully, and we go back to looping looking for exception messages. The OS then restarts the faulting thread, where we land in our language specific RTL code to create the RTL exception object and unwind the stack.
Most of this is all under the covers, so to speak, for the application developer, but there are some points that are of interest:
- If you include SysUtils, because of the need for a thread to watch for exceptions, you will always get a second thread allocated for the application.
- We are not going to support recovery from stack faults. I think we had that requirement in Kylix, too, but we’ve made the decision for real here, and I’m going to phase out any Linux support for it as well at this point. That requirement is not really specific to Mach issues, but represents us giving up on the feature because of the various caveats that we’ve had to place on it from one platform to the next, making it too difficult to support in a uniform and useful fashion.
- Ctrl-C events are not dispatched via Mach exception messages; these are still caught as SIGINT, via POSIX signal handlers. This is actually good, because it means we won’t have to go through too much agony ensuring proper shell script handling semantics. So we’ll have both the Mach exception thread, and some signal handlers in play.
If I go into the gory details of the Mach implementation, this post will get too long, so I’m just going to include some links here to some references that I’ve found useful:
I strongly recommend that last article.
There are a couple of other miscellaneous things of interest to note about the Mach exception support. First, the books/articles/examples variously reference a function called exc_server, and a header file called exc_server.h. That header file doesn’t exist on the Mac, but the function is there, and needed. Second, there is an oddity with exc_server: when you call exc_server, it calls back into your application. However, you don’t tell exc_server the callback function. That function has to have a particular name, and is looked up out of your symbol table at runtime. Get the name wrong, and the app goes down in flames on the first exception.
Posted by Eli Boling on November 10th, 2009 under C_Builder, Delphi
OS X malloc
I could write a lot about OS X malloc. Other people already have, and maybe I’ll write some more about it at a later date. I just wanted to point out a couple of things to make people think a little.
The default allocator you get when you call malloc has a bunch of debugging support built in. Some of it is enabled all the time. For example, try this:
#include <stdlib.h>
int main(void) {
void *p = malloc(10);
free(p); free(p); // <- hey look: bad code
return 0;
}
If you compile and run this, you’ll get this:
bash-3.2$ ./a.out
a.out(11352) malloc: *** error for object 0x100120: double free
*** set a breakpoint in malloc_error_break to debug
This support pointed out a few goofs for us early in the RTL development stages. I’ll not go into all the details of the OS X memory manager; there is a lot of good documentation on it available, so I’ll just provide some pointers:
The links above include some details about the various switches you can apply to the memory manager, plus some additional custom malloc debugging libraries, plus some cool tools that are included in the shipping OS for analyzing allocations. Worth a read.
I suppose the first question I asked myself when I saw this was "Doesn’t that have an adverse performance effect on applications?". I think there are several answers to this. One is "Apparently not". Another is "To a degree". The first one comes from simple observation that the platform generally rocks. The second is due to reading various arguments and analyses on the web, some of which say the OS X allocator is slower than Linux, but has improved, some of which say these analyses are not very real world, and if you do real world tests, the OS X allocator rocks pretty well. Forgive me the weasel words - I admit to not doing in depth research on this. I haven’t had time to find or come up with a definitive answer for myself here. I may come back to the topic at a later date, though if the allocator doesn’t really show itself to be a problem when we profile things, I’m not going to worry about it. Mostly my point in blogging today was to make newcomers to the Mac platform aware of the fact that there is quite a bit of debugging and profiling support built right into the platform that is worth looking at and learning about.
Posted by Eli Boling on October 29th, 2009 under C_Builder, Delphi
OS X Initial Stack Layout
This post falls into the category of ‘I don’t have anything better to write about just now.’ It’s been a little while since I’ve posted, and there was something vaguely interesting here, so why not, I figured.
So, when the Mach kernel loads up a task, it sets up the stack, and transfers control to the entry-point in the image. On any given platform, it’s always good to know just what the OS stuffed onto the initial stack, because there might be something interesting or useful there that you can count on.
This is what your stack looks like on OS X when you hit your program’s entry point:
Bottom of stack (high addresses)
+------------------+
| Env/argv strings |
| |<------
+~~~~~~~~~~~~~~~~~~+ ^
| Path string |<-- |
+------------------+ ^ |
| 0 | | |
+------------------+ | |
| Pointer to Path |--> |
+------------------+ |
| 0 | |
+------------------+ |
| envp[n-1] |------>|
+------------------+ |
... |
+------------------+ |
| envp[0] |------>
+------------------+ |
| 0 | |
+------------------+ |
| argv[argc-1] |------>|
+------------------+ |
... |
+------------------+ |
| argv[0] |------>
+------------------+
| argc |
+------------------+
^
|
ESP ----
So basically, there is the argc, that you C coders would expect, and then the argv and envp arrays are just allocated right on the stack, and all the strings they point to are allocated right at the bottom of the stack. The C RTL startup code has to parse things up a teeny bit to get the traditional argv/envp pointers to pass along. There are two things that are interesting here. First is that the arrays are null terminated, and second is that nifty path entry. The first item is largely academic. The second item is not.
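The "parse things up a teeny bit" step can be sketched as follows. This is an illustration against a simulated stack, not real startup code: the startup_info struct and parse_initial_stack are made-up names, and the stack is modeled as an array of pointer-sized words with argc stored in the first one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What startup code would want to recover from the initial stack. */
struct startup_info {
    int argc;
    const char *const *argv;
    const char *const *envp;
    const char *exec_path;    /* the extra path entry after envp */
};

/* sp points at the argc word, as the stack pointer does on entry. */
void parse_initial_stack(const char *const *sp, struct startup_info *out)
{
    assert(sp != NULL && out != NULL);

    out->argc = (int)(uintptr_t)sp[0];
    out->argv = sp + 1;

    const char *const *p = out->argv + out->argc;
    assert(*p == NULL);       /* argv is null terminated */
    p++;

    out->envp = p;
    while (*p != NULL)        /* walk to the null that ends envp */
        p++;

    out->exec_path = p[1];    /* next word is the pointer to the path string */
}
```

To try it, build an array such as { (const char *)(uintptr_t)2, "prog", "-v", NULL, "HOME=/tmp", NULL, "/usr/bin/prog", NULL } and pass it in. The fields of startup_info then point straight into that array, much as argv and envp point into the real stack.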
Those of you with *nix background may recall that argv[0] is typically the path of the executable that is running. Typically. There are a number of occasions under Linux, for example, where argv[0] will not have that path. In the Pascal RTL, we jumped through a lot of hoops to try to reduce the odds of that actually causing any of our abstract support routines to fail. If argv[0] didn’t have what we wanted, we would dig around in /proc/self/exe, and find what we wanted. That really worked in pretty much all cases. However, it didn’t work in _all_ cases, because it’s possible to configure Linux to not mount the ‘proc’ file system, or to make it unreadable, in which case you are just plain out of luck. In other *nix platforms (e.g. OS X), the ‘proc’ file system just doesn’t exist at all. Hence this nice little extra parameter on the stack was interesting to me. I found it initially by just rooting around in the stack. After that, I wanted to make sure that it was a reliable thing, so I went rooting around in the sources, and I found this:
http://opensource.apple.com/source/xnu/xnu-1456.1.26/bsd/kern/kern_exec.c
Look for exec_copyout_strings. There are lots of cool things in that file, actually.
The end result is that we have a way on the Mac of always getting the executable path of the image that is currently running. That’s not something that was technically true on Linux. What fun!
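For application code that cannot go poking at the initial stack, here is a hedged sketch of the two lookups discussed here. On Linux it reads the /proc/self/exe link, with the caveat already mentioned that /proc may be missing. On OS X it uses _NSGetExecutablePath from mach-o/dyld.h, which is a different route from the stack entry described in this post. The function name executable_path is made up for the example.

```c
#if defined(__linux__)
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__APPLE__)
#include <mach-o/dyld.h>    /* _NSGetExecutablePath */
#elif defined(__linux__)
#include <unistd.h>         /* readlink */
#endif

/* Copies the running executable's path into buf. Returns 0 on success, -1 on failure. */
int executable_path(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
#if defined(__APPLE__)
    uint32_t len = (uint32_t)size;
    return _NSGetExecutablePath(buf, &len) == 0 ? 0 : -1;
#elif defined(__linux__)
    /* Fails if /proc is not mounted or not readable. A result of exactly
       size - 1 bytes may have been truncated. */
    ssize_t n = readlink("/proc/self/exe", buf, size - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
#else
    (void)buf;
    (void)size;
    return -1;              /* no known lookup on this platform */
#endif
}
```

A caller would typically pass a buffer of a few thousand bytes and fall back to argv[0] if the call returns -1.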
Oh, BTW, beyond the string area on the stack is nothing, as in the great void. There is no return address to anything on image startup. Once you are running, you have to leave via a call to exit(), or the equivalent. Or you could just not leave, I suppose.
Posted by Eli Boling on October 13th, 2009 under C_Builder, Delphi
Rip and Replace Resources: RIP
So, originally, Windows resources were designed when we didn’t have virtual memory support, and there was a need to have a well specified means for lazy loading of things like string tables and bitmaps. Once virtual memory support became mainstream, and the 16-bit days were over, this need diminished. However, by that time, there was a newer use for separate resource data: localization support.
The thinking for localization was that string tables, and other language sensitive resources could be replaced in a built binary with translated versions. To localize a product, you’d take the built binaries, strip their resources out, and replace them with the translated ones for the country you wanted to package for. Box it up, and out it goes, hooray!
In about this timeframe, the languages team embarked on the Kylix project. We had to support resources on linux, and there was no standard there for this, as there was on Windows. So we specified something, and built tools around it. The design allowed for rip and replace of resources, because that was the model that we felt was predominant at the time.
Fast forward a bit, to right about now. Nowadays, localization is generally not done through rip and replace of resource data. Instead, there are sideband resource DLLs that hold the locale specific data, and the application selects the appropriate resource DLL based on the current or user requested locale. Thus the last reason that I really know of to support rip and replace of resources has vanished, which is good.
So, for linux and Mac OS X, we have no plans for supporting rip and replace of resources. I’ve reworked the resource tools for POSIX platforms to simplify them a bit, and the data just gets stuffed into a read-only set of pages that are part of the normal image. They may be in a separate section, but that won’t necessarily always be the case. They could get merged in with the read-only data section at some point. The resource compilers will still work for these platforms, and all the RTL APIs for managing resources will be completely compatible. You just won’t ever see tools (e.g. rlink) that allow you to replace the resources that were linked in originally. This makes the design of the tooling simpler on our side, especially for platforms like linux, where the ELF format makes it somewhat tricky to easily manage rip and replace support. Mach-O files would be easier, but the overall tooling would still be onerous.
Simpler is often better, and this appears to be one of those times.
Posted by Eli Boling on July 30th, 2009 under C_Builder, Delphi
dlsym
I used to think that the subtle bits of the POSIX API dlsym were the lookup ordering rules, and the interactions with shared libraries. That’s still true, but I’ve found a new thing to be thumped by.
For a very quick refresher for you readers, dlsym lets you look up a symbol name in your load image symbol tables. It returns the address of the symbol code/data if it finds the symbol, and NULL otherwise. Here’s a link: http://www.opengroup.org/onlinepubs/009695399/functions/dlsym.html
I thought about trying to describe this chronologically, as I encountered it, but I decided I was just trying to get people to share the pain, so I changed my mind. I think there are basically three relevant chunks or categories of information to convey here. Let’s start with the OS X implementation of dlsym.
In the Mac OS X ABI Dynamic Loader Reference there is a sentence in the dlsym doc: "Unlike in the NS... functions, the symbol parameter doesn’t require a leading underscore to be part of the symbol name." Hunh.
The man page for dlsym is more strongly worded: "Unlike other dyld API’s, the symbol name passed to dlsym() must NOT be prepended with an under-score."
And just so you know how they really feel about it, here’s the dlsym source (dyldAPIs.cpp), severely edited to bring it straight to your door in technicolor:
void* dlsym(void* handle, const char* symbolName) {
    ...
    // dlsym() assumes symbolName passed in is same as in C source code
    // dyld assumes all symbol names have an underscore prefix
    char underscoredName[strlen(symbolName)+2];
    underscoredName[0] = '_';
    strcpy(&underscoredName[1], symbolName);
OK, now we should circle back and look at gcc and C code, and a fine point of symbol tables. Consider the following little POSIX ‘compliant’ program:
#include <stdio.h>
#include <dlfcn.h>
void foo() {}
int main(void) {
void *p = dlsym(RTLD_DEFAULT, "foo");
printf("%p\n", p);
return 0;
}
Now if you compile and run this on linux and OS X with gcc, both versions will print out an address for ‘foo’. But take a look at the symbol tables. Do this on OS X:
gcc dlsymtest.c; nm a.out | grep foo
You can do the same thing on linux. On OS X, you’ll get this:
00001faa T _foo
On linux, you’ll get a different address, and the symbol will not be prepended with an underscore. Now, I always interpreted dlsym as being dependent on the literal names in the symbol tables, not the literal names in the source code. This is particularly necessary if you are used to the world of C++ code, where the literal names in the source code have pretty much nothing to do with the literal names in the symbol table because of name mangling. You have to have some situational awareness (non-portable) when you are using dlsym. The OS X implementors, however, have given us something new here. On OS X, dlsym cannot be used to lookup a name unless that name starts with an underscore in the symbol table. I can speculate as to why they chose to do this. I can see a reading of the POSIX spec examples that implies that it should work this way. Personally, however, I think it was misguided. They may have enabled portability for some bits of code (which I don’t think should have been considered entirely portable in the first place), but they added a DWIM (Do What I Mean) layer onto what is, to my mind, a very raw API, adding a pointless requirement to names in symbol tables: they really have to start with an underscore.
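If you think in terms of literal symbol table names, the practical consequence can be captured in a tiny helper. This is a sketch with made-up names (dlsym_argument, LOADER_PREPENDS_UNDERSCORE): given the name as it appears in the symbol table, it returns the string you would hand to dlsym, or NULL when no string can reach that symbol.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: only OS X's dlsym prepends an underscore. */
#if defined(__APPLE__)
#define LOADER_PREPENDS_UNDERSCORE 1
#else
#define LOADER_PREPENDS_UNDERSCORE 0
#endif

/* Maps a literal symbol table name to the argument dlsym would need.
   Returns NULL if the symbol cannot be reached through dlsym at all. */
const char *dlsym_argument(const char *symtab_name, int prepends_underscore)
{
    assert(symtab_name != NULL);
    if (!prepends_underscore)
        return symtab_name;         /* name is used as is */
    if (symtab_name[0] != '_')
        return NULL;                /* dlsym would look for "_name", which is not there */
    return symtab_name + 1;         /* drop the underscore dlsym will add back */
}
```

A caller would write something like dlsym(RTLD_DEFAULT, dlsym_argument(name, LOADER_PREPENDS_UNDERSCORE)) after checking for NULL. The second parameter is explicit so both behaviors can be exercised on any platform.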
Now, I should be clear - there are alternatives to dlsym, and they are documented in the same ABI document that I cited above (the NS* APIs). However, the same document discourages their use, and explicitly directs you to the dl* APIs. The NS* APIs don’t do this dlsym reinterpretation. Something else interesting: the dynamic loader doesn’t use the dlsym logic when binding shared libraries. I know this because when I first wired up the Pascal linker to target OS X, I used existing external function declarations, which didn’t put an underscore on the imported names from libc. Thus I had literal names like ‘malloc’ in my symbol tables. The loader complained. I changed them to ‘_malloc’, and the binding succeeded.
From my point of view, the POSIX dlfcn.h APIs should be interpreted to specify the names of the APIs and types involved, and describe the lookup ordering semantics, but should leave out any interpretation of the literal strings fed into the APIs for names. That’s why I wrapped the word ‘compliant’ in quotes when I described my little C program above. It’s the API type and semantic issues that are the really nasty issue with respect to portability, not the names. Using ifdefs to deal with literal name changes from one compiler to the next is trivial compared to dealing with yet another custom dynamic name lookup API from one platform to the next. Maybe someone can give me a strong rationale for why the OS X implementors did the right thing. I think they goofed.
Posted by Eli Boling on June 15th, 2009 under C_Builder, Delphi
Mach-O OBJ Format Trivia
Mac OS X uses the Mach-O format for binaries. The spec for this is well published. There’s a brief summary about it on Wikipedia: http://en.wikipedia.org/wiki/Mach-O. Mostly I have no objections to the file format. You always have to read over the specs carefully to discover all the gotchas. There’s one bit that caused me some headaches: by convention, the header information appears in the first physical segment in the image. This is typically the code segment, and so the header info ends up offsetting the code a bit. The header info size depends on lots of things about your link, so it’s hard to determine its size up front. You end up with a bit of a chicken and egg problem if you want to conform. That’s no big deal, really; I just made a point of conforming because there’s no point in looking for trouble with the loader. You always deal with two specs in reality. There’s what’s printed on paper, and there’s what the loader does. Anyway, that’s not really what this post is about. It’s about something that irritated me a little.
The Mach-O format looks like it supports lots and lots of sections. Early adopters (not Mac OS X) used a limited number of sections (e.g. .text, .data and .bss, and that’s all). That was too limiting, so someone made a change to the Mach-O format for Apple. They changed the symbol table entries so that they could include a section index, so you could have the data for a given symbol appear in an arbitrary section. That’s good, because it’s really not enough to just have .text, .data and .bss for your image sections nowadays. Now the bit that bugs me is that they just used an unused field in the symbol structure (struct nlist, in nlist.h), and that field is only a byte. So you can have lots and lots of sections, provided that you don’t have any more than 255. The nlist structure is the only structure that causes this limitation. Now for executable images, it’s really unlikely that in a 32-bit flat image you’ll ever be wanting even close to that number, but the same file format is used for object files, and this raises a little issue for C++. C++ compilers emit tons of duplicate function definitions, and different platforms have different ways of dealing with this. Now, in ELF, the GNU tools emit one section per function. These sections have one symbol pointing at each section, and everything is marked WEAK, so things are only linked in if they are referenced. This is bulky, but really quite nice in other respects, because it’s easy to segregate the data. Because of the choice of a byte field for the section index in the symbol structure, however, this is not viable for C++ object file output on Mac OS X with Mach-O. Indeed, if you look at the output of g++, you will see all the duplicate functions being tossed into one section, leaving the linker to tease the bits apart.
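For reference, here is a small sketch of where the limit comes from. The struct below mirrors the 32-bit nlist layout from nlist.h field for field, but it is renamed (nlist_mirror, MIRROR_NO_SECT, MIRROR_MAX_SECT) so that it compiles on any platform. Those names, and the two helper functions, are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the 32-bit struct nlist, renamed so it builds anywhere. */
struct nlist_mirror {
    uint32_t n_strx;   /* string table index (n_un.n_strx in the real header) */
    uint8_t  n_type;   /* type flags */
    uint8_t  n_sect;   /* section ordinal: one byte, 1-based, 0 = no section */
    int16_t  n_desc;   /* descriptor bits */
    uint32_t n_value;  /* symbol value */
};

#define MIRROR_NO_SECT  0u    /* symbol is not in any section */
#define MIRROR_MAX_SECT 255u  /* largest ordinal that fits in one byte */

/* Returns 1 if a 1-based section ordinal can be stored in n_sect. */
int section_ordinal_fits(unsigned ordinal)
{
    return ordinal > MIRROR_NO_SECT && ordinal <= MIRROR_MAX_SECT;
}

/* Stores the ordinal; returns 0 on success, -1 if it cannot be represented. */
int set_symbol_section(struct nlist_mirror *sym, unsigned ordinal)
{
    assert(sym != NULL);
    if (!section_ordinal_fits(ordinal))
        return -1;
    sym->n_sect = (uint8_t)ordinal;
    return 0;
}
```

A compiler that wanted one section per function would start getting -1 back from set_symbol_section at the 256th function in a translation unit, which is why that scheme is a non-starter for object files here.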
Either way would be fine with me, it’s just that I wish that it were the same paradigm for both binary platforms, so that the various bits of code in the tools could all be very generic. I don’t like the choice being forced because someone was parsimonious with the symbol table layout. Spilt milk.
Posted by Eli Boling on May 26th, 2009 under C_Builder, Delphi
Mac OS X Stack Alignment
Preface: I’m back (again) [shades of Men In Black 2, sort of]. If you don’t understand what that means, don’t worry about it.
In the Mac OS X ABI Function Call Guide there is an innocent little sentence: "The stack is 16-byte aligned at the point of function calls." We’ve not been able to find out why this is required for the IA-32 environment, but they really mean it, and there are deep implications.
OS X is, under the covers, your basic *nix system, making heavy use of shared libraries. When your code references a function in a shared library, a little call stub is constructed by the linker, and the loader fixes up this stub to point to a loader helper function which will perform a lazy bind to the function. Take a look at the first instructions of that helper function:
0x8fe18be0 <__dyld_fast_stub_binding_helper_interface+0>: push 0x0
0x8fe18be2 <__dyld_stub_binding_helper_interface+0>: sub esp,0x64
0x8fe18be5 <__dyld_stub_binding_helper_interface+3>: mov DWORD PTR [esp+0x54],eax
0x8fe18be9 <__dyld_stub_binding_helper_interface+7>: mov eax,DWORD PTR [esp+0x68]
0x8fe18bed <__dyld_stub_binding_helper_interface+11>: mov DWORD PTR [esp+0x60],eax
0x8fe18bf1 <__dyld_stub_binding_helper_interface+15>: mov DWORD PTR [esp+0x68],ebp
0x8fe18bf5 <__dyld_stub_binding_helper_interface+19>: mov ebp,esp
0x8fe18bf7 <__dyld_stub_binding_helper_interface+21>: add ebp,0x68
0x8fe18bfa <__dyld_stub_binding_helper_interface+24>: mov DWORD PTR [esp+0x58],ecx
0x8fe18bfe <__dyld_stub_binding_helper_interface+28>: mov DWORD PTR [esp+0x5c],edx
0x8fe18c02 <__dyld_misaligned_stack_error+0>: movdqa XMMWORD PTR [esp+0x10],xmm0
0x8fe18c08 <__dyld_misaligned_stack_error+6>: movdqa XMMWORD PTR [esp+0x20],xmm1
0x8fe18c0e <__dyld_misaligned_stack_error+12>: movdqa XMMWORD PTR [esp+0x30],xmm2
0x8fe18c14 <__dyld_misaligned_stack_error+18>: movdqa XMMWORD PTR [esp+0x40],xmm3
Note the last four instructions. If you backtrack to the first couple of instructions, you’ll see that ESP gets adjusted by a total of 0x68 bytes (4 for the push, plus 0x64 for the sub). MOVDQA faults if its memory operand is not 16-byte aligned, so ESP has to be a multiple of 16 once that adjustment has been made. Since 0x68 leaves a remainder of 8 when divided by 16, ESP must sit exactly 8 bytes past a 16-byte boundary (0xnnnnnnn8) on entry to this helper, or the four instructions above will definitely kill you. The symbolic name that GDB reports for these instructions makes me infer that this is a gate-keeper function intended solely to ensure that the ABI stack alignment requirement is met. If you wonder why the alignment on entry has to be 8, recall that we’ve stepped through a linker constructed thunk. So if you back off the return address of that thunk, ESP ends in 0xC, and the 4 bytes that remain are the return address into the user code. Back off that return address, and you are down to the 16-byte alignment that the ABI requires at the point of the function call.
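The arithmetic is easy to get wrong in your head, so here is a small model of it in C. It only tracks the value of ESP from the call instruction in user code down to the first MOVDQA in the helper. The function names are made up for this sketch, and the byte counts are taken from the disassembly and the reasoning above.

```c
#include <assert.h>
#include <stdint.h>

/* ESP after pushing `bytes` onto the downward-growing IA-32 stack. */
uint32_t esp_after_push(uint32_t esp, uint32_t bytes)
{
    assert(bytes % 4 == 0);  /* everything pushed here is dword sized */
    return esp - bytes;
}

/* MOVDQA faults unless its memory operand is 16-byte aligned. */
int movdqa_operand_ok(uint32_t esp, uint32_t offset)
{
    return ((esp + offset) & 0xFu) == 0;
}

/* Walk ESP from the call site to the XMM saves in the helper.
   Returns 1 if the saves would succeed, 0 if they would fault. */
int helper_survives(uint32_t esp_at_call)
{
    uint32_t esp = esp_at_call;

    esp = esp_after_push(esp, 4);     /* return address into user code */
    esp = esp_after_push(esp, 4);     /* 4 bytes pushed by the linker-built thunk */
    esp = esp_after_push(esp, 4);     /* push 0x0 */
    esp = esp_after_push(esp, 0x64);  /* sub esp,0x64 */

    return movdqa_operand_ok(esp, 0x10) && movdqa_operand_ok(esp, 0x20)
        && movdqa_operand_ok(esp, 0x30) && movdqa_operand_ok(esp, 0x40);
}
```

Only a call-site ESP that is a multiple of 16 makes it through; being off by 4, 8 or 12 bytes all end the same way.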
This little gate-keeper is draconian since from a practical standpoint it means we have to maintain the 16-byte stack alignment in pretty much all of our code. The only time you can avoid keeping the stack aligned is if you can guarantee that the call tree you are dealing with will never escape your local unit (in the case of Pascal code). Why? Because at compile time, you cannot guarantee that any particular unit you are making a reference to will not be in a package, and hence reached via the dynamic loader.
For those of you thinking about Mac OS X, this means that you should do some planning. If you have assembly code which is not leaf code, you should inspect it very carefully to see that none of the call trees can escape the unit in which the code lives. Remember that the compiler will use helper functions in native Pascal code for a lot of things, like reference counting interfaces, copying strings, etc. Those helpers live in the System unit, and if you are linking against packages, you’ll go through the dynamic loader to get to them. Even if you don’t link against packages, some helpers might drop to the O/S for memory allocation, which will go through the dynamic loader. So if your assembly code calls what looks like a leaf function in the same unit, where that function is implemented in Pascal, you have to make sure that no helpers are used, if you don’t maintain stack alignment in your assembly code. Your alternative is to go through the assembly code, and manually align the stack prior to each call. I can tell you that I did that for a bunch of code, and it made my teeth ache.
Another option to consider is to just keep Pascal versions of your assembly coded routines, and use those, at least initially, on OS X. There are plenty of good reasons to keep around high level versions of these routines anyway. I’m very fond of this option, personally.
It’s somewhat interesting to look at what gcc does for stack frame construction on OS X. They always build a full EBP frame, and adjust the stack by the largest amount of local storage required up front in the prolog. The stack is aligned once at that point, and from then on, no PUSH instructions are used. I believe this is more efficient in the long haul, but it requires very different management of function return results when building parameter lists where there are nested function calls in the parameter lists.
So to recapitulate, on entry to your functions, you will find the stack aligned to 16 bytes, minus the return address. In other words, ESP will always be 0xnnnnnnnC. If you want to call a function in another unit, you have to ensure the stack is aligned at the point of the call instruction. Here are some examples:
procedure myAsmFunc;
asm
  // ESP will be 0xnnnnnnnC

  // call procedure A(a: integer; b: integer; c: integer); cdecl;
  push 0
  push 1
  push 2
  call A // OK, because ESP is now 0xnnnnnnnC - 12
  add esp, 12

  // call procedure B(a: integer; b: integer); cdecl;
  push 0
  push 1
  call B // _NOT_ OK, because ESP is now 0xnnnnnnnC - 8
  add esp, 8

  // call procedure B(a: integer; b: integer); cdecl;
  push ecx // add a dword to make the alignment come out
  push 0
  push 1
  call B // that's better, because ESP is now 0xnnnnnnnC - 12
  add esp, 12
end;
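If you have a lot of call sites to fix, it helps to have the padding rule written down as a formula. Here is a minimal sketch in C, assuming the entry condition described above (ESP is 0xnnnnnnnC on entry to your routine). The function name call_padding is made up for this example.

```c
#include <assert.h>

/* How many bytes of padding to push before the arguments so that ESP is
   16-byte aligned at the call instruction.
   pushed_since_entry: bytes this routine has already pushed since entry
   arg_bytes:          bytes of arguments about to be pushed
   Assumes ESP mod 16 was 0xC on entry to the routine. */
unsigned call_padding(unsigned pushed_since_entry, unsigned arg_bytes)
{
    assert(pushed_since_entry % 4 == 0 && arg_bytes % 4 == 0);

    /* ESP mod 16 at the call if no padding were added. Unsigned
       wrap-around is harmless because 2^32 is a multiple of 16. */
    unsigned low = (0xCu - pushed_since_entry - arg_bytes) & 0xFu;

    /* Pushing that many extra bytes brings ESP down to the boundary. */
    return low;
}
```

For the examples above, call_padding(0, 12) is 0 (the call to A needs nothing), and call_padding(0, 8) is 4, which is the extra push ecx in front of the second call to B.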
If you write pure Pascal code, then the compiler will take care of all the alignment work for you. Once you’ve done the hand alignment tweaking on existing, beautifully tuned IA-32 assembly code you may find your opinions shifting around on you with respect to how to approach the problem. Best of luck to you.
Oh, one more thing - you have to be careful with your testing of the manual alignment solution. Just because it runs doesn’t mean you got it right. Once you make the first call to the dynamic loader function, the OS will back-patch the thunk with the actual function address that you wanted to bind to. So if some other bit of code calls that thunk before your assembly code calls it, the gate-keeper code may no longer be on duty. It’s best if you single step the code, and verify that (ESP & 0xF) == 0 at all call instructions.
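To see why the gate-keeper goes off duty, here is a toy model of lazy binding in plain C. It is only an analogy: the real thunk is patched by the dynamic loader, not by a C function pointer, and every name in the sketch (stub_slot, binding_helper, real_target, bind_count) is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*target_fn)(int);

int bind_count = 0;  /* how many times the binding helper has run */

/* Stands in for the function that lives in the shared library. */
int real_target(int x)
{
    return x + 1;
}

int binding_helper(int x);

/* The "thunk": it starts out routed to the binding helper. */
target_fn stub_slot = binding_helper;

/* Runs on the first call only. This is where the alignment-sensitive
   code in the real helper would execute. */
int binding_helper(int x)
{
    ++bind_count;
    stub_slot = real_target;  /* back-patch: later calls skip the helper */
    return real_target(x);
}

/* What a call site does: always go through the thunk. */
int call_through_stub(int x)
{
    assert(stub_slot != NULL);
    return stub_slot(x);
}
```

The first call_through_stub goes through binding_helper; every call after that goes straight to real_target. A misaligned caller that happens to run second never meets the check, which is exactly how a bad call site can slip through a test run.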
Posted by Eli Boling on May 20th, 2009 under C_Builder, Delphi

