Purists and pragmatists will argue over the feature of converting hardware level exceptions into RTL exceptions. This is the feature where you catch, for example, a memory access violation and keep running your application. Purists will say you should never do that, because you could wind up doing serious damage down the line to users’ data, if you don’t know what caused the bad pointer in the first place. Pragmatists will observe that some very large percentage of these come from things like null pointers resulting from some missing fence test, often triggered by some odd combination of user input, and why would you force the user to lose data by letting the app crash? I’m with the pragmatists here, although there are times when I wish I weren’t. Such as when I have to implement the support in the RTL on our various platforms. I’m going to think about a new metric for measuring the effort involved, based on a unit of length of fingernails chewed over the implementation period, which I think better represents the overall effort than a pure time measure.
On Win32, OS/hardware exceptions and language exceptions are all dispatched through the same mechanism: stack-based exception registration records. Handling these, and even resuming from them, is pretty straightforward.
On Linux, OS/hardware exceptions are quite a bit trickier to deal with. Exception handling is typically done using a PC-mapped scheme that is vendor dependent. There is no standard for language-based exceptions, and all the hardware exceptions are dispatched to signal handlers, which you can install with the POSIX signal handling APIs. There are also some strict protocols that you need to follow with respect to handling Ctrl-C and friends if you want to play by the rules for shell scripts. The POSIX signal handling APIs provide portable interfaces for getting the RTL in the game, but the devil is in the details. For example, when your signal handler is invoked, you get a signal context that represents the machine state (processor registers, etc.). That’s an opaque pointer, because POSIX can only go so far in defining what you can and can’t do with the machine state for hardware exceptions. Thus, once you land in your handler, everything becomes platform specific with respect to that context, both in its format and in what you can do with it.
In porting to OS X, I’ve tried to use POSIX as the common point between the Linux and OS X support as much as possible. So I did the same thing with exception handling, reviving the Kylix PC-mapped exception handling support. The first thing I did was try to discover what I could do with the machine context that was delivered to my signal handlers.
Here’s the POSIX spec for your key signal handling setup API: http://www.opengroup.org/onlinepubs/9699919799/functions/sigaction.html. This API has been updated a bit with respect to its formal documentation in the past several years, but the fundamentals of it haven’t changed since I did the exception handling work in Kylix.
I started on OS X writing a simple C program so that I could investigate the behavioral characteristics of the machine context that is delivered to a signal handler. Specifically, I wanted to make sure that I could resume execution on the client thread/stack in the face of memory access violations (e.g. SIGSEGV), and keep the thread sane while delivering the critical data to the RTL for stack unwinding purposes. My simple test case ran fine, so I moved to the next step, which was to debug the signal handler, and confirm the validity and utility of the machine context. There was an immediate hull breach. I don’t remember all the terrible things that happened, but the nut of it is this: your signal handler will never be invoked if you run the app under GDB. To understand the reasons for this, you have to recall the fundamental structure of OS X.
OS X is built on the XNU kernel, which layers a BSD personality on top of the Mach microkernel originally developed at CMU. The various POSIX APIs sit above the Mach layer, which is the real process control layer that matters to low-level implementors and to tools such as GDB. GDB uses the Mach exception handling APIs to inject exception handlers into the process to watch for faults in the app, and the support in GDB doesn’t allow for chaining the exceptions on to signal handlers. Here’s a link to one discussion on the matter: http://archives.devshed.com/forums/bsd-93/catching-floating-point-exception-on-os-x-intel-2293939.html. Scroll down to the bottom, and you will see words like ‘forever’ and ‘long standing’ used to describe the behavior in GDB.
So that pretty much crosses off using signal handlers for hardware exceptions on OS X - I can’t very well write off GDB. I had looked around a bit more before and after I ran across that post, and found that folks who did VMs (Java, LISP) ran into the same issues. All the posts said pretty much the same thing: you have to use the Mach exception handling APIs if you want reliable support.
The Mach exception APIs work very differently for processing events. The basic model is that you allocate a port over which you will receive messages describing exceptions. The port can be configured to receive various types of exceptions (e.g. memory access, invalid instruction, and math error). Then you spin up a thread on which you wait for messages on that port. If another thread in your application faults, the OS stops that thread and dispatches a message to your exception port. The exception handling thread then processes the message. In our case, that means decoding the exception information, pre-chewing it a bit, and then fiddling with the faulting thread’s context to set the thread to continue in an RTL handler with that pre-chewed information sitting on the stack. Then we tell the OS that we’ve processed the exception message successfully, and we loop back to wait for more exception messages. The OS then restarts the faulting thread, which lands in our language specific RTL code to create the RTL exception object and unwind the stack.
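With all error handling omitted, and with install_mach_exception_handler and exception_thread being names I made up for the sketch (exc_server, catch_exception_raise, mach_port_allocate, and task_set_exception_ports are the real Mach interfaces), the skeleton of that model looks something like this. It only compiles on OS X:

```c
#include <mach/mach.h>
#include <pthread.h>

/* Not declared in any shipped header (see below), but present in the library. */
extern boolean_t exc_server(mach_msg_header_t *request, mach_msg_header_t *reply);

/* exc_server dispatches to this function; the name and signature must match
   exactly, or the dispatch fails on the first exception. */
kern_return_t catch_exception_raise(mach_port_t exception_port,
                                    mach_port_t thread,
                                    mach_port_t task,
                                    exception_type_t exception,
                                    exception_data_t code,
                                    mach_msg_type_number_t code_count)
{
    /* Decode the fault, then use thread_get_state/thread_set_state to point
       the faulting thread at the RTL's unwinding entry point. */
    return KERN_SUCCESS;  /* tells the kernel to restart the faulting thread */
}

static mach_port_t exc_port;

static void *exception_thread(void *arg)
{
    union { mach_msg_header_t head; char data[1024]; } msg, reply;
    for (;;) {
        /* Block waiting for an exception message from the kernel. */
        mach_msg(&msg.head, MACH_RCV_MSG, 0, sizeof msg, exc_port,
                 MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
        /* exc_server decodes the message and calls catch_exception_raise. */
        exc_server(&msg.head, &reply.head);
        /* Send the reply; this is what restarts the stopped thread. */
        mach_msg(&reply.head, MACH_SEND_MSG, reply.head.msgh_size, 0,
                 MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    }
    return NULL;
}

void install_mach_exception_handler(void)
{
    pthread_t t;
    mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &exc_port);
    mach_port_insert_right(mach_task_self(), exc_port, exc_port,
                           MACH_MSG_TYPE_MAKE_SEND);
    task_set_exception_ports(mach_task_self(),
                             EXC_MASK_BAD_ACCESS | EXC_MASK_BAD_INSTRUCTION |
                             EXC_MASK_ARITHMETIC,
                             exc_port, EXCEPTION_DEFAULT, THREAD_STATE_NONE);
    pthread_create(&t, NULL, exception_thread, NULL);
}
```

The exception mask passed to task_set_exception_ports is what selects memory access, invalid instruction, and math errors, corresponding to the exception types mentioned above.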
Most of this is all under the covers, so to speak, for the application developer, but there are some points that are of interest:
- If you include SysUtils, you will always get a second thread allocated in the application, because of the need for a dedicated thread to watch for exception messages.
- We are not going to support recovery from stack faults. I think we had that restriction in Kylix, too, but we’ve made the decision for real here, and I’m going to phase out any Linux support for it as well at this point. The restriction is not really specific to Mach issues; it represents us giving up on the feature because of the various caveats we’ve had to place on it from one platform to the next, which made it too difficult to support in a uniform and useful fashion.
- Ctrl-C events are not dispatched via Mach exception messages; these are still caught as SIGINT, via POSIX signal handlers. This is actually good, because it means we won’t have to go through too much agony ensuring proper shell script handling semantics. So we’ll have both the Mach exception thread, and some signal handlers in play.
If I go into the gory details of the Mach implementation, this post will get too long, so I’m just going to include some links here to some references that I’ve found useful:
I strongly recommend that last article.
There are a couple of other miscellaneous things of interest to note about the Mach exception support. First, the books/articles/examples variously reference a function called exc_server, and a header file called exc_server.h. That header file doesn’t exist on the Mac, but the function is there, and needed. Second, there is an oddity with exc_server: when you call exc_server, it calls back into your application, but you don’t tell exc_server the callback function. That function has to have a particular name, and is looked up out of your symbol table at runtime. Get the name wrong, and the app goes down in flames on the first exception.

Posted by Eli Boling on November 10th, 2009 under C_Builder, Delphi, OS X