Watch, Follow, &
Connect with Us

The Oracle at Delphi

Older Stuff

Monitoring the Monitor

No, not this Monitor, this Monitor. Ever since it’s introduction, it’s been the subject of both scorn and praise… ok, ok, scorn. Recently, during our current field test for the next version of RAD Studio*, one of our long-time field-testers posted a reference to this excellent blog post by Eric Grange relating to the performance of the TMonitor vs. the Windows Critical Section. In that post, Eric makes a strong case against TMonitor. Since I’m somewhat personally invested in the TMonitor (read: I wrote it), I took that post as a challenge. I would have responded directly were it not for the fact that he’s closed comments for that post. No biggie. For the record, while Eric and I don’t always agree, he always has well reasoned and thorough arguments.

Eric published a simple test case to demonstrate what he was talking about. I took that test case and began to dig into it. First of all I looked into what Windows was doing in the critical section code down in the kernel. A quick spelunk with the CPU view was rather revealing. In this specific case, the critical section wasn’t ever actually blocking on a kernel-level object. In other words it was doing a busy-wait. Under the high-contention test case that Eric presented, the busy-wait was the magic that served to blow TMonitor out of the water! As a baseline here’s my result with Eric’s app unchanged:

Once I determined that a spin-lock was in play, it all became much clearer. TMonitor can (and always has) optionally do a busy-wait before blocking. You must manually set the spin count on a specific TMonitor instance by calling TMonitor.SetSpinCount(Obj, nnnn). So, the first thing I did was to add the following to the FormCreate() event handler:

  System.TMonitor.SetSpinCount(Self, 100000);

With that single change, suddenly things got a whole lot better:

We’re on the right track here… but not perfect. I experimented with various spin counts, but above 1000 there were diminishing returns. So, something else must be in play here. I decided to experiment with a raw spin-lock. Luckily Delphi already has one, System.SyncObjs.TSpinLock, to be exact. Instead of using a TMonitor, I changed the code to use a TSpinLock. Whoa! Look at these results:

Now we’re gettin’ somewhere! Hmmm… so what is the difference here? Setting the spin count on TMonitor should be roughly equivalent to the raw TSpinLock… so looked at the code for each and then it hit me! Exponential backoff! Duh! The TSpinLock implements an exponential backoff that will yield or sleep every X iterations.** TMonitor doesn’t implement an exponential backoff for it’s busy-wait implementation. What is interesting is that the TMonitor does have an internal TSpinWait record that does implement exponential backoff. So I made a very small modification to TMonitor to now use the TSpinWait record within the TMonitor.Enter(Timeout: Cardinal); method within the System unit.

Added the following local var declaration:

  SpinWait: TSpinWait;

Added/Removed the following lines

  if SpinCount > 0 then
    StartCount := GetTickCount;
    SpinWait.Reset; //  Added
    while SpinCount > 0  do
    YieldProcessor; // Removed
    SpinWait.SpinCycle; // Added
    // Keep trying until the spin count expires

Now here’s the final result:

Hmm… I’d say that’s darn near the same. While I cannot say for certain, but I imagine the difference in the graphic patterns could be attributable to a difference in the exponential backoff algorithm and parameters. I also reduced the TMonitor spin count to 500 and got the same results. On my machine anything over 500 didn’t make any difference. Below about 250, it started doing kernel-level blocks and performance dropped.

With just that one simple change, TMonitor is matching and even sometimes now exceeding the performance of the Windows critical section. On the whole, I’d declare them equivalent. This does, however, mean you will need to explicitly set the spin count on the TMonitor to some value. Normally the “right” value is a guess based on empirical testing for your specific use case. I suspect the Windows critical section code selects some value based on the number of cores and CPU speed. At some point, I’ll need to research an algorithm that can automatically select a reasonable spin count.

There is one thing I can say that TMonitor has over the Windows critical section… it works on all platforms (Windows, MacOS, iOS, and soon Android). Feel free to use whichever one you feel more comfortable using for your specific situation. Me? I will fully admit I’m biased…

*Yes, there is a field test going on and if you already have XE4 or purchase XE4, you become eligible to get access. You can go here to request access.

**This is not the same as blocking, it only forces the current thread to relinquish it’s current quanta (time slice) to allow another thread to run. This only adds thread-switching overhead without the added kernel object blocking overhead.

Posted by Allen Bauer on August 23rd, 2013 under Delphi | 25 Comments »

Give in to the ARC side

I thought I’d take a few moments (or more) to answer some questions regarding the just introduced Automatic Reference Counting (ARC) mechanism to Delphi on mobile platforms (namely iOS in XE4, and Android in a future release). Among some sectors of our customer base there seems to be some long-running discussions and kvetching about this change. I will also say that among many other sectors of our customer base, it’s been greeted with marked enthusiasm. Let’s start with some history…

Meet the new boss, same as the old boss.

For our long-time Delphi users, ARC is really nothing new. Beginning back in Delphi 2 (early 1996), the first Delphi release targeting Windows 32bit, the Delphi compiler has done ARC. Long-Strings, which broke free from the long standing Turbo Pascal limit of 255 characters, introduced the Delphi developer to the whole concept of ARC. Until then, strings (using the semi-reserved word “string”) had been declared and allocated “in-place” and were limited to a maximum of 255 characters. You could declare a string type with less than 255 characters by using the “string[<1-255>]” type syntax.

Strings changed from being allocated at compile time to a reference type that pointed to the actual string data in a heap-allocated structure. There are plenty of resources that explain how this works. For the purposes here, the heap-allocated data structure could be shared among many string variables by using a “reference-count” stored within that data structure. This indicated how many variables are pointing to that data. The compiler managed this by calling special RTL helper functions when variables are assigned or leave the current scope. Once the last string reference was removed, the actual heap-allocated structure could be returned to the free memory pool.

Hey this model works pretty well… Oh look! MS’ COM/OLE/ActiveX uses ARC too!

Delphi 3 was a seminal release in the pantheon of Delphi releases. For those who remember history, this is the first Delphi release in which the founder of Turbo Pascal, and the key architect of Delphi, Anders Hejlsberg, was no longer with the company. However, even with the departure of Anders, Delphi 3 saw the release of many new features. Packages and interfaces were major new language enhancements. Specifically interesting were interfaces, which placed support for COM/OLE/ActiveX-Style interfaces directly into the language.

The Delphi compiler already had support and logic for proper handling of an ARC type, strings, so it was an incremental step forward to add such support for interfaces. By adding interface ARC directly into the language, the Delphi developer was freed from the mundane, error-prone task of manually handling the reference counting for COM interfaces. The Delphi developer could focus on the actual business of using and accessing interfaces and COM/ActiveX objects without dealing with all those AddRef/Release calls. Eventually those poor C++ COM developers got a little help through the introduction of “smart pointers”… But for a while, the Delphi developers were able to quietly snicker at those folks working with COM in C++… oh ok, we still do…

Hey, let’s invite dynamic arrays to this party!

Introduced in Delphi 4, dynamic arrays freed developers from having to work with non-range-checked unbounded arrays or deal with the hassles of raw pointers. Sure there are those who get a clear kick out using raw memory locations and mud-wrestling and hog-tying some pointers… Dynamic arrays took a few pages from the string type playbook and applied it an array data type whose elements are now user-defined. Like strings, they too use the ARC model of memory management. This allows the developers to simply set their length and pass them around without worrying about who should manage their memory.

If ARC is good, Garbage Collection is sooo much better…(that’s a joke, son)

For a while, Delphi cruised along pretty well. There was Delphi 5, then a foray into the hot commodity of the time, Linux where the non-Delphi-named Kylix was developed. Delphi 6 unified the whole Windows and Linux experience… however not much really happened on the compiler front other than retargeting the x86 Delphi compiler to generate code fitting for Linux. By the time Delphi 7 was released, work on the Delphi compiler had shifted to targeting .NET (notice I have ignored that whole “oh look! didn’t Borland demonstrate a Delphi compiler targeting the Java VM?” thing… well… um… never-mind…). .NET was the culmination (up to that time) of the efforts of Anders Hejlsberg since he’d left for Microsoft several years prior.

Yes there has been a  lot of discussion over the years about whether or not MS really said that .NET was the “future” of the Windows API… From my recollection, I precisely remember that being explicitly stated from several sectors of the whole MS machine. Given the information at the time, it was clearly prudent for us to look into embracing the “brave new world” of .NET, it’s VM, and whole notion of garbage collection. We’re not mind readers, so regardless of any skepticism about this direction (oh and there was plenty of that), we needed to do something.

Since I have the benefit of 20/20 hindsight and having been a first-hand witness and participant in the whole move to .NET, there are some observations and lessons to be learned from that experience. I remember the whole foray into .NET as generating the most number of new and interesting language innovations and enhancements than I’ve seen since Delphi 2. A new compiler backend generating CIL (aka. MSIL), the introduction of class helpers which allow the injection of methods into the scope of an existing class (this was before C# “extension methods” which are very similar, if not more limited than helpers). I also think there were some missteps, which at the time, I know I vehemently defended. Honestly, I would probably have quite the heated discussion if the “me” of now were to ever meet the “me” of then Winking smile.

Rather than better embrace the .NET platform, we chose to mold that platform into Delphi’s image. This was reinforced by the clearly obvious link between the genesis behind Delphi itself and the genesis of .NET. They both were driven by the same key developer/architect. On several levels this only muddled things up. Rather than embracing GC (I have nothing really against GC, in fact, I find that it enables a lot of really interesting programming models), we chose to, more or less, hide it. This is where the me of now gets to ridicule the me of then.

Let’s look at one specific thing I feel we “got wrong”… Mapping the “Free” method to call IDisposable.Dispose. Yes, at the time this made sense, and I remember saying how great it was because though the magic of class helpers you can call “Free” on any .NET object from Delphi and it would “do the right thing.” Yes, of course it “did the right thing”, at the expense of holding the platform at arm’s length and never really embracing it. Cool your jets… Here, I’m using the term “platform” to refer more to the programming model, and not the built-in frameworks. Not the .NET platform as a whole…People still wrote (including us) their code as if that magic call to Free was still doing what it always has done.

The Free method was introduced onto TObject for the sole purpose of exception safety. It was intended to be used within object destructors. From a destructor, you never directly called Destroy on the instance reference, you would call Free which would check if the instance reference was non-nil and then call Destroy. This was done to simplify the component developer’s (and those that write their own class types) efforts by freeing (no pun intended) them from doing that nil check everywhere. We were very successful in driving home the whole create…try…finally…free coding pattern, which was done because of exceptions. However, that pattern really doesn’t need to use Free, it could have directly called Destroy.

Foo := TMyObj.Create;
  {... work with Foo here}
  Foo.Free; {--- Foo.Destroy is just as valid here}

The reason that Foo.Destroy is valid in this instance is because if an exception were raised during the create of the object, it would never enter the try..finally block. We know that if it enters the try..finally block the assignment to Foo has happened and Foo isn’t nil and is now referencing a valid instance (even if it were garbage or non-nil previously).

Under .NET, Free did no such thing as “free” memory… it may have caused the object to “dispose” of some resources because the object implemented the IDisposable interface pattern. We even went so far as to literally translate the destructor of declared classes in Delphi for .NET into the IDisposable pattern, even if all that destructor did was to call free other object instances, and did nothing with any other non-memory resource. IOW, under a GC environment it did a whole lot of nothing. This may sound like heresy to some, but this was a case where the power of the platform was sacrificed at the altar of full compatibility.

Come to the ARC side

What is different now? With XE4, we’ve introduced a Delphi compiler that directly targets the iOS platform and it’s ARM derivative processor. Along with this, ARC has now come to full fruition and is the default manner in which the lifetime of all object instances are managed. This means that all object references are tracked and accounted for. This also means that, like strings, once the number of references drop to 0, the object is fully cleaned up, destroyed and it’s memory is returned to the heap. You can read about this in detail in this whitepaper here.

If you ask many of my coworkers here, they’ll tell you that I will often say, “Names are important.” Names convey not only what something is, but in many cases what it does. In programming this is of paramount importance. The names you choose need to be concise, but also descriptive. Internal jargon shouldn’t be used, nor should some obscure abbreviation be used.

Since the focus of this piece (aside from the stroll down memory lane) is surrounding “Free”, let’s look at it. In this instance, Free is a verb. Free has become, unfortunately, synonymous with Destroy. However that’s not really it’s intent. It was, as stated above, was about writing exception-safe destructors of classes. Writing exception-safe code is also another topic deserving of its own treatment.

Rather than repeat the mistake in Delphi for .NET in how “Free” was handled, Free was simply removed. Yes, you read that right. Free has been removed. “What?” But my code compiles when I call Free! I see it in the RTL source! We know that there is a lot code out there that uses the proverbial pattern, so what is the deal!? When considering what to do here, we saw a couple of options. One was to break a lot of code out there and force folks to IFDEF their code, the other was to find a way to make that common pattern still do something reasonable. Let’s analyze that commonly implemented pattern using the example I showed above.

Foo is typically a local variable. The intent of the code above is to ensure that the Foo instance is properly cleaned up. Under ARC, we know that the compiler will ensure that will happen regardless. So what does Foo.Free actually do!? Rather than emit the well known “Unknown identifier” error message, the compiler simply generates code similar to what it will be automatically generated to clean up that instance. In simplistic terms, the Foo.Free; statement is translated to Foo := nil; which is then, later in the compile process, translated to an RTL call to drop the reference. This is, effectively, the same code the compiler will generate in the surrounding function’s epilogue. All this code has done is simply do what was going to happen anyway, just a little earlier. As long as no other references to Foo are taken (typically there isn’t even in Non-ARC code), the Foo.Free line will do exactly what the developer expects!

“But, but, but! wait! My code intended to Free, destroy, deallocate, etc… that instance! Now it’s not!” Are you sure? In all the code I’ve analyzed, which includes a lot of internal code, external open-source, even some massive customer projects used for testing and bug tracking, this pattern is not only common, it’s nearly ubiquitous. On that front we’ve succeed in “getting the word out” about the need for exception safety. If the transient local instance reference is the only reference, then ARC dictates that once this one and only one reference is gone, the object will be destroyed and de-allocated as expected. Semantically, your code remains functioning in the same manner as before.

To steal a line from The Matrix, “You have to realize the truth… there is no Free”.

Wait… but my class relies on the destructor running at that point!

Remember the discussion above about our handling of IDisposable under Delphi for .NET? We considered doing something similar… implement some well-known interface, place your disposal code in a Dispose method and then query for the interface and call if present. Yuck! That’s a lot of work to, essentially, duplicate what many folks already have in their destructors. What if you could force the execution of the destructor without actually returning the instance memory to the heap? Any reference to the instance would remain valid, but will be referencing a “disposed” instance (I coined the term a “zombie” instance… it’s essentially dead, but is still shambling around the heap). This is, essentially, the same model as the IDisposable pattern above, but you get it for “free” because you implemented a destructor. For ARC, a new method on TObject was introduced, called DisposeOf.

Why DisposeOf, and not simply Dispose? Well, we wanted to use the term Dispose, however, because Dispose is also a standard function there are some scoping conflicts with existing code. For instance, if you had a destructor that called the Dispose standard function on a typed pointer to release some memory allocated using the New() standard function, it would fail to compile because the “Dispose” method on TObject would be “closer in scope” than the globally scoped Dispose. Bummer… So DisposeOf it is.

We’ve found, in practice, and after looking at a lot of code (we do have many million lines of Delphi code at our disposal, and most of it isn’t ours), it became clear that the more deterministic nature of ARC vs. pure GC, the need to actively dispose of an instance on-demand was a mere fraction of the cases. In the vast majority (likely >90%) can simply let the system work. Especially in legacy code where the above discussed pattern is used. The need for calling DisposeOf explicitly is more the exception than the rule.

So what else does DisoseOf solve? It is very common among various Delphi frameworks (VCL and FireMonkey included), to place active notification or list management code within the constructor and destructor of a class. The Owner/Owned model of TComponent is a key example of such a design. In this case, the existing component framework design relies on many activities other than simple “resource management” to happen in the destructor.

TComponent.Notification() is a key example of such a thing. In this case, the proper way to “dispose” a component, is to use DisposeOf. A TComponent derivative isn’t usually a transient instance, rather it is a longer-lived object which is also surrounded by a whole system of other component instances that make up things such as forms, frames and datamodules. In this instance, use DisposeOf is appropriate.

For class instances that are used transiently, there is no need for any explicit management of the instance. The ARC system will handle it. Even if you have legacy code using the pattern, in the vast majority of cases you can leave that pattern in place and the code will continue to function as expected. If you wanted to write more ARC-aware code, you could remove the try..finally altogether and rely on the function epilogue to manage that instance.

Feel the power of the ARC side.

“Ok, so what is the point? So you removed Free, added DisposeOf… big deal. My code was working just fine so why not just keep the status quo?” Fair question. There is a multi-layered answer to that. One part of the answer involves the continuing evolution of the language. As new and more modern programming styles and techniques are introduced, there should be no reason the Delphi language cannot adopt such things as well. In many cases, these new techniques rely on the existence of some sort of automatic resource/memory management of class instances.

One such feature is operator overloading. By allowing operators on instances, the an expression may end up creating several “temporary” instances that are referenced by “temporary” compiler-created variables inaccessible to the developer. These “temporary” variable must be managed so that instances aren’t improperly “orphaned” causing a memory leak. Relying on the same management mechanism that all other instances use, keeps the compiler, the language, and the runtime clean and consistent.

As the language continues to evolve, we now have a more firm basis on which to build even more interesting functionality. Some things under consideration are enhancements such as fully “rooting” the type system. Under such a type system, all types are, effectively, objects which descend from a single root class. Some of this work is already under way, and some of the syntactical usages are even available today. The addition of “helper types” for non-structured types is intended to give the “feeling” of a rooted type system, where expressions such as “42.ToString();” is valid. When a fully rooted type system is introduced, such an expression will continue to work as expected, however, there will now be, effectively, a real class representing the “42” integer type. Fully rooting the type system will enable many other things that may not be obvious, such as making generics, and type constraints even more powerful.

Other possibilities include adding more “functional programming” or “declarative programming” elements to the language. LINQ is a prime example of functional and declarative programming elements added to an imperative language.

Another very common thing to do with a rooted type system is to actively move simple intrinsic type values to the heap using a concept of “boxing”. This entails actually allocating a “wrapper” instance, which happens to actually be the object that represents the intrinsic type. The “value” is assigned to this wrapper and now you can pass this reference around as any old object (usually as the “root” class, think TObject here). This allows any thing that can reference an object to also reference a simple type. “Unboxing” is the process by which this is reversed and the previously “boxed” value is extracted from the heap-allocated instance.

What ARC would enable here is that you can still interact with the “boxed” value as if it were still a mere value without worrying about who is managing the life-cycle of the instance on the heap. Once the value is “unboxed” and all references to the containing heap-instance are released, the memory is then returned to the heap for reuse.

The key thing to remember here: this is all possible with a natively compiled language without the need for any kind of virtual machine or runtime compilation. It is still strongly and statically typed. In short, the power and advantages gained by “giving in to the ARC side” will become more apparent as we continue to work on the Delphi language… and I’ve not even mentioned some of the things we’re working on for C++…

Posted by Allen Bauer on June 14th, 2013 under ARC, ARM, Android, Delphi, Mobile, iOS | 60 Comments »

Delphi-Treff interview–In English

I recently did an email interview with Martin Strohal of the Delphi-Treff Team. I got permission to publish the original English version (Since my German is a little rusty…)

Delphi XE2 will be published this year. What are the key features of this new release? (Is this the release named "Pulsar"?)

Customers will now be able to target Windows 32bit, Windows 64bit, and Mac OSX 32bit. XE2 introduces a new cross-platform GUI-centric, GPU accelerated component framework called, FireMonkey. VCL also received an extensive upgrade with the introduction of Styles. New in XE2 is LiveBindings. This provides a powerful and flexible system that allows binding any kind of data source to any property or properties. The data source can be nearly anything, including other properties.

There will be a new framework called FireMonkey. Can you tell us, how FireMonkey works and what’s is job?

FireMonkey is designed from the ground up to be cross-platform. It, by design, isolates all platform specifics into an independent platform layer. While FireMonkey extensively uses components, how it actually renders to the GUI is significantly different from VCL. While VCL uses independent, self-contained components that all render using their own techniques or even wrap existing Windows controls, FireMonkey manages the display of content using compositing. This allows for significantly more flexibility in GUI-design. Animation is built into the framework in order to allow very interactive and advanced user interactions. Like animation, filters and transforms are also built in which allow the who UI of portions thereof to be manipulated. For instance, a small modal popup could be displayed and rather than merely disabling the main UI, you could apply a blurring effect to the UI behind the modal popup giving it more depth of field. This blurring effect is applied while compositing the UI and is independent of any rendering of the components/controls.

Is FireMonkey a replacement for the VCL or an addition?

VCL was first and foremost designed to be a relatively thin wrapper to make Windows programming simpler and more accessible. VCL effectively embraced many Windows programming concepts and made them intrinsic to the framework. This certainly made Windows programming a far more productive and pleasant experience. It also inextricably tied VCL to the Windows platform and all its unique characteristics. We had several goals with FireMonkey. First of all we wanted a framework that allowed for the creation of very rich, interactive, modern UIs. We also wanted a framework that wasn’t hog-tied to a given platform. FireMonkey is not intended as a replacement for VCL; rather it is intended as a whole new way for customers to embrace the emerging market for richer, more interactive desktop applications along with the burgeoning mobile space.

If I want to run an existing Delphi application under Mac OS X. Do I have to convert it to FireMonkey first? Will there be a converter?

VCL and FireMonkey share common RTL and database components such as dbXpress and DataSnap. While you will not be able to simply recompile your VCL based application for Mac OSX, you will be able to take all your code which exclusively uses the RTL and DB components. As for converters, I know that at the time of this writing there are several third-parties offering VCL->FireMonkey converter products.

What are your future plans for FireMonkey?

More platforms and mobile. FireMonkey is how we’re keeping relevant in the emerging heterogeneous mobile and desktop platform world currently emerging. Throughout most of the ’90s and early ’00s, the mobile computing space was non-existent or very niche. Apple and the Mac OS were actually in the decline and many weren’t sure they’d be around to see 2000. What a different world we’re in now. The desktop Mac OSX is making significant inroads into the enterprise, and the mobile space is anything by niche. Tying Delphi strictly to the Windows platforms ignores huge opportunities for both Embarcadero and all our Delphi customers, new and old. With FireMonkey, XE2 is positioned to be the only /native/ cross-platform framework that targets both major desktop operating systems and one of the dominant mobile operating systems, iOS. Expect to see FireMonkey become more powerful and even easier to use and target even more mobile platforms in future releases.

The applications cross compiled for OS X are native. Is there the new Delphi compiler on duty? And will it be used for "normal" Win32 applications in future?

There are three new compilers introduced with XE2. Delphi Windows 64bit, Delphi Mac OSX 32bit, and C++ Mac OSX 32bit. All of these compilers are derived from the existing codebase. They all essentially share the same respective "front-ends", the part of the compiler that translates the source-code into an intermediate form in preparation for generating machine code. The existing 32bit Delphi and 32bit C++ compilers are still very much in business. We have some research projects in progress for targeting even more platforms and CPU architectures.

If new compiler: Is the new compiler fully downwards compatible? Or are there some functions abandoned?

For XE2, the current compilers were employed in order to ensure maximum backward compatibility. Looking to the future, we’re currently researching new directions for both a compiler architecture which allows for quicker targeting of new architectures and looking at adding more advanced, and even more modern language features. This may mean eschewing some older features of the language.

Are there some new Features in Delphi XE2 for people who will only develop VCL-Win32-Applications?

As evidenced by XE2, VCL is still very much a key part of the product. With the addition of Styles, the programmer can take their existing VCL based applications and update and modernize the look and feel by using the new Style engine. The third-party component support remains one of, if not the best for all independent development tools on the market. VCL is still the fastest and easiest way to develop*Windows* applications. Also, with XE2 and now being able to target 64bit Windows, most VCL applications can now be merely recompiled for 64bit, subject to the normal 32bit->64bit caveats.

Will there be a new Starter edition again? And do you have any plans for a free Delphi (for getting more new blood in the Delphi community)?

Starter edition is very much a key part of our product line. When you compare the price point of the Starter edition taking account of inflation with the price of the original Turbo Pascal coupled with the vastly superior capabilities of Starter compare to Turbo Pascal, I think you get far more value than the price. We also have very competitive offerings for the educational markets, where one can get nearly 80-90% off of all the products. As for a free edition, we’re always looking at ways to grow the community base without the potential for harming our existing, very strong and growing market. At this point we feel that the Starter edition provides a good balance of price, capabilities and value. Starter is positioned directly at the new customer by including features that most new customers would need right away to in order to both learn the environment and begin to develop commercial applications.

Posted by Allen Bauer on October 14th, 2011 under CodeGear | 3 Comments »

More x64 assembler fun-facts–new assembler directives

The Windows x64 ABI (Application Binary Interface) presents some new challenges for assembly programming that don’t exist for x86. A couple of the changes that must be taken into account can can be seen as very positive. First of all, there is now one and only one OS specified calling convention. We certainly could have devised our own calling convention like in x86 where it is a register-based convention, however since the system calling convention was already register based, that would have been an unnecessary complication. The other significant change is that the stack must always remain aligned on 16 byte boundaries. This seems a little onerous at first, but I’ll explain how and why it’s necessary along how it can actually make calling other functions from assembly code more efficient and sometimes even faster than x86. For a detailed description of the calling convention, register usage and reservations, etc… please see this. Another thing that I’ll discuss is exceptions and why all of this is necessary.

For an given function there are three parts we’re going to talk about, the prolog, body, and epilog. The prologue and epilogue contain all the setup and tear-down of the function’s “frame”. The prolog is where all the space on the stack is reserved for local variables and, different from how the x86 compiler works, the space for the maximum number of parameter space needed for all the function calls within the body. The epilog does the reverse and releases the reserved stack space just prior to returning to the caller. The body of a function is where the user’s code is placed, either in Pascal, or as we’ll see this is where your assembler code you write will go.

You may be wondering why the prolog is reserving parameter space in addition to the space needed for local variables. Why not just push the parameters on the stack right before calling a function? While there is technically nothing keeping the compiler from placing parameters for a function call on the stack immediately before a call, this will have the effect of making the exception tables larger. As I mentioned above, exceptions in x64 are not implemented the same as in x86, which was a stack-based linked list of records. In x64, exceptions are done using extra data generated by the compiler that describes the stack changes for a given function and where the handlers/finally blocks are located. By only modifying the stack within the prolog and epilog, “unwinding” the stack is easier and more accurate. Another side benefit is that when passing stack parameters to functions, the space is already available so the data merely needs to be “MOV”ed onto the stack without the need for a PUSH. The stack also remains properly aligned, so no extra finagling of the RSP register is necessary.


Delphi for Windows 64bit introduced several new assembler directives or “pseudo-instructions”, .NOFRAME, .PARAMS, .PUSHNV, and .SAVENV. These directives allow you to control how the compiler sets up the context frame and ensures that the proper exception table information is generated.


Some functions never make calls to other functions. These are called “leaf” functions because the don’t do any further “branching” out to other functions, so like a tree, they represent the “leaf” For functions such as this, having a full stack frame may be extra overhead you want eliminate. While the compiler does try and eliminate the stack frame if it can, there are times that it simply cannot automatically figure this out. If you are certain a frame is unnecessary, you can use this directive as a hint to the compiler.

.PARAMS <max params>

This one may be a little confusing because it does not refer to the parameters passed into the current function, rather this directive should be placed near the top of the function (preferably before any actual CPU instructions) with a single ordinal parameter to tell the compiler what the maximum number of parameters will be needed for all the function calls within the body. This will allow the compiler to properly reserve extra, properly aligned, stack space for passing parameters to other functions. This number should reflect the maximum number of parameters for all functions and should include even those parameters that are passed in registers. If you’re going to call a function that takes 6 parameters, then you should use “.PARAMS 6”.

When you use the .PARAMS directive, a pseudo-variable @Params becomes available to simplify passing parameters to other functions. It’s fairly easy to load up a few registers and make a call, but the x64 calling convention also requires that callers reserve space on the stack even for register parameters. The .PARAMS directive ensures this is the case, so you should still use the .PARAMS directive even if you’re going to call a function in which all parameters are passed in registers. You use the @Params pseudo-variable as an array, where the first parameter is at index 0. You generally don’t actually use the first 4 array elements since those must be passed in registers, so you’ll start at parameter index 4. The default element size is the register size of 64bits, so if you want to pass a smaller value, you’ll need a cast or size override such as “DWORD PTR @Params[4]”, or “ @Params[4].Byte”. Using the @Params pseudo-variable will save the programmer from having to manually calculate the offsets based on alignments and local variables. UPDATE: I foobar’ed that one… The @Params[] array is an array of bytes, which allows you to address every byte of the parameters. Each parameter takes up 8 bytes (64bits), so you’ll need to scale accordingly to access each parameter. Casting or size overrides are still necessary. The above bad example should have been: “DWORD PTR @Params[4*8]” or “ @Params[4*8].Byte”. Sorry about that.


According to the x64 calling convention and register usage spec, there are some registers which are considered non-volatile. This means that certain registers are guaranteed to have the same value after a function call as it had before the function call. This doesn’t mean this register is not available for usage,  it just means the called function must ensure it is properly preserved and restored. The best place to preserve the value is on the stack, but that means space should be reserved for it. These directives provide both the function of ensuring the compiler includes space for the register in the generated prolog code and actually places the register’s value in that reserved location. It also ensures that the function epilog properly restores the register before cleaning up the local frame. .PUSHNV works with the 64bit general purpose registers RAX…R15 and .SAVENV works with the 128bit XMM0..XMM15 SSE2 registers. See the above link for a description of which registers are considered non-volatile. Even though you can specify any register, volatile or non-volatile as a parameter to these directives, only those registers which are actually non-volatile will be preserved. For instance, .PUSHNV R11 will assemble just fine, but no changes to the frame will be made. Whereas, .PUSHNV R12 will place a PUSH R12 instruction right after the PUSH RBP instruction in the prolog. The compiler will also continue to ensure that the stack remains aligned. Remember when I talked about why the stack must remain 16byte aligned? One key reason is that many SSE2 instructions which operate on 128bit memory entities require that the memory access be aligned on a 16byte boundary. Because the compiler ensures this is the case, the space reserved by the .SAVENV directive is guaranteed to be 16byte aligned.

Writing assembler code in the new x64 world can be daunting and frustrating due to the very strict requirements on stack alignment and exception meta-data. By using the above directives, you are signaling your intentions to the one thing that is pretty darn good at ensuring all those requirements are met; the compiler. You should always ensure the directives are placed at the top of the assembler function body before any actual CPU instructions. This makes sure the compiler has all the information and everything is already calculated for when it begins to see the actual CPU instructions and needs to know what the offset from RBP where that local variable is located. Also, by ensuring that all stack manipulations happen within the prolog and epilog, the system will be able to properly “unwind” the stack past a properly written assembler function. Without this data, the OS unwind process could become lost and at worst, skip exception handlers, or at worst call the wrong one and lead to further corruption. If the unwind process gets lost enough, the OS may simply kill the process without any warning, similar to what stack overflows do in 32bit (and 64bit).

Posted by Allen Bauer on October 10th, 2011 under 64bit, CodeGear, Delphi, General, Work | Comment now »

x64 assembler fun-facts

While implementing the x64 built-in assembler for Delphi 64bit, I got to “know” the AMD64/EM64T architecture a lot more. The good thing about the x64 architecture is that it really builds on the existing instruction format and design. However, unlike the move from 16bit to 32bit where most existing instruction encodings were automatically promoted to using 32bit arguments, the x64 design takes a different approach.

One myth about the x64 instructions is that “everything’s wider.” That’s not the case. In fact many addressing modes which were taken as absolute addresses (actually offsets within a segment, but the segments are 4G in 32bit), are actually now 32bit relative offsets now. There are very few addressing modes which use a full 64bit absolute address. Most addressing modes are 32bit offsets relative to one of the 64bit registers. One interesting addressing mode that is “implied” in many instruction encodings is the notion of RIP-relative addressing. RIP, is the 64bit equivalent of the 32bit EIP, or 16bit IP, or Instruction Pointer. This represents from which address the CPU will fetch the next instruction for execution. Most hard-coded addresses within many instructions are now relative offsets from the current RIP register. This is probably the biggest thing you have to wrap your head around when moving from 32bit assembler.

Even though many instructions will implicitly use the RIP-relative addressing mode, there are some instruction addressing modes that continue to use a 32bit offset, and are not RIP-relative. This can really bite you when doing simple mechanical translations from 32bit to 64bit. These are the SIB form with a 32bit (or even 8bit) offset. What can happen is that you end up forming an address that can only address 32bits, and is thus limited to addressing items below the 4G boundary! And this is a perfectly legal instruction! To demonstration this, consider the following 32bit assembler that we’ll translate to 64bits.

    TestArray: array[0..255] of Word;

  function GetValue(Index: Integer): Word;
    MOV AX,[EAX * 2 + TestArray]

Let’s now translate this for use in 64bit using a simple mechanical translation.

    TestArray: array[0..255] of Word;

  function GetValue(Index: Integer): Word;
    MOV AX,[RAX * 2 + TestArray]

Pretty straight forward, right? Not so fast there partner. Let’s see; I know that I need to use a full 64bit register for the offset but since Integer is still 32bits, I need to “sign-extend” it to 64bits. The venerable MOVSX (Move with sign extension) instruction “promotes” the signed 32bit offset to 64bits while preserving the sign. Nope, that’s not a problem. The only thing I changed in the next instruction was EAX to RAX, so how could that be a problem? Well, when you compile this code you’ll get a rather strange error message:

[DCC Error] Project7.dpr(18): E2577 Assembler instruction requires a 32bit absolute address fixup which is invalid for 64bit

Huh? Remember the little note above about the SIB instruction form? Because the RAX (or EAX in 32bit) register is being scaled (the * 2), this instruction must use the SIB (Scale-Index-Base) instruction form. When using the SIB form RIP isn’t considered when calculating the actual address. Additionally, the offset encoded in the instruction can still only be 8 or 32bits. No 64bit offsets.

In 32bit, the compiler would generate a “fixup” to ensure that the encoding of the instruction offset field to the global “TestArray” variable was properly “fixed up” at runtime should the image happened to be relocated to another address. This is a 32bit absolute address. The 64bit version of this instruction, while actually a truly valid instruction, would only have 32bits in which to place the address of “TestArray.” The “fixup” generated would have to remain 32bit. This could lead to creating an image that were it ever relocated above the 4G boundary, would likely crash at best or read the wrong memory address at worst!

Ok, so now what? There is a SIB form that we can use to work around this problem, but it requires burning another register. The good news is that we now have another 8 registers with which to work. So if you have a rather complicated chunk of 32bit assembler code that burns up all the existing usable 32bit registers, you now have another group of registers that can help solve this problem without having to rework the code even more. So here’s how to fix this for 64 bit:

    TestArray: array[0..255] of Word;

  function GetValue(Index: Integer): Word;
    LEA R10,[TestArray]
    MOV AX,[RAX * 2 + R10]

Here, I used the volatile R10 register (R8 an R9 are used for parameter passing) to get the absolute address of TestArray using the LEA instruction. While the “address” portion of this instruction is still 32bits, it is taken as RIP-relative. In other words, this value is the “distance” from the next instruction to the variable TestArray in memory. After this instruction, R10 now contains a true 64bit address of the TestArray variable. I must still use the SIB form in the next instruction, but instead of a hard-coded “offset” I use the value in R10. Yes, there is still an implicit offset of 0, which uses the 8bit offset form.

You can see that mindless, mechanical translations of assembler code is likely to cause you some grief due to some of the subtle changes in instruction behaviors. For this very reason, we strongly recommend you use all Object Pascal code instead of resorting to assembler when possible. This will not only better ensure that your code will more likely move unchanged to other processor architectures (think ARM here folks), but you’ll not have to worry about such assembler gotchas in the future. If you’re using assembler code because “it’s faster,” I would encourage you to look closely at the algorithm used. There are many cases where the proper algorithm written in Object Pascal will yield greater gains than a simple translation to assembler using the same algorithm. Yes there are some things which you simply must do in assembler (strange, off-beat calling conventions, “LOCK” instructions for concurrency, etc…), but I would contend that many assembler functions can be moved back to Object Pascal with little impact on performance.

Posted by Allen Bauer on October 5th, 2011 under 64bit, Delphi, General | 6 Comments »

“Talk Amongst Yourselves” #3

So far we’ve had “Testing synchronization primitives” and “Writing a ‘self-monitoring’ thread-pool.” Let’s build on those topics, and discuss what to do with exceptions that occur within a scheduled work item within a thread pool.

My view is that exceptions should be caught and held for later inspection, or re-raised at some synchronization point. What do you think should happen to the exceptions? Should they silently disappear, tear-down the entire application, or should some mechanism be in place to allow the programmer to decide what to do with them?

Posted by Allen Bauer on April 7th, 2010 under Delphi, General, Parallel Programming, Work | 5 Comments »

Another installment of “Talk Amongst Yourselves”

Let’s start thinking about thread pools. How do you manage a general purpose thread pool in the face of no-so-well-written-code? For instance, a task dispatched into the thread pool never returns, effectively locking that thread from ever being recycled. How do you monitor this? How long do you wait before spooling out a new thread? Do you keep a “monitor thread” that periodically checks if a thread has been running longer than some (tunable) value? What are the various techniques for addressing this problem?

So, there you go… Talk amongst yourselves.

Posted by Allen Bauer on March 26th, 2010 under CodeGear, Delphi, General, Parallel Programming, Work | 7 Comments »

This is the last day…

In this office. I’ve been in the same physical office for nearly 15 years. After years of accumulation, it now looks positively barren. Beginning next Monday, March 29th, 2010, I’ll be in a new building, new location, and new office. The good thing is that the new place is a mere stone’s throw from the current one. It will be great to leave all the Borland ghosts behind.

Posted by Allen Bauer on March 26th, 2010 under CodeGear, Delphi, Generics, Personal, Work | 3 Comments »

Simple question… very hard answer… Talk amongst yourselves…

I’m going to try a completely different approach to this post. I’ll post a question and simply let the discussion ensue. I would even encourage the discussion to spill over to the public newsgroups/forums. Question for today is:

How can you effectively unit-test synchronization primitives for correctness or more generally, how would you test a concurrency library?

Let’s see how far we can get down this rabbit hole ;-).

Posted by Allen Bauer on March 22nd, 2010 under CodeGear, Delphi, General, Parallel Programming | 26 Comments »

A Happy Accident and a Silly Accident

By now you’re all aware that we’re getting ready to move to a new building here in Scotts Valley. This process is giving us a chance to clean out our offices and during all these archeological expeditions, some lost artifacts are being (re)discovered. Note the following:

SNC00041 SNC00042

These are some bookends that my father made for me within the first year after moving my family to California to work on the Turbo Pascal team. He made these at least two years before Delphi was released, and at a few 6 months before we even began work on it in earnest. Certainly before the codename “Delphi” was ever thought of. I suppose they are my “happy” accident.

This next one is just sad. I received this award at the 2004 Borcon in San Jose from, then Borland President/CEO, Dale Fuller. My title at that time was “Principal Architect”… Of course I like to think that I have strong principles, and maybe that was what they were trying to say… Within a week or so after I got this plaque, another one arrived with the correct spelling of my title. I keep this one just for the sheer hilarity of it. Also, it is a big chunk of heavy marble, so maybe one day I can use to to create a small marble topped table…


Posted by Allen Bauer on February 23rd, 2010 under CodeGear, General, Random, Work | 3 Comments »

Server Response from: BLOGS2