The Oracle at Delphi














x64 assembler fun-facts

While implementing the x64 built-in assembler for Delphi 64bit, I got to “know” the AMD64/EM64T architecture a lot better. The good thing about the x64 architecture is that it really builds on the existing instruction format and design. However, unlike the move from 16bit to 32bit, where most existing instruction encodings were automatically promoted to use 32bit operands, the x64 design takes a different approach.

One myth about the x64 instructions is that “everything’s wider.” That’s not the case. In fact, many addressing modes which were taken as absolute addresses in 32bit (actually offsets within a segment, but the segments are 4G in 32bit) are now 32bit relative offsets. There are very few addressing modes which use a full 64bit absolute address. Most addressing modes are 32bit offsets relative to one of the 64bit registers. One interesting addressing mode that is “implied” in many instruction encodings is RIP-relative addressing. RIP is the 64bit equivalent of the 32bit EIP, or 16bit IP, or Instruction Pointer. It holds the address from which the CPU will fetch the next instruction for execution. Most hard-coded addresses within instructions are now relative offsets from the current value of the RIP register. This is probably the biggest thing you have to wrap your head around when moving from 32bit assembler.
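
The arithmetic behind a RIP-relative reference can be sketched in plain Object Pascal. This is only an illustration, assuming Delphi (where Integer is 32 bits); the helper name RipRelativeTarget and the sample addresses are made up for this example and are not part of the compiler or the RTL.

```pascal
{$Q-}{$R-} // overflow/range checks off so unsigned address arithmetic can wrap

// Hypothetical helper, for illustration only.
// NextInstr: address of the instruction that follows the one being executed
//            (this is the value RIP holds while the instruction runs).
// Disp32:    the signed 32-bit displacement encoded in the instruction.
function RipRelativeTarget(NextInstr: UInt64; Disp32: Integer): UInt64;
begin
  // Int64(Disp32) sign-extends the displacement to 64 bits; the UInt64 cast
  // reinterprets the bits so a negative displacement wraps the way address
  // arithmetic does.
  Result := NextInstr + UInt64(Int64(Disp32));
end;
```

For example, with the next instruction at $0000000140001010 and a displacement of $2FF0, the sketch yields $0000000140004000. The instruction itself only ever stores the 32bit distance, never the full 64bit address.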

Even though many instructions will implicitly use the RIP-relative addressing mode, there are some instruction addressing modes that continue to use a 32bit offset and are not RIP-relative. These are the SIB forms with a 32bit (or even 8bit) offset, and they can really bite you when doing simple mechanical translations from 32bit to 64bit. What can happen is that you end up forming an address that can only address 32bits, and is thus limited to addressing items below the 4G boundary! And this is a perfectly legal instruction! To demonstrate this, consider the following 32bit assembler that we’ll translate to 64bits.

  var
    TestArray: array[0..255] of Word;

  function GetValue(Index: Integer): Word;
  asm
    MOV AX,[EAX * 2 + TestArray]
  end;

Let’s now translate this for use in 64bit using a simple mechanical translation.

  var
    TestArray: array[0..255] of Word;

  function GetValue(Index: Integer): Word;
  asm
    MOVSX RAX,ECX
    MOV AX,[RAX * 2 + TestArray]
  end;

Pretty straightforward, right? Not so fast there, partner. Let’s see; I know that I need to use a full 64bit register for the offset, but since Integer is still 32bits, I need to “sign-extend” it to 64bits. The venerable MOVSX (Move with sign extension) instruction “promotes” the signed 32bit offset to 64bits while preserving the sign. Nope, that’s not a problem. The only thing I changed in the next instruction was EAX to RAX, so how could that be a problem? Well, when you compile this code you’ll get a rather strange error message:

[DCC Error] Project7.dpr(18): E2577 Assembler instruction requires a 32bit absolute address fixup which is invalid for 64bit

Huh? Remember the little note above about the SIB instruction form? Because the RAX (or EAX in 32bit) register is being scaled (the * 2), this instruction must use the SIB (Scale-Index-Base) instruction form. When using the SIB form, RIP isn’t considered when calculating the actual address. Additionally, the offset encoded in the instruction can still only be 8 or 32bits. There are no 64bit offsets.

In 32bit, the compiler would generate a “fixup” to ensure that the offset field of the instruction, which refers to the global “TestArray” variable, was properly “fixed up” at runtime should the image happen to be relocated to another address. This is a 32bit absolute address. The 64bit version of this instruction, while a perfectly valid instruction, would only have 32bits in which to place the address of “TestArray.” The “fixup” generated would have to remain 32bit. This could lead to creating an image that, were it ever relocated above the 4G boundary, would crash at best or read the wrong memory address at worst!

Ok, so now what? There is a SIB form that we can use to work around this problem, but it requires burning another register. The good news is that we now have another 8 registers with which to work. So if you have a rather complicated chunk of 32bit assembler code that burns up all the existing usable 32bit registers, you now have another group of registers that can help solve this problem without having to rework the code even more. So here’s how to fix this for 64 bit:

  var
    TestArray: array[0..255] of Word;

  function GetValue(Index: Integer): Word;
  asm
    MOVSX RAX,ECX
    LEA R10,[TestArray]
    MOV AX,[RAX * 2 + R10]
  end;

Here, I used the volatile R10 register (R8 and R9 are used for parameter passing) to get the absolute address of TestArray using the LEA instruction. While the “address” portion of this instruction is still 32bits, it is taken as RIP-relative. In other words, this value is the “distance” from the next instruction to the variable TestArray in memory. After this instruction, R10 contains a true 64bit address of the TestArray variable. I must still use the SIB form in the next instruction, but instead of a hard-coded “offset” I use the value in R10. Yes, there is still an implicit offset of 0, which uses the 8bit offset form.
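
To make the address calculation in that last MOV concrete, here is a rough Object Pascal model of what the SIB form computes: base + index * scale + offset. It is a sketch only, assuming Delphi (32bit Integer); SibEffectiveAddress and its parameter names are invented for this illustration and do not exist in the RTL.

```pascal
{$Q-}{$R-} // overflow/range checks off so unsigned address arithmetic can wrap

// Hypothetical model of the SIB address calculation, for illustration only.
// Base:  full 64-bit value of the base register (R10 in the code above).
// Index: the 32-bit index before sign extension (ECX in the code above).
// Scale: 1, 2, 4 or 8 (2 here, since each array element is a Word).
// Disp:  the 8- or 32-bit offset encoded in the instruction (0 in the fix).
function SibEffectiveAddress(Base: UInt64; Index: Integer; Scale: Byte;
  Disp: Integer): UInt64;
var
  Offset: Int64;
begin
  // Int64(Index) performs the same sign extension MOVSX does, so a negative
  // index still yields a negative 64-bit offset.
  Offset := Int64(Index) * Scale + Disp;
  Result := Base + UInt64(Offset);
end;
```

With a base of $0000000140008000, an index of 3, a scale of 2 and an offset of 0, the sketch gives $0000000140008006. Note that the full 64bit address comes from the base register; the encoded offset never has to hold it.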

You can see that mindless, mechanical translations of assembler code are likely to cause you some grief due to some of the subtle changes in instruction behaviors. For this very reason, we strongly recommend you use all Object Pascal code instead of resorting to assembler when possible. Not only does this make it more likely that your code will move unchanged to other processor architectures (think ARM here, folks), but you’ll not have to worry about such assembler gotchas in the future. If you’re using assembler code because “it’s faster,” I would encourage you to look closely at the algorithm used. There are many cases where the proper algorithm written in Object Pascal will yield greater gains than a simple translation to assembler using the same algorithm. Yes, there are some things which you simply must do in assembler (strange, off-beat calling conventions, “LOCK” instructions for concurrency, etc…), but I would contend that many assembler functions can be moved back to Object Pascal with little impact on performance.
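
As a small illustration of that advice, the GetValue example used throughout this post could be written without any assembler at all. This is a sketch rather than a drop-in replacement: unlike the assembler versions, it is subject to range checking when that compiler option is enabled.

```pascal
var
  TestArray: array[0..255] of Word;

// Pure Object Pascal version of GetValue. The compiler picks the addressing
// mode for the target, so the same source builds for both 32bit and 64bit.
function GetValue(Index: Integer): Word;
begin
  Result := TestArray[Index];
end;
```

Here the sign extension of Index and the choice of instruction form are left to the code generator, so there is no fixup to get wrong.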

Posted by Allen Bauer on October 5th, 2011 under 64bit, Delphi, General



6 Responses to “x64 assembler fun-facts”

  1. Michael Thuma Says:

    Interesting!

  2. Oma Says:

    think ARM??
    ARM programmers have been trained since 1988 to think of addresses as relative to the PC (R15), or any other register. For x86 people this is a gotcha; for ARM people, AMD64 much more closely resembles their paradigm.

  3. Allen Bauer Says:

    Oma,

    I’m taking it you didn’t actually read the paragraph in which that comment was made. I wasn’t referring to assembly code, but rather to the fact that we recommend you *not* use assembly code and use pure Pascal code in order to better be prepared for future platforms and CPU architectures.

  4. David Schwartz Says:

    Anybody who’s ever written any ROM-based code understands the value of IP-relative addressing (a.k.a. PIC, or Position Independent Coding). Most ARM CPUs are targeted at embedded applications, meaning the code resides in ROM.

    But there are plenty of x86 applications that are ROM-based as well. There’s nothing about the x86 architecture that prevents you from employing PIC solutions.

  5. Allen Bauer Says:

    @David,

    Of course. The Mac OSX targeting x86 compiler generates PIC. However, since the x86 architecture doesn’t allow for IP-relative addressing, another register, EBX, is burned for this purpose. x64, with IP-relative addressing, certainly makes this far easier for the code generator.

  6. Sean O'Connor Says:

    I am not sure if I will do any x86 assembly code in future. One big problem is the amount of noise generated by such power hungry systems. Also there is the cost factor. Admittedly something like Xeon Phi is very impressive. But I think you can do a lot with a sub 5 watt ARM processor operating at 1.6GHz and with the Neon SIMD instructions.
    My 7 year old Intel laptop has nearly had its day. I have a $50 Android PC stick connected to the HDMI port of my TV and it is much better for watching videos and listening to music. Using the AIDE app you can do Java, (NDK) C and inline ARM assembly language programming. 100 Android stick PC’s would yield 200 ARM cores @ 1.6GHz, 100 GB RAM and 400 GB flash memory for 500 watts power consumption. At a cost of $5000, about the same as 1 Xeon Phi.
