Last week a (large) customer sent me an email indicating he was experiencing issues when compiling the same project on different machines. It turned out the difference was in the source code file format, and the root cause was a unit saved as UTF-8 but without a BOM. The reason? One of the developers is using Visual Studio Code… and the solution is either changing that or using a compiler flag. But before I get to the solution, let me show you the problem with a very simple test case.
Delphi and Source Files Encoding
To test the scenario, we (it was one of the architects who came up with the simple scenario) created a simple VCL application with code like the following:
```delphi
var
  strEuro: String = 'Euro=€';

procedure TForm16.Button1Click(Sender: TObject);
begin
  Button1.Caption := 'Hello ' + strEuro;
end;
```
By default, the editor in the Delphi IDE uses ANSI encoding, and everything works fine. You can use the editor context menu, pick the File Format submenu, select UTF8 (I know, missing -), and everything keeps working as expected. Notice, though, that the length in bytes of the string in the source file changes, as you need multiple bytes to represent the Euro symbol in UTF-8.
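To make the size difference concrete, here is a minimal console sketch that encodes the same text both ways and prints the byte counts. It assumes that "ANSI" means Windows code page 1252 (your system code page may differ), and the program and variable names are just illustrative.

```delphi
program EuroBytes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  // #$20AC is the Euro sign, written as a character code so that this
  // listing does not depend on how its own source file is encoded.
  Sample: string = 'Euro=' + #$20AC;
  Ansi: TEncoding;
begin
  // Assumption: code page 1252 stands in for the machine's ANSI code page.
  Ansi := TEncoding.GetEncoding(1252);
  try
    // One byte per character, Euro sign included: 6 bytes.
    Writeln('ANSI (1252): ', Length(Ansi.GetBytes(Sample)), ' bytes');
    // The Euro sign takes three bytes (E2 82 AC) in UTF-8: 8 bytes.
    Writeln('UTF-8:       ', Length(TEncoding.UTF8.GetBytes(Sample)), ' bytes');
  finally
    Ansi.Free; // encodings returned by GetEncoding are owned by the caller
  end;
end.
```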
Enter Visual Studio Code
While many modern editors use UTF-8 as their standard file format (a nice option we are also considering as the default in RAD Studio), Visual Studio Code (or VSC) is one of the few that prefers using UTF-8 with no BOM. The BOM (Byte Order Mark) is a sequence of bytes (3 bytes for UTF-8: EF BB BF) at the start of the file that makes it simple for an editor or a tool processing the file (like a compiler) to figure out the encoding. Without it, given that UTF-8 and ANSI files look identical as long as they contain only ASCII characters, you might have to read an entire file to see if it is UTF-8 or ANSI encoded.
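As a rough illustration of why the BOM makes detection cheap, the sketch below reads only the first three bytes of a file and compares them with the UTF-8 preamble reported by `TEncoding`. The program name and the `HasUtf8Bom` helper are hypothetical, not part of any RAD Studio tooling.

```delphi
program CheckBom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Returns True when the file starts with the UTF-8 BOM (EF BB BF).
function HasUtf8Bom(const FileName: string): Boolean;
var
  Stream: TFileStream;
  Bom, Head: TBytes;
  Count: Integer;
begin
  Bom := TEncoding.UTF8.GetPreamble; // the three bytes EF BB BF
  SetLength(Head, Length(Bom));
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Only the start of the file is read, never the whole content.
    Count := Stream.Read(Head[0], Length(Head));
  finally
    Stream.Free;
  end;
  Result := (Count = Length(Bom)) and
    CompareMem(@Head[0], @Bom[0], Length(Bom));
end;

begin
  if ParamCount < 1 then
    Writeln('Usage: CheckBom <file>')
  else if HasUtf8Bom(ParamStr(1)) then
    Writeln('UTF-8 with BOM')
  else
    Writeln('No UTF-8 BOM: could be UTF-8 without BOM, or ANSI');
end.
```

Note that the last branch cannot tell the two remaining cases apart, which is exactly the ambiguity described above.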
When you open a Delphi unit in VSC, it keeps the formatting. If it is UTF-8 with BOM, it remains as such. But for a new file the default is UTF-8 with no BOM. And any file can be saved with that format, as you can see in the status bar options:
Now if you do this and compile the application again, you’ll get a garbled caption for your button, rather than the Euro symbol.
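The effect can be reproduced outside the compiler. The following sketch assumes the BOM-less file is being read as code page 1252 (an assumption about the machine's ANSI code page): it encodes the Euro sign as UTF-8 and then decodes those bytes as ANSI, which turns one character into three unrelated ones.

```delphi
program Mojibake;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Euro: string = #$20AC; // the Euro sign, given as a character code
  Ansi: TEncoding;
  Misread: string;
  Ch: Char;
begin
  // Assumption: code page 1252 stands in for the machine's ANSI code page.
  Ansi := TEncoding.GetEncoding(1252);
  try
    // UTF-8 bytes (what the BOM-less file contains) decoded as ANSI.
    Misread := Ansi.GetString(TEncoding.UTF8.GetBytes(Euro));
  finally
    Ansi.Free;
  end;

  Writeln('Characters after misreading: ', Length(Misread)); // 3
  for Ch in Misread do
    Write(IntToHex(Integer(Ord(Ch)), 4), ' '); // 00E2 201A 00AC
  Writeln;
end.
```

The three code points correspond to the bytes E2, 82 and AC read one at a time, which is why the caption shows three odd symbols in place of the Euro sign.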
Use a BOM or a Compiler Flag
Once you realize the issue, the solution is not that complex to achieve.
On one side, you can make sure your editors save the UTF-8 files with a BOM. The compiler sees it, and all works fine.
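If some units have already been saved without a BOM, one possible way to fix them is a small helper along these lines. This is only a sketch: `EnsureUtf8Bom` is a hypothetical name, and it assumes the file content really is UTF-8. An ANSI file with non-ASCII characters must be converted rather than just marked, so try it on a copy first.

```delphi
program AddBom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

// Prepends the UTF-8 BOM when it is missing.
// Assumes the file is already valid UTF-8.
procedure EnsureUtf8Bom(const FileName: string);
var
  Bom, Content, Output: TBytes;
begin
  Bom := TEncoding.UTF8.GetPreamble;
  Content := TFile.ReadAllBytes(FileName);

  if (Length(Content) >= Length(Bom)) and
     CompareMem(@Content[0], @Bom[0], Length(Bom)) then
    Exit; // BOM already present, nothing to do

  // Build BOM + original content and write it back.
  SetLength(Output, Length(Bom) + Length(Content));
  Move(Bom[0], Output[0], Length(Bom));
  if Length(Content) > 0 then
    Move(Content[0], Output[Length(Bom)], Length(Content));
  TFile.WriteAllBytes(FileName, Output);
end;

begin
  if ParamCount < 1 then
    Writeln('Usage: AddBom <file>')
  else
    EnsureUtf8Bom(ParamStr(1));
end.
```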
On the other side, you can tell the compiler to consider the source code file as UTF-8, regardless of the BOM. You can do this using the compiler flag:
```
--codepage:65001
```
or setting the codepage in the compiler options to that value:
Good Workaround, Future Solution
I hope this will help you avoid a similar issue. We are researching making the codepage detection automatic, so that the compiler could automatically pick up the right option. But this is not something we plan to address shortly… as the R&D team is busy with the coming release.
Dear Marco,
thank you so much for mentioning this issue and documenting workarounds! We actually ran into exactly the same problem quickly 🙂
The only thing I’d like to add is that “Visual Studio Code (or VSC) is one of the few that prefers using UTF-8 with no BOM” might be put a little harshly. There are arguments to be made for omitting the BOM to increase compatibility with Unicode-unaware software and it seems like UTF-8 without BOM has evolved to be the de-facto standard encoding for basically everything.
It’s completely understandable that the Delphi compiler acts the way it acts for backward-compatibility reasons, but I think it might be more appropriate to call Delphi’s behavior a bit unexpected or quirky in a now UTF-8-dominated world.
The [Wikipedia article](https://www.wikiwand.com/en/Byte_order_mark#/UTF-8) on byte order mark sums it up quite nicely:
>>The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work. The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”
Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-ASCII bytes in string literals but not at the start of the file.<<
Best regards,
Marcus