The Delphi Compiler and UTF-8 Encoded Source Code Files With no BOM

Last week a (large) customer sent me an email indicating he was experiencing issues when compiling the same project on different machines. It turned out the difference was in the source code file format, and the root cause was a unit saved as UTF-8 but without a BOM. The reason? One of the developers is using Visual Studio Code… and the solution is either changing that or using a compiler flag. But before I get to the solution, let me show you the problem with a very simple test case.

Delphi and Source Files Encoding

To test the scenario, we (it was one of the architects who came up with the simple scenario) created a simple VCL application with code like the following:

    var
      strEuro : String = 'Euro=€';

    procedure TForm16.Button1Click(Sender: TObject);
    begin
      Button1.Caption := 'Hello ' + strEuro;
    end;

By default, the editor in the Delphi IDE uses ANSI encoding, and everything works fine. You can use the editor context menu, pick the File Format submenu, select UTF8 (I know, the hyphen is missing), and everything keeps working as expected. Notice, though, that the length in bytes of the string changes, as you need multiple bytes to represent the Euro symbol in UTF-8.
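To make the byte counts concrete, here is a minimal console sketch (not part of the original test project). It assumes that "ANSI" means Windows code page 1252 on the machine in question; the program and variable names are made up for illustration.

```pascal
program EuroByteCount;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Sample: string;
  Ansi: TEncoding;
begin
  // The Euro sign is written as a character code (#$20AC) so that this
  // demo does not depend on the encoding of its own source file.
  Sample := 'Euro='#$20AC;

  // Assumption: the ANSI code page is Windows-1252.
  // GetEncoding returns a new instance that the caller must free.
  Ansi := TEncoding.GetEncoding(1252);
  try
    // 5 ASCII bytes + 1 byte ($80) for the Euro sign = 6
    Writeln('ANSI bytes:  ', Ansi.GetByteCount(Sample));
    // 5 ASCII bytes + 3 bytes (E2 82 AC) for the Euro sign = 8
    Writeln('UTF-8 bytes: ', TEncoding.UTF8.GetByteCount(Sample));
  finally
    Ansi.Free;
  end;
end.
```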

Enter Visual Studio Code

While many modern editors use UTF-8 as their standard file format (a nice option we are considering as the default in RAD Studio as well), Visual Studio Code (or VSC) is one of the few that prefers using UTF-8 with no BOM. The BOM (Byte Order Mark) is a sequence of bytes (3 bytes for UTF-8: EF BB BF) at the start of the file that makes it simple for an editor or a tool processing the file (like a compiler) to figure out the internal format. Without a BOM, a UTF-8 file and an ANSI file are byte-for-byte identical as long as they contain only ASCII characters, so you might have to read an entire file to see if it is UTF-8 or ANSI encoded.
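Checking for the BOM is cheap, because only the first three bytes of the file matter. The following is a small sketch of such a check; `HasUtf8Bom` and the program name are hypothetical names chosen for this example, not an RTL or compiler API.

```pascal
program CheckBom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Returns True if the file starts with the UTF-8 BOM (EF BB BF).
function HasUtf8Bom(const FileName: string): Boolean;
var
  Stream: TFileStream;
  Head: array[0..2] of Byte;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Read returns the number of bytes actually read; files shorter
    // than three bytes cannot contain the BOM.
    Result := (Stream.Read(Head, SizeOf(Head)) = SizeOf(Head)) and
      (Head[0] = $EF) and (Head[1] = $BB) and (Head[2] = $BF);
  finally
    Stream.Free;
  end;
end;

begin
  if ParamCount < 1 then
    Writeln('Usage: CheckBom <file>')
  else if HasUtf8Bom(ParamStr(1)) then
    Writeln('UTF-8 BOM found')
  else
    Writeln('No UTF-8 BOM');
end.
```

Running it over the units of a project is one way to spot files that were saved without a BOM by another editor.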

When you open a Delphi unit in VSC, it keeps the file's encoding. If it is UTF-8 with BOM, it remains as such. But for a new file the default is UTF-8 with no BOM, and any file can be saved in that format using the encoding options in the status bar.

Now if you do this and compile the application again, you’ll get a garbled caption for your button, rather than the Euro symbol: the compiler reads the three UTF-8 bytes of € as three separate ANSI characters.
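The effect can be reproduced outside the compiler by encoding the string as UTF-8 and decoding the bytes as ANSI. This is only an illustration of the misreading, not a description of the compiler's internals, and it again assumes Windows-1252 as the ANSI code page.

```pascal
program Mojibake;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Utf8Bytes: TBytes;
  Ansi: TEncoding;
  Misread: string;
  C: Char;
begin
  // What an editor writes to disk for 'Euro=€' when saving as UTF-8.
  Utf8Bytes := TEncoding.UTF8.GetBytes('Euro='#$20AC);

  // What a reader gets when it assumes the bytes are ANSI
  // (assumption: code page 1252).
  Ansi := TEncoding.GetEncoding(1252);
  try
    Misread := Ansi.GetString(Utf8Bytes);
  finally
    Ansi.Free;
  end;

  // The 6-character string now has 8 characters: the Euro sign became
  // three characters (U+00E2, U+201A, U+00AC).
  Writeln('Characters after misreading: ', Length(Misread));
  for C in Misread do
    Write(IntToHex(Integer(Ord(C)), 4), ' ');
  Writeln;
end.
```

The code points are printed in hexadecimal rather than as text, so the result does not depend on the console's own code page.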

Use a BOM or a Compiler Flag

Once you realize the issue, the solution is not that complex to achieve.

On one side, you can make sure your editors save the UTF-8 files with a BOM. The compiler sees it, and all works fine.
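For files that were already saved without a BOM, one option is to prepend it in a small helper tool. The sketch below is a hypothetical utility, and it assumes that the file really is UTF-8; it does not validate the content, and running it on an ANSI file would mislabel that file.

```pascal
program AddBom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.IOUtils;

// Prepends the UTF-8 BOM to a file unless it is already there.
// Assumption: the file content is valid UTF-8.
procedure AddUtf8Bom(const FileName: string);
var
  Content, Bom: TBytes;
  Stream: TFileStream;
begin
  Content := TFile.ReadAllBytes(FileName);
  Bom := TEncoding.UTF8.GetPreamble; // EF BB BF

  // Nothing to do if the file already starts with the BOM.
  if (Length(Content) >= Length(Bom)) and
     CompareMem(@Content[0], @Bom[0], Length(Bom)) then
    Exit;

  // Rewrite the file: BOM first, then the original bytes unchanged.
  Stream := TFileStream.Create(FileName, fmCreate);
  try
    Stream.WriteBuffer(Bom[0], Length(Bom));
    if Length(Content) > 0 then
      Stream.WriteBuffer(Content[0], Length(Content));
  finally
    Stream.Free;
  end;
end;

begin
  if ParamCount < 1 then
    Writeln('Usage: AddBom <file>')
  else
    AddUtf8Bom(ParamStr(1));
end.
```

In VS Code itself, the `files.encoding` setting accepts the value `utf8bom`, which may be the simpler route for new files.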

On the other side, you can tell the compiler to consider the source code file as UTF-8, regardless of the BOM. You can do this using the compiler flag:

--codepage:65001

or by setting the codepage in the project's compiler options to that value.

Good Workaround, Future Solution

I hope this will help you avoid a similar issue. We are researching making the codepage detection automatic, so that the compiler could automatically pick up the right option. But this is not something we plan to address shortly… as the R&D team is busy with the coming release.
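As a rough idea of what content-based detection can look like, here is a simplified sketch that scans the whole buffer and checks whether every non-ASCII byte is part of a well-formed UTF-8 sequence. It is purely an illustration and says nothing about how the compiler might eventually implement this; `LooksLikeUtf8` is a made-up name, and the check is structural only (it does not reject overlong forms or surrogates).

```pascal
program GuessEncoding;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Returns True if the bytes form structurally valid UTF-8.
// Pure ASCII input also returns True, since it is valid in both encodings.
function LooksLikeUtf8(const Bytes: TBytes): Boolean;
var
  I, Follow: Integer;
  B: Byte;
begin
  I := 0;
  while I < Length(Bytes) do
  begin
    B := Bytes[I];
    // The lead byte tells how many continuation bytes must follow.
    if B < $80 then
      Follow := 0
    else if (B and $E0) = $C0 then
      Follow := 1
    else if (B and $F0) = $E0 then
      Follow := 2
    else if (B and $F8) = $F0 then
      Follow := 3
    else
      Exit(False); // stray continuation byte or invalid lead byte
    Inc(I);
    while Follow > 0 do
    begin
      // Each continuation byte must have the bit pattern 10xxxxxx.
      if (I >= Length(Bytes)) or ((Bytes[I] and $C0) <> $80) then
        Exit(False);
      Inc(I);
      Dec(Follow);
    end;
  end;
  Result := True;
end;

begin
  // 'Euro=' followed by the Euro sign in UTF-8 (E2 82 AC): accepted.
  Writeln(LooksLikeUtf8(TBytes.Create($45, $75, $72, $6F, $3D, $E2, $82, $AC)));
  // The same text in Windows-1252, where the Euro sign is $80: rejected.
  Writeln(LooksLikeUtf8(TBytes.Create($45, $75, $72, $6F, $3D, $80)));
end.
```

A heuristic like this can still misjudge an ANSI file whose high bytes happen to form valid UTF-8 sequences, which is one reason a BOM or an explicit codepage setting remains the more predictable choice.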

1 Comment

Marcus Mangelsdorf

February 11, 2022 at 3:32 am

Dear Marco,

thank you so much for mentioning this issue and documenting workarounds! We actually ran into exactly the same problem quickly 🙂

The only thing I’d like to add is that “Visual Studio Code (or VSC) is one of the few that prefers using UTF-8 with no BOM” might be put a little harshly. There are arguments to be made for omitting the BOM to increase compatibility with unicode-unaware software and it seems like UTF-8 without BOM has evolved to be the de-facto standard encoding for basically everything.

It’s completely understandable that the Delphi compiler acts the way it acts for backward-compatibility reasons, but I think it might be more appropriate to call Delphi’s behavior a bit unexpected or quirky in a now UTF-8-dominated world.

The [Wikipedia article](https://www.wikiwand.com/en/Byte_order_mark#/UTF-8) on byte order mark sums it up quite nicely:
>>The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work. The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”

Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-ASCII bytes in string literals but not at the start of the file.<<

Best regards,
Marcus


About author

Marco Cantu
