22 February 2010

System.String hidden UTF8 BOM

In .NET, a string (System.String) can start with a Byte Order Mark (BOM) character, U+FEFF, decoded from an initial UTF-8 BOM. It might not be seen in ordinary processing but is present when the string is converted to a character array or encoded into a byte array.

For example, a text file might be saved in UTF-8 format with the UTF-8 Byte Order Mark bytes at the start, i.e. 0xEF 0xBB 0xBF. You might receive this file in ASP.NET using a FileUpload control, or read it directly in a .NET Forms app in C#:

byte[] FileBytes = File.ReadAllBytes(path);
string content = Encoding.UTF8.GetString(FileBytes);

If the file contains these 7 bytes (in hex) EF BB BF 44 65 61 72 then content will superficially contain the single word "Dear", eg as seen in the debugger, and content.StartsWith("Dear") will return true.

However, content.Length is 5 and content.ToCharArray() will return an array with 5 elements, the first being the character 0xFEFF. Similarly, Encoding.UTF8.GetBytes(content) will return the same 7 bytes as were used in the first place.
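
The behaviour described above can be checked with a minimal sketch like the one below. The class and method names (HiddenBomDemo, Report) are made up for illustration, and the bytes are hard-coded rather than read from a file. The ordinal StartsWith call is an addition not in the original example: an ordinal comparison looks at the raw characters, so it does see the hidden 0xFEFF.

```csharp
using System;
using System.Text;

public static class HiddenBomDemo
{
    // The 7 bytes from the example above: the UTF-8 BOM followed by "Dear".
    public static readonly byte[] FileBytes = { 0xEF, 0xBB, 0xBF, 0x44, 0x65, 0x61, 0x72 };

    public static void Report()
    {
        string content = Encoding.UTF8.GetString(FileBytes);

        Console.WriteLine(content.Length);                                       // 5
        Console.WriteLine(((int)content[0]).ToString("X4"));                     // FEFF
        Console.WriteLine(Encoding.UTF8.GetBytes(content).Length);               // 7
        // Ordinal comparison does not ignore the leading U+FEFF character.
        Console.WriteLine(content.StartsWith("Dear", StringComparison.Ordinal)); // False
        // Trimming the character by hand leaves just the visible text.
        Console.WriteLine(content.TrimStart('\uFEFF'));                          // Dear
    }
}
```

Calling HiddenBomDemo.Report() from a console app's Main method prints the five values shown in the comments.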

(Note that this has nothing to do with the optional encoderShouldEmitUTF8Identifier parameter of the UTF8Encoding constructor.)

As this hidden extra character can be misleading, I have written the following snippet that detects the presence of the UTF8 Byte Order Mark preamble and ignores it if present:

byte[] FileBytes = File.ReadAllBytes(path);
int StartPoint = 0;
int Count = FileBytes.Length;
// Skip the 3-byte UTF-8 BOM (EF BB BF) if the file starts with it
if (Count >= 3 && FileBytes[0] == 0xEF && FileBytes[1] == 0xBB && FileBytes[2] == 0xBF)
{
    StartPoint += 3;
    Count -= 3;
}
string content = Encoding.UTF8.GetString(FileBytes, StartPoint, Count);

PS The code could no doubt be improved using Encoding.GetPreamble()
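
One way the Encoding.GetPreamble() idea might look is sketched below. The helper name DecodeWithoutPreamble and the class BomStripper are hypothetical, not part of .NET. It assumes the encoding passed in actually reports a preamble: Encoding.UTF8 does, whereas new UTF8Encoding(false) returns an empty preamble, in which case nothing is skipped.

```csharp
using System;
using System.Text;

public static class BomStripper
{
    // Hypothetical helper: decodes bytes with the given encoding,
    // skipping the encoding's preamble (BOM) if the bytes start with it.
    public static string DecodeWithoutPreamble(byte[] bytes, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        int start = 0;

        if (preamble.Length > 0 && bytes.Length >= preamble.Length)
        {
            bool match = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (bytes[i] != preamble[i])
                {
                    match = false;
                    break;
                }
            }
            if (match)
            {
                start = preamble.Length;
            }
        }

        return encoding.GetString(bytes, start, bytes.Length - start);
    }
}
```

For the 7-byte example, BomStripper.DecodeWithoutPreamble(FileBytes, Encoding.UTF8) should give a 4-character string. Because the preamble comes from the encoding itself, the same helper would also work for Encoding.Unicode and other encodings that define one.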


William said...

I find this behavior annoying. The Unicode specification states: "Where the data is typed, such as a field in a database, a BOM is unnecessary. Do not tag every string in a database or set of fields with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM)."

I also don't like the fact that the behavior of GetString()/GetBytes() is predicated on whether the argument contains a BOM. If the byte array has any BOM in the first 2-3 bytes, then the returned string starts with the garbage character 0xFEFF (actually this is the UTF-16 BigEndian BOM). If there is no BOM in the byte array, then the string is well-formed. Likewise, if you call GetBytes() with a string argument that starts with 0xFEFF, the resulting byte array will contain a BOM (regardless of the encoding used to convert to bytes, the BOM will always be correct). If the string has no garbage BOM, then there is no BOM in the byte array. Of course, then you have to prepend the BOM yourself. This behavior is a hidden mechanism that is not documented and more than a little annoying, as it can screw things up (like two BOMs in a file).

Don Stuber said...

If you are creating the Byte[] from a String, it is possible to ensure no BOM is generated by creating your own Encoding instance using

new System.Text.UTF8Encoding(false)

For example, when using a StreamWriter to produce Byte[], you can create the writer using

using (StreamWriter feedWriter = new StreamWriter(outputStream,
new System.Text.UTF8Encoding(false))) {
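
As a self-contained illustration of this suggestion, here is a rough sketch that writes text through a StreamWriter into a MemoryStream and returns the bytes produced. The class WriterDemo and the helper Encode are invented for the example, and the MemoryStream stands in for whatever outputStream is in the snippet above.

```csharp
using System;
using System.IO;
using System.Text;

public static class WriterDemo
{
    // Hypothetical helper: writes text through a StreamWriter using the
    // given encoding and returns every byte that reached the stream.
    public static byte[] Encode(string text, Encoding encoding)
    {
        using (MemoryStream outputStream = new MemoryStream())
        {
            using (StreamWriter writer = new StreamWriter(outputStream, encoding))
            {
                writer.Write(text);
            }
            // ToArray() is still valid after the writer has closed the stream.
            return outputStream.ToArray();
        }
    }
}
```

With this helper, WriterDemo.Encode("Dear", Encoding.UTF8) should return 7 bytes starting with EF BB BF, while WriterDemo.Encode("Dear", new UTF8Encoding(false)) should return just the 4 bytes of the text.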