Re: UTF-8 and UTF-16 issues

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Jun 19 2000 - 18:11:45 EDT


"OLeary, Sean (NJ)" wrote:
> UTF-16 is the 16-bit encoding of Unicode that includes the use of
> surrogates. This is essentially a fixed width encoding.

certainly not. utf-16 is variable-width: 1 or 2 16-bit units per character. the iuc discussion certainly did not describe "utf-16" as fixed-width; if anything, that was said of "ucs-2".
you can make the point, and this could have been said there, too, that for many characters you know they will use exactly one 16-bit unit, and you don't need to process surrogates for that. this is not to say the encoding is fixed-width; it is the same as how you deal with ascii characters in utf-8, without declaring utf-8 to be fixed-width.
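
a minimal sketch of the point above, in python (the helper names utf16_units and utf8_bytes are made up for illustration): it counts how many code units one character takes in each encoding.

```python
def utf16_units(ch: str) -> int:
    """number of 16-bit code units needed for one code point in utf-16."""
    # encode without a bom, then count 2-byte units
    return len(ch.encode("utf-16-be")) // 2


def utf8_bytes(ch: str) -> int:
    """number of bytes needed for one code point in utf-8."""
    return len(ch.encode("utf-8"))


# ascii, latin-1 range, a bmp symbol, and a supplementary character
for ch in ("A", "\u00e9", "\u20ac", "\U00010400"):
    print(f"U+{ord(ch):04X}: utf-16 units={utf16_units(ch)}, "
          f"utf-8 bytes={utf8_bytes(ch)}")
```

the first three characters take exactly one 16-bit unit each, the last one takes two, so code that only ever sees the first kind can skip surrogate handling without the encoding itself being fixed-width.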

> UTF-8
> Cons:
> * Most characters need to expanded into a UTF-16 form prior to table lookups
> for character properties or codepage mappings.

rather, i would expect an "expansion" into a 32-bit value, not into surrogate pairs. this is more practical (and needs to be done for utf-16, too).
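
as a rough sketch of what such an "expansion" could look like (utf16_next and utf8_next are hypothetical helper names, and the utf-8 side assumes well-formed input with no validation), both encodings end up at the same 32-bit value before any table lookup:

```python
def utf16_next(units, i):
    """return (code_point, next_index) starting at 16-bit unit i."""
    lead = units[i]
    if 0xD800 <= lead <= 0xDBFF and i + 1 < len(units):
        trail = units[i + 1]
        if 0xDC00 <= trail <= 0xDFFF:
            # combine the surrogate pair into one supplementary code point
            cp = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
            return cp, i + 2
    # single bmp unit (an unpaired surrogate is passed through unchanged)
    return lead, i + 1


def utf8_next(data, i):
    """return (code_point, next_index) starting at byte i.

    assumes well-formed utf-8; a real decoder must validate its input.
    """
    b = data[i]
    if b < 0x80:
        return b, i + 1
    if b < 0xE0:
        trail_count, cp = 1, b & 0x1F
    elif b < 0xF0:
        trail_count, cp = 2, b & 0x0F
    else:
        trail_count, cp = 3, b & 0x07
    for k in range(1, trail_count + 1):
        # each trail byte contributes its low 6 bits
        cp = (cp << 6) | (data[i + k] & 0x3F)
    return cp, i + trail_count + 1
```

either function hands the caller a plain integer, so property tables and codepage mappings can be indexed the same way regardless of which utf the text arrived in.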

> UTF-16
> Pros:
> * Fixed width of 16 bits makes most character processing easier.

no :-(

> Both encodings have to deal with issues like surrogates

utf-8 doesn't deal with surrogates. it deals with its own encoding, but surrogates are only used in utf-16.
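
a small python illustration, assuming python 3's strict utf-8 codec (encodes_as_utf8 is a made-up helper name): a supplementary character goes straight to a 4-byte sequence, and surrogate code points on their own are not accepted.

```python
# a supplementary character is encoded directly as one 4-byte sequence
supplementary = "\U00010400"
encoded = supplementary.encode("utf-8")
print(encoded.hex(" "))


def encodes_as_utf8(s: str) -> bool:
    """true if python's strict utf-8 codec accepts the string."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# the same character written as two separate surrogate code points
# is rejected by the strict codec rather than encoded
print(encodes_as_utf8("\ud801\udc00"))
```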

> 3. Interface to third party tools and APIs.

you should mention that everything microsoft (win32 on nt/2000/ce, com, office, sql server, ...) uses utf-16. java, xml dom, and qt/kde also use utf-16, like a number of other important apis. there is a similar, possibly longer, list for utf-8. to help people decide, it is good to tell them which big apis and apps use either utf.

> the BOM was intended to be used in 16-bit encodings like UTF-16, not in
> UTF-8.

it is still useful to use the signature byte sequences in all unicode encodings. the xml spec, for example, lists them as a help for the parser. if this is not generally recommended, then it should be. for utf-8 and scsu the signature just indicates the encoding, without also needing to indicate endianness. it still serves a purpose.
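
a sketch of how a reader might use the signatures, in python (the sniff function and the encoding labels are illustrative assumptions, not taken from any spec; python itself has no scsu codec, so that label is only a name):

```python
import codecs

# longer signatures first: the utf-32-le signature starts with the
# same two bytes as the utf-16-le one
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (b"\x0e\xfe\xff", "scsu"),           # 0E FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff(data: bytes):
    """return (encoding_label, signature_length), or (None, 0) if no signature."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0


label, skip = sniff(b"\xef\xbb\xbfhello")
print(label, skip)
```

for utf-16 the result also carries the byte order; for utf-8 and scsu it only names the encoding. a caller would skip the signature bytes and decode the rest with the detected encoding.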

> * UTF-8 encoding has no endian-ness issues, therefore the use of a leading
> BOM sequence in a UTF-8 file is discouraged. A possible exception to this is
> in a UTF-8 encoded file that is known to contain non-ASCII characters. A

this is the whole point of saying it is utf-8 and not us-ascii, right?

> * Code should only add a BOM at the beginning of a file if it is absolutely
> needed. the practice of adding a BOM to files has broken many applications
> that worked correctly without the BOM.

the opposite can be said, too. if you try to compile a utf-16 .c file with msvc, or want to open a unicode file in notepad, then you will need the bom.
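
for a writer that wants to emit the signature, a short python sketch (the sample text is made up; the codec names are python's own): the "utf-16" and "utf-8-sig" codecs add the signature, while "utf-16-le" and "utf-8" do not.

```python
import codecs

text = "int main(void) { return 0; }\n"

# "utf-16" prepends a bom in the platform's native byte order
with_bom = text.encode("utf-16")
# "utf-16-le" fixes the byte order and writes no bom
without_bom = text.encode("utf-16-le")
# "utf-8-sig" prepends the utf-8 signature EF BB BF
utf8_signed = text.encode("utf-8-sig")

print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(utf8_signed.startswith(codecs.BOM_UTF8))

# decoding with the same codecs strips the signature again
print(with_bom.decode("utf-16") == text)
print(utf8_signed.decode("utf-8-sig") == text)
```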

markus


