Re: UTF-8N?

Date: Tue Jun 20 2000 - 16:50:01 EDT

JC> At one point, I thought that with Unicode there would be only one
JC>cross-platform encoding... Right now, it looks like there will be
JC>at least 8 Unicode encodings,

JC>Substantively, the only real encodings are UTF-8 and UTF-16.

We are being a little sloppy in terminology, which isn't a problem if we
all continue to understand one another, but can lead to problems as a
discussion leads off in different directions. In the terms of UTR#17, UTF-8
and UTF-16 are, of course, two *character encoding forms* that equally
represent the Unicode coded character set; UTF-8 has only one encoding
scheme, while UTF-16 has three: BE and LE, each explicitly marked in a
higher protocol, and unmarked, which may be either BE or LE.

There is an important sense in which we do have only one cross-platform
"encoding": "encoding" is often used in the sense of a coded character set
(in cases where the encoding form and encoding scheme simply fall through
under an identity transform), and with Unicode we do have only one
cross-platform coded character set. But there is an equally important sense
in which we have multiple "encodings", if by that we mean encoding forms or
encoding schemes. Really, what we want to be counting is encoding schemes
since that is what you need to write code to parse.
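
To make the point about parsing concrete, here is a minimal Python sketch of what a reader for the three UTF-16 encoding schemes might look like. The function name decode_utf16 and its declared parameter are hypothetical, and falling back to big-endian when there is neither a declaration nor a BOM is an assumption of this sketch, not something established above.

```python
import codecs
from typing import Optional


def decode_utf16(data: bytes, declared: Optional[str] = None) -> str:
    """Decode UTF-16 bytes under one of the three encoding schemes.

    declared is "BE" or "LE" when a higher protocol has marked the byte
    order, or None for the unmarked scheme.
    """
    if declared == "BE":
        return data.decode("utf-16-be")
    if declared == "LE":
        return data.decode("utf-16-le")

    # Unmarked scheme: look for an initial BOM to learn the byte order.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")

    # Assumption of this sketch: no declaration and no BOM means big-endian.
    return data.decode("utf-16-be")
```

The same character "A" arrives as different byte sequences under each scheme, which is why each one counts as a separate case for a parser: decode_utf16(b"\x00A", "BE"), decode_utf16(b"A\x00", "LE"), and decode_utf16(b"\xff\xfeA\x00") all give the same string.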

None of the encoding schemes make any requirement regarding use of the BOM,
but it is assumed for UTF-16 (and would also be for UTF-32) where
endianness isn't specified by a higher protocol. Obviously, the presence or
absence of the BOM can also be a factor in parsing other encoding schemes,
and we've heard mention that initial implementations, especially for UTF-8,
are uneven in this regard. To my knowledge, neither Unicode 3 nor any of
the UTRs states any requirement regarding the use of the BOM for
UTF-8. Unless implementers start making their code able to deal with UTF-8
regardless of whether or not an initial BOM is used, the standard
should make a statement that implementers can conform to. Without rules, users
will generate UTF-8 files that both do and don't start with a BOM. If there
is software out there that's going to blow up in one or the other case,
that's not a satisfactory state of affairs.
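
As an illustration of code that deals with UTF-8 regardless of whether an initial BOM is used, here is a minimal Python sketch. The function name decode_utf8_lenient is hypothetical; the only thing assumed about the input is that an initial BOM, if present, is the three bytes EF BB BF.

```python
import codecs


def decode_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8 bytes, accepting input with or without an initial BOM."""
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
without_bom = "hello".encode("utf-8")

# Both files yield the same text, so neither case blows up.
same = decode_utf8_lenient(with_bom) == decode_utf8_lenient(without_bom)
```

Python's built-in "utf-8-sig" codec behaves the same way on decoding, so data.decode("utf-8-sig") could be used instead of the explicit check.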

Peter Constable
