Re: UTF-8N?

From: Peter_Constable@sil.org
Date: Wed Jun 21 2000 - 11:32:53 EDT


On 06/20/2000 08:20:53 PM <dewell@compuserve.com> wrote:

[snip]

>It may be useful shorthand to define the term "UTF-8N" to refer to UTF-8 text
>that does not begin with a BOM, and reserve the term "UTF-8" for text that
>*does* begin with a BOM,

"UTF-8" currently does not, and so should not, be used to indicate the
definite presence of a BOM.

>but the fact is that both are really UTF-8, and people
>will use the term "UTF-8" to refer to both.

And rightly so.

> Adding (let alone registering) a
>new charset name to express this relatively minor difference will make it look
>(as it does to Juliusz) like there are more Unicode encoding forms than there
>really are.

We don't want distinct encoding schemes (schemes, I think, not forms) for
the UTF-8 encoding form that are distinguished by the presence or the
absence of a BOM. Presence or absence of a BOM doesn't constitute a
difference in encoding scheme for UTF-8, or even for UTF-16, for that
matter, because it is something separate from the character stream itself.
UTF-8 files both with and without a BOM serialize the character
representations into bytes (octets) in exactly the same way. That's the
basis for distinguishing between encoding schemes, and since there isn't a
difference, there is only one encoding scheme involved in both cases.
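
A minimal sketch of this point in Python, assuming the standard library's
"utf-8" and "utf-8-sig" codecs (the second differs only in writing, and
skipping, a leading BOM). The helper name strip_utf8_bom and the sample
string are hypothetical, chosen just for illustration:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample text with non-ASCII characters (an assumption for the example).
text = "na\u00efve \u65e5\u672c\u8a9e"

without_bom = text.encode("utf-8")      # plain UTF-8, no signature
with_bom = text.encode("utf-8-sig")     # same serialization, BOM prepended

# The only difference is the three-octet signature at the front.
assert with_bom == codecs.BOM_UTF8 + without_bom

# Once the signature is set aside, the octets are identical.
assert strip_utf8_bom(with_bom) == without_bom

# Both byte sequences decode to the same characters.
assert with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig") == text
```

The characters themselves are turned into octets identically in both cases;
the BOM is just three extra octets in front of an otherwise unchanged stream.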

Peter Constable