RE: Names for UTF-8 with and without BOM

From: Joseph Boyle (Boyle@siebel.com)
Date: Sat Nov 02 2002 - 12:43:28 EST

Next message: Joseph Boyle: "RE: Names for UTF-8 with and without BOM"

Previous message: Thomas Lotze: "Re: ct, fj and blackletter ligatures"
Maybe in reply to: Joseph Boyle: "Names for UTF-8 with and without BOM"
Next in thread: Michael (michka) Kaplan: "Re: Names for UTF-8 with and without BOM"
Reply: Michael (michka) Kaplan: "Re: Names for UTF-8 with and without BOM"

The main need I see is not to tell a consumer whether a leading U+FEFF is a
BOM or ZWNBSP, but:

* for producers (telling them whether to emit a BOM or not), and
* for normative use (a checker enforcing an encoding standard per file type,
defined in a table like the one below)

Type  Encoding  Comment
.txt  UTF-8BOM  We want plain text files to have BOM to distinguish
                from legacy codepage files
.xml  UTF-8N    Some XML processors may not cope with BOM
.htm  UTF-8     We want HTML to be UTF-8 but will not insist on BOM
.rc   Codepage  Unfortunately compiler insists on these being
                codepage.
.swt  ASCII     Nonlocalizable internal format, must be ASCII.
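
As a rough illustration of how such a table could drive both a producer and a
checker, here is a minimal Python sketch. The names POLICY, check and
encode_for are hypothetical, as is the three-way "bom" flag (required,
forbidden, or either). The .rc row is left out because "Codepage" does not
name a specific codepage, and emitting no BOM when the policy allows either
is simply a choice made for this sketch.

```python
import codecs

# Hypothetical policy table mirroring the rows above.
# "bom" is True (required), False (forbidden) or None (either is accepted).
POLICY = {
    ".txt": {"encoding": "utf-8", "bom": True},   # UTF-8BOM
    ".xml": {"encoding": "utf-8", "bom": False},  # UTF-8N
    ".htm": {"encoding": "utf-8", "bom": None},   # UTF-8, BOM optional
    ".swt": {"encoding": "ascii", "bom": False},  # ASCII only
}


def check(ext, data):
    """Return a list of policy violations for the raw bytes of one file."""
    rule = POLICY[ext]
    has_bom = data.startswith(codecs.BOM_UTF8)
    problems = []
    if rule["bom"] is True and not has_bom:
        problems.append("missing BOM")
    if rule["bom"] is False and has_bom:
        problems.append("unexpected BOM")
    # Validate the bytes after the signature against the declared encoding.
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    try:
        body.decode(rule["encoding"])
    except UnicodeDecodeError:
        problems.append("not valid " + rule["encoding"])
    return problems


def encode_for(ext, text):
    """Producer side: encode text, emitting a BOM only where it is required."""
    rule = POLICY[ext]
    payload = text.encode(rule["encoding"])
    return codecs.BOM_UTF8 + payload if rule["bom"] is True else payload
```

With this sketch, check(".xml", ...) flags a file that starts with EF BB BF,
while encode_for(".txt", ...) always writes those three bytes first.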

Please consider the proposal for separate charset names on that basis and
not on the basis of utility for telling a consumer whether U+FEFF is a BOM,
which I agree is by now a nonissue.

-----Original Message-----
From: Michael (michka) Kaplan [mailto:michka@trigeminal.com]
Sent: Saturday, November 02, 2002 4:18 AM
To: Mark Davis; Murray Sargent; Joseph Boyle
Cc: unicode@unicode.org
Subject: Re: Names for UTF-8 with and without BOM

From: "Mark Davis" <mark.davis@jtcsv.com>

> That is not sufficient. The first three bytes could represent a real content
> character, ZWNBSP or they could be a BOM. The label doesn't tell you.

There are several problems with this supposition -- most notably the fact
that there are cases that specifically claim this is not recommended and
that U+2060 is preferred?

> This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF
> represents a BOM, and is not part of the content. In the second case,
> it does *not* represent a BOM -- it represents a ZWNBSP, and must not
> be stripped. The difference here is that the encoding name tells you
> exactly what the situation is.
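
The distinction described in the quoted paragraph can be seen in Python's
standard codecs, shown here only as an illustration of one implementation's
behaviour, not as something the thread itself relies on. The variable names
are made up for the example.

```python
# The same four bytes decoded under two different encoding names.
data = b"\xfe\xff\x00A"

# "utf-16": the leading FE FF is treated as a BOM and removed.
as_utf16 = data.decode("utf-16")

# "utf-16-be": no BOM is expected, so FE FF stays in the text as U+FEFF.
as_utf16be = data.decode("utf-16-be")

# Python has a comparable pair for UTF-8: "utf-8-sig" strips a leading
# EF BB BF, while plain "utf-8" keeps it as U+FEFF.
utf8_data = b"\xef\xbb\xbfA"
as_utf8_sig = utf8_data.decode("utf-8-sig")
as_utf8 = utf8_data.decode("utf-8")
```

Here as_utf16 and as_utf8_sig hold just "A", while as_utf16be and as_utf8 hold
U+FEFF followed by "A".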

I do not see this as a realistic scenario. I would argue that if the BOM
matches the encoding scheme, perhaps this was an intentional effort to make
sure that applications which may not understand the higher level protocol
can also see what the encoding scheme is.

But even if we assume that someone has gone to the trouble of calling
something UTF-16BE and has 0xFE 0xFF at the beginning of the file, what kind
of content *is* such a code point that this is even worth calling out as a
special case?

If the goal is clear and unambiguous text, then the best way would be to
simplify ALL of this. It was previously decided to always call it a BOM, so
why not stick with that?

MichKa


This archive was generated by hypermail 2.1.5 : Sat Nov 02 2002 - 13:23:11 EST