On Thu, 6 Feb 1997, Murray Sargent wrote:
> Thanks for pointing out that The Unicode Standard very clearly spells
> out that UCS-2 is big endian; don't know how I missed it!
> I don't want to get
> into a senseless discussion as to which order is natural
> (I do prefer little endian);
Well, in German we have an interesting saying:
Man gewöhnt sich an allem, sogar am Dativ.
(Roughly: "One gets used to everything, even to the dative.")
> both orders exist so we need to deal with them.
Exactly. The people who worked on Unicode knew this from the start,
and they specified a good solution.
> Re Win32 apps maybe not handling big-endian plain-text files, I agree
> it's very parochial, but it's probably reality.
It's a short-term reality. MS has a good record of fixing bugs in their
X.0 versions, and this is a place for such a fix.
> One thing I've found
> again and again in software is that widespread conventions can just
> happen, regardless of the good intentions of standards committees and
> company architects.
There is an American (or maybe it is British) "saying" for this occasion:
I hope MS can and will do better than that, and can help support and promote
the standards they helped develop, instead of producing applications
that might lead to strong Unicode criticism. If some of the Unicode
enemies, e.g. in Japan, find out what is currently happening with
Word and such (i.e. that even WITH a BOM, Word cannot read Unicode
files if they are the RIGHT way around), they will be very quick
to spread rumors, which for once will even be true.
> A better choice than plain text is probably HTML
> with UTF-8, which avoids the issue altogether. But HTML is in a state
> of rapid evolution...
Plain text or HTML and UCS-2 or UTF-8 should be treated orthogonally
anyway. There is not much reason to link HTML with UTF-8 and plain text
with UCS-2. And both HTML and plain text can be in many other encodings.
> In any event, I think it would be crazy to have a plain-text Unicode
> file that didn't start with a BOM of some kind.
The BOM is extremely valuable as a magic number. So I definitely
agree with you. But there is a cherished Internet practice, which
can help solve the problem here:
- Be restrictive in what you produce
  This means that you only produce BIG-endian with
  a BOM, because this can be read by those
  applications that are strictly conformant (strict
  conformance does not include handling the BOM) as
  well as by those that recognize the BOM.
- Be liberal in what you accept
  This means that you accept BIG-endian without a BOM,
  and both byte orders with a BOM.
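
The two rules above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function names
decode_ucs2 and encode_ucs2 are made up for this example, and Python's
"utf-16-be" / "utf-16-le" codecs are used as a stand-in for UCS-2
(they give the same result for text without surrogate pairs).

```python
import codecs


def decode_ucs2(data: bytes) -> str:
    """Liberal reader: either byte order with a BOM, big-endian without one."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to the default order, which is big-endian.
    return data.decode("utf-16-be")


def encode_ucs2(text: str) -> bytes:
    """Restrictive writer: always big-endian, always with a leading BOM."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
```

A reader written this way accepts what the restrictive writer produces,
and also little-endian files as long as they announce themselves with a
BOM.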
What the Office 97 applications currently do is very far from
> Else you'd probably be
> better off interpreting it as 8859-1, not Unicode. Remember that
> backward compatibility is a major issue and 8859-1 (or 1252) is a pretty
> good guess to make for a random plain-text file.
There are cases where you just read in a file without giving the
user a chance to tell you what it is. In that case, 8859-1 might
be the best guess, but it is not a very good one either. There
are dozens of other encodings, and all of them use the .txt extension.
The other way to do it is to give the user a chance to tell the
system what encoding the text file is in. You have to be careful
to not confuse the general users with strange numbers such as
8859 or 1252, but it is a much more general solution.
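
A rough sketch of how the two approaches could be combined, in Python.
Everything named here is an assumption made for illustration: the
read_text function, the FRIENDLY_NAMES table, and the particular labels
and encodings in it are hypothetical and not taken from any actual
application.

```python
import codecs
from typing import Optional

# Hypothetical menu entries: readable labels instead of numbers
# such as 8859 or 1252. The selection is illustrative only.
FRIENDLY_NAMES = {
    "Western European": "iso-8859-1",
    "Western European (Windows)": "cp1252",
    "Central European": "iso-8859-2",
    "Japanese": "shift_jis",
}


def read_text(data: bytes, user_choice: Optional[str] = None) -> str:
    """Decode a plain-text file, preferring what the user said it is."""
    if user_choice is not None:
        # The user told us the encoding: trust that over any guess.
        return data.decode(FRIENDLY_NAMES[user_choice])
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # The "utf-16" codec consumes the BOM and picks the byte order.
        return data.decode("utf-16")
    # Last resort: 8859-1 never fails to decode, but may well be wrong.
    return data.decode("iso-8859-1")
```

The guess is only used when nothing better is known, and the user never
has to see an encoding number.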
Anyway, arguments such as "would be crazy not to start with a BOM"
or "better off interpreting it as 8859-1" are correct, but in your
mail, they just serve to distract from the main issue:
What the applications mentioned do is clearly against what the
standards say, severely harms interoperability, and is
detrimental to the expectations that a lot of people have put into
What I, and a lot of others on this list and elsewhere, as very
concerned Unicode citizens, are expecting from you and others at MS,
as (hopefully) good Unicode citizens, is not further arguments
about why the blunders you made may not be as bad as they actually
are, but a clear commitment and fast ACTION to get things right.
Many thanks in advance, Martin.