RE: Translated IUC10 Web pages: Experimental Results

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Mon Feb 10 1997 - 06:57:27 EST

Next message: Misha Wolf: "Web browsers and the new language code for Hebrew"
Previous message: Martin J. Duerst: "Re: Word97: Preliminary Experimental Results"
In reply to: Murray Sargent: "RE: Translated IUC10 Web pages: Experimental Results"
Next in thread: Murray Sargent: "RE: Translated IUC10 Web pages: Experimental Results"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

On Thu, 6 Feb 1997, Murray Sargent wrote:

> Thanks for pointing out that The Unicode Standard very clearly spells
> out that UCS-2 is big endian; don't know how I missed it!
>
> I don't want to get
> into a senseless discussion as to which order is natural

Me neither!

> (I do prefer little endian);

Well, in German we have an interesting saying:

Man gewöhnt sich an allem, sogar am Dativ.
(Roughly: "One gets used to everything, even the dative.")

>both orders exist so we need to deal with them.

Exactly. The people who worked on Unicode knew this from the start,
and they specified a good solution.

> Re Win32 apps maybe not handling big-endian plain-text files, I agree
> it's very parochial, but it's probably reality.

It's short-term reality. MS has a good record of fixing bugs in their
X.0 versions, and this is a place for such a fix.

> One thing I've found
> again and again in software is that widespread conventions can just
> happen, regardless of the good intentions of standards committees and
> company architects.

There is an American (or maybe it is British) "saying" for this occasion:

Shit happens.

I hope MS can and will do better than that, and can help support and promote
the standards they helped develop, instead of producing applications
that might lead to strong criticism of Unicode. If some of the Unicode
enemies, e.g. in Japan, find out what is currently happening with
Word and such (i.e. that even WITH a BOM, Word cannot read Unicode
files if they are the RIGHT way around), they will be very quick
to spread rumors, which for once will even be true.

> A better choice than plain text is probably HTML
> with UTF-8, which avoids the issue altogether. But HTML is in a state
> of rapid evolution...

The choice between plain text and HTML and the choice between UCS-2 and
UTF-8 should be treated orthogonally anyway. There is not much reason to
link HTML with UTF-8 and plain text with UCS-2. And both HTML and plain
text can be in many other encodings.

> In any event, I think it would be crazy to have a plain-text Unicode
> file that didn't start with a BOM of some kind.

The BOM is extremely valuable as a magic number, so I definitely
agree with you. But there is a cherished Internet practice which
can help solve the problem here:

- Be restrictive in what you produce
        This means that you only produce BIG-endian with
        a BOM, because this can be read both by applications
        that are strictly conformant (strict conformance does
        not include recognizing the BOM) and by those that
        recognize the BOM.
- Be liberal in what you accept
        This means that you accept BIG-endian without a BOM,
        and both byte orders with a BOM.

What the Office 97 applications currently do is very far from
the above.
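
As a rough illustration, the produce/accept rule could be sketched in
Python as follows. The function names encode_ucs2 and decode_ucs2 are
made up for this example, and Python's UTF-16 codecs stand in for UCS-2,
which is an assumption: they behave the same for characters in the
Basic Multilingual Plane but also handle surrogate pairs.

```python
import codecs


def encode_ucs2(text: str) -> bytes:
    """Restrictive producer: always big-endian, always with a BOM (FE FF)."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def decode_ucs2(data: bytes) -> str:
    """Liberal consumer: either byte order with a BOM, big-endian without."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to big-endian, the default byte order.
    return data.decode("utf-16-be")
```

A reader written this way accepts little-endian files that carry a BOM,
while a writer written this way never produces them.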

> Else you'd probably be
> better off interpreting it as 8859-1, not Unicode. Remember that
> backward compatibility is a major issue and 8859-1 (or 1252) is a pretty
> good guess to make for a random plain-text file.

There are cases where you just read in a file without giving the
user a chance to tell you what it is. In that case, 8859-1 might
be the best guess, but it is not a very good one either. There
are dozens of other encodings, and all of them use the .txt extension.

The other way to do it is to give the user a chance to tell the
system what encoding the text file is in. You have to be careful
not to confuse general users with strange numbers such as
8859 or 1252, but it is a much more general solution.
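
A minimal sketch of the two approaches combined, again in Python: the
user may pick an encoding by a descriptive name, and only when no choice
is given does the program fall back to guessing. The function name
decode_text, the FRIENDLY_NAMES table, and its labels are all invented
for this example; a real application would offer many more encodings.

```python
import codecs
from typing import Optional

# Hypothetical user-facing labels mapped to Python codec names, so that
# the user never has to see numbers such as 8859 or 1252.
FRIENDLY_NAMES = {
    "Western European (ISO)": "iso-8859-1",
    "Western European (Windows)": "cp1252",
    "Japanese (Shift-JIS)": "shift_jis",
    "Unicode (UTF-8)": "utf-8",
}


def decode_text(data: bytes, user_choice: Optional[str] = None) -> str:
    """Decode a plain-text file, preferring the user's explicit choice."""
    if user_choice is not None:
        return data.decode(FRIENDLY_NAMES[user_choice])
    # No hint from the user: a BOM is the only reliable signal.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # With a BOM present, the "utf-16" codec picks the byte order
        # from the BOM and removes it from the result.
        return data.decode("utf-16")
    # Last resort: guess 8859-1, which never fails but may be wrong.
    return data.decode("iso-8859-1")
```

The 8859-1 fallback cannot raise an error, because every byte value maps
to a character, which is also why a wrong guess goes unnoticed.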

Anyway, arguments such as "would be crazy not to start with a BOM"
or "better off interpreting it as 8859-1" are correct, but in your
mail, they just serve to distract from the main issue:

What the applications mentioned do is clearly against what the
standards say, severely affects interoperability, and is
detrimental to the expectations that a lot of people have placed in
Unicode.

What I, and a lot of others on this list and elsewhere, as very
concerned Unicode citizens, are expecting from you and others at MS,
as (hopefully) good Unicode citizens, is not further arguments
about why the blunders you made may not be as bad as they actually
are, but a clear commitment and fast ACTION to get things right.

Many thanks in advance, Martin.

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT