A really bad article about Unicode

From: Doug Ewell (dewell@compuserve.com)
Date: Tue May 02 2000 - 00:19:35 EDT


The Unicode Web site includes a page titled "Press Information" at

    http://www.unicode.org/press/

This page contains links to a variety of articles, written by members of
the general computing press, that have something to do with Unicode.

One of the articles, written by Amy Burns of Microsoft, is called
"Unicode, UTF-8, UCS-2, UCS-4 ... What Is All of This?" and is available
at

    http://msdn.microsoft.com/workshop/management/intl/unicode.asp

This article, written in January 1998 and intended for designers of Web
pages, is so full of misinformation, inappropriate emphasis, and just
plain silly errors that I wonder why the Unicode Consortium chose to
include it in what appears to be a list of "recommended" articles.

In a section called "So Who's in and Who's out?" there is a gratuitous
discussion of the difference between "primary" and "pseudo" scripts,
which is unlikely to tell a Web page designer much except that Unicode
is big and complicated and scary. This is followed by lists of supported
scripts, and longer lists of unsupported ones. The overall impression is
that the Unicode architects have chosen to exclude a great many scripts,
seemingly arbitrarily.

"UCS-2" is equated with "Unicode," whereas "UCS-4" (mistyped "USC-4"
as often as not) is equated with "ISO 10646." This is misleading,
considering that the repertoires of Unicode and ISO 10646 are identical,
and plans to encode Unicode characters in the supplementary planes (the
so-called Astral Planes) were well known before 1998.
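
To make the point concrete, here is a minimal sketch of why the 2-byte
versus 4-byte split does not separate "Unicode" from "ISO 10646." It
counts how many 16-bit and 32-bit code units one character needs. The
function name and the sample characters are my own choices for
illustration, and the sketch uses the codecs of a present-day Python
rather than anything from the Burns article.

```python
# Illustrative sketch only: code_unit_counts and the sample characters
# are hypothetical choices, not taken from the article under discussion.

def code_unit_counts(ch: str) -> dict:
    """Return the code point of ch and how many 16-bit and 32-bit
    code units are needed to represent it."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        # Big-endian codec variants are used so that no byte order mark
        # is prepended and the byte counts reflect the character alone.
        "utf16_units": len(ch.encode("utf-16-be")) // 2,
        "utf32_units": len(ch.encode("utf-32-be")) // 4,
    }

# A character inside the Basic Multilingual Plane (plane 0).
bmp = code_unit_counts("\u00e9")

# The first code point beyond the BMP, in a supplementary plane.
astral = code_unit_counts("\U00010000")

print(bmp)
print(astral)
```

The same character set is in play either way: a character in plane 0
fits in one 16-bit unit, while one beyond it takes a pair of 16-bit
units (a surrogate pair) or a single 32-bit unit.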

The next paragraph not only reverses this artificial 2-byte/4-byte
distinction, but is absolutely the worst description of Unicode I have
ever seen:

    This means if you are using Unicode, your text is being broken up
    every four bytes and sent through the ozone to be reconstructed at
    the other end. If you are using "supported" scripts, you're okay.
    It will put your words back the way it found them when they reach
    their destination. If you use a language that Unicode does not
    currently support, your text will appear corrupted at the other end.
    Perhaps the words will be munged, or extra spaces will be added, or
    some other creative interpretation.

Just imagine yourself sitting in a meeting room, listening to someone
describe Unicode to a Web designer this way.

Burns then begins a cursory discussion of UTF-8, which she says "allows
32-bit encoding of ISO 10646, and breaks up your characters between each
byte instead of every four bytes." O-kay, I'm glad we cleared that up.
(At least the UTF-8 examples are correct.)
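
For contrast, a short sketch of what UTF-8 does in practice: it is a
variable-width encoding, so the number of bytes depends on the code
point, and nothing is "broken up" at fixed intervals. The helper name
and the sample characters below are assumptions chosen for
illustration; the byte counts are those produced by a present-day
Python "utf-8" codec.

```python
# Illustrative sketch only: utf8_lengths and the samples are hypothetical.

samples = [
    "A",           # U+0041, ASCII range
    "\u00e9",      # U+00E9, Latin-1 range
    "\u20ac",      # U+20AC, elsewhere in the BMP
    "\U00010000",  # U+10000, first code point beyond the BMP
]

def utf8_lengths(chars):
    """Return the number of UTF-8 bytes needed for each character."""
    return [len(c.encode("utf-8")) for c in chars]

lengths = utf8_lengths(samples)

# Encoding and then decoding returns the original text unchanged.
text = "".join(samples)
round_trip = text.encode("utf-8").decode("utf-8")

print(lengths)
print(round_trip == text)
```

The round trip at the end is the relevant part for a Web designer:
text that is encoded and decoded with the same encoding comes back
exactly as it went in.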

The article concludes with another Unicode-is-scary statement:

    Diving into the depths of Unicode gets to be a serious lesson in
    octets, binary, division and positive visualization.

but it sounds like Burns is the one with a serious case of the bends.

With all the Microsoft experts on this mailing list, it should be easy
to find an article written by someone from Microsoft that expresses some
knowledge and understanding about Unicode. The Burns article reads like
a bad junior high school essay, and does not deserve to be linked from
the Unicode Web site.

-Doug Ewell
 Fullerton, California


