Re: Benefits of Unicode

From: David Starner (dstarner98@aasaa.ofe.org)
Date: Sat Jan 27 2001 - 19:49:40 EST


On Sat, Jan 27, 2001 at 01:36:44PM -0800, Richard Cook wrote:
> Has anybody played devil's advocate to this, with a list of "Failings of
> Unicode"? Are there any? :-) This question might in fact result in a
> longer Benefits list ....
 
Here's a start (and it's true that a lot of these are double-edged,
and are the results of good design choices):

* Failure to gain the trust of many Japanese and Chinese users

* Character placement randomness: frequently the order of characters
is random, or appears so; often the only explanations are older standards
or historical evolution, and even when there is a reason, it differs from
block to block. Characters aren't always found in the block where you'd
rationally expect them (for example, all the symbols in the Basic Latin
block; or the characters in Letterlike Symbols that should be in the
Mathematical Alphanumeric Symbols block).

* Too much dependence on ASCII and Latin-1
        * The first 256 characters are essentially random. They should be
          sorted out into blocks like the rest of the characters.
        * The Latin-1 block has special exceptions, like U+00B5 (MICRO SIGN)
          not being canonically equivalent to U+03BC (GREEK SMALL LETTER MU).
        * Characters like U+0027 and U+0060 add confusion about what should
          be used for the apostrophe (a large number of Unicode documents use
          U+0027 indiscriminately where U+2019 or U+2032 would be more
          "correct").
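
As an aside, the micro-sign case is easy to check; a minimal Python sketch
using the standard unicodedata module (not part of the original mail):

```python
import unicodedata

micro = "\u00B5"  # MICRO SIGN, in the Latin-1 block
mu = "\u03BC"     # GREEK SMALL LETTER MU

# Canonical normalization (NFC/NFD) keeps the two distinct...
print(unicodedata.normalize("NFC", micro) == mu)   # False
# ...only compatibility normalization (NFKC) folds them together.
print(unicodedata.normalize("NFKC", micro) == mu)  # True
```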

* A lot of canonically equivalent or compatibility-equivalent
  characters exist that add complexity to Unicode for little
  or no gain in expressiveness.
          * There is no documentation on which compatibility
          characters really shouldn't be used, although people
          on unicode@unicode claim various characters shouldn't
          be used.

* Having precomposed letters adds complexity without additional
  expressiveness. You have to worry about two forms of (a-umlaut)
  and the like anywhere you want to do comparisons.
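
To illustrate the comparison problem, a Python sketch (not part of the
original mail): the two encodings of a-umlaut compare unequal until both
sides are normalized.

```python
import unicodedata

precomposed = "\u00E4"    # LATIN SMALL LETTER A WITH DIAERESIS, one code point
combining   = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# A naive string comparison sees two different sequences...
print(precomposed == combining)  # False
# ...so comparisons must normalize both sides first (NFC or NFD).
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```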

* Unicode is too intimately connected with UTF-16. UTF-16 could have
  been done by moving U+E000-U+FFFF down 800h code points and starting the
  supplementary characters at U+F800. (So U+F800 would carry some character
  value, but F800 would be a surrogate code unit in UTF-16, and surrogates
  could be combined in some way to yield U+F800.)
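
For reference, the surrogate scheme as actually adopted combines a high and
a low surrogate with the standard formula; a Python sketch (not part of the
original mail):

```python
def decode_surrogate_pair(hi: int, lo: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    assert 0xD800 <= hi <= 0xDBFF, "high surrogate out of range"
    assert 0xDC00 <= lo <= 0xDFFF, "low surrogate out of range"
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# D800 DC00 encodes the first supplementary code point, U+10000.
print(hex(decode_surrogate_pair(0xD800, 0xDC00)))  # 0x10000
```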

* Sort of related to the last one: foo with combining character bar should
  be foo with bar, not foo with apostrophe or foo with comma. For example,
  U+0165 is SMALL T WITH CARON and canonically equals t plus combining caron,
  but it is shown as t', making it impossible to represent a t with an
  actual caron above it in Unicode.
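
The t-caron decomposition itself can be verified; a Python sketch (not part
of the original mail):

```python
import unicodedata

t_caron = "\u0165"
print(unicodedata.name(t_caron))  # LATIN SMALL LETTER T WITH CARON
# The canonical decomposition is plain 't' + COMBINING CARON (U+030C),
# even though many fonts draw the caron as an apostrophe-like mark.
print(unicodedata.normalize("NFD", t_caron) == "t\u030C")  # True
```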

(Some beefs that aren't really mine . . .)

* A lot of characters are unified in ways that make them harder to use,
  requiring careful rich text to look decent: the unification of one
  Serbian character with a Russian character (the one that looks different
  in italics), the mixing of CJK ideographs, the unification of Coptic with
  Greek, and the mixing of Fraktur and Latin characters (which would have
  been a big deal 75 years ago).

As you can probably see, I'm a theorist with a distinct view of what the
universal character encoding should look like. Fortunately, unlike the
Rosetta guy and the Tron people, I know that Unicode is the right combination
of support and 90% solution to win, even against superior opposition, and
that I am not a worldwide expert on languages, and any attempt I made
would be hopelessly cribbed from Unicode (see the Tron character set) and
ineptly designed outside of the Latin block.

> > Tex Texin wrote:
> > An ISO standard - Standards insure interoperability
> > Any applications reading the same text file will interpret it correctly

Depending on your audience, I might cut this back some. First, the whole
"Standards insure interoperability" almost makes me laugh. Even with the
POSIX and ISO C++ standards out there, moving programs depending on those
standards from one system to another can be difficult. And those are
standards that are generally respected - the extended Pascal standard
and BASIC standard often don't even get lip service paid to them.

Also, "any applications reading the same text file will interpret it
correctly" isn't true; almost any program reading Latin/Greek/Chinese
text will handle it (provided the right fonts), but many won't handle
RTL languages, many won't handle combining characters, many won't handle
proper formatting for Arabic and Indic languages, and many won't handle
supplementary characters.

-- 
David Starner - dstarner98@aasaa.ofe.org
Pointless website: http://dvdeug.dhis.org



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT