RE: Re: Unicode, Cure-all or Kill-all?

From: Geoff Back (geoff@autocue.MHS.CompuServe.COM)
Date: Tue Aug 13 1996 - 09:54:15 EDT

With all these discussions about whether characters that have separate
codepoints and identical glyphs should have been merged, there is one factor
that no-one has mentioned: semantic analysis.

Taking the example below, if I am performing semantic analysis on raw Unicode
text for, say, spell checking, there is a clear and vital difference between
Latin "A" and Greek "A". This is where the Unicode definition of a character
comes into its own.
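
As a rough illustration of the point above, here is a minimal sketch in
Python using only the standard unicodedata module. It prints the codepoint
and character name of three capital letters that usually look alike, and
flags a word that mixes them. The helper names describe and scripts_in are
made up for this example, and taking the script from the first word of the
character name is a shortcut that happens to work for these letters, not a
general script-detection method.

```python
import unicodedata

# Three capital letters that usually share a glyph but not a codepoint.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]


def describe(ch):
    """Return (codepoint string, Unicode character name) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


def scripts_in(word):
    """Collect the first word of each letter's name (an assumed shortcut
    for 'script'); non-letters are skipped."""
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}


for ch in LOOKALIKES:
    print(*describe(ch))

# A word that looks like plain Latin text but contains a Greek Alpha.
suspicious = "P\u0391RIS"
print(scripts_in(suspicious))
```

A spell checker working on raw text could use a check of this kind to treat
a word whose letters come from more than one script as suspect, which is
only possible because the lookalike letters were not merged.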

In the case of Chinese and Japanese ideographs, I don't know a great deal
about the subject, so I can't say much. However, it seems to me that given
the apparent need for immense numbers of codepoints, we shouldn't be trying
to get them all into the Unicode set - they should be in 10646, and thus
available to people that need them. Certainly in our application (broadcast
media) the Unicode set is acceptable to our clients using Chinese and


  -- Geoff.

From: MAIL@CSERVE {INTERNET:unicode@Unicode.ORG} on behalf of MAIL@CSERVE
Sent: 13 August 1996 08:09
Subject: Re: Re: Unicode, Cure-all or Kill-all?

J"org Knappen wrote:

>Martin Duerst wrote:
>>Assume I show you the character Tai2 (a triangle on top of a square),
>>alone. If you can tell me whether this is Taiwan, Typhoon, or Sir,
>>I will accept that we can use three separate codepoints. But I am
>>sure you can't.
>You confuse glyphs with characters.

J"org, I don't know how much Chinese, or Japanese or Korean,
you read/write, but it's definitely not as easy as that.
There are some documents in the standardization process, notably
by John Jenkins, that give the necessary changes to the character/
glyph model for CJK ideographs.

There are some very particular problems for CJK:
- The number of characters/glyphs is huge. You cannot assume that everybody
        knows all the details of their history, and you cannot require
        historical expertise just to use a computer.
- For the same meaning (and history), sometimes character shapes
        are very close, but sometimes they are completely different,
        without many people knowing that it's actually the same meaning.

In addition, there are some general problems:

>Assume I show you in isolation something looking like `A'. Can you
>tell me from seeing it in isolation whether it is a Latin capital A, a
>Cyrillic capital A or a Greek capital Alpha? I bet you can't. It could also
>be a Latin small letter a, represented in a caps and small caps font, or
>the \forall quantifier turned 180 degrees.

First, your example of Latin/Cyrillic/Greek capital A/Alpha relies
on the current standard. It would very well have been possible
to code this as one codepoint only, if not for backwards compatibility.
Historically seen, it is really the same letter. There might be other
examples, like Latin C/Cyrillic C, that fit somewhat better here. Anyway, even in
this case, unification might have been possible. The definition of
does not say anything about use in different scripts or different "meanings",

>Despite having similar or even identical glyphs, all these possible
>characters have correctly different codepoints. You have to gather the
>additional information to make the right choice.

These are truly identical glyphs; indeed, it is just a single glyph. This is
the difference from your A example. If you show me a representative selection
of Latin, Cyrillic, and Greek A, I will probably be able to distinguish them
(any type expert will do so immediately). However, if you give me a
representative selection of the three variants of Tai2, there is no chance of
making a distinction, because they appear in the same fonts and as the same
glyph.

The whole thing is somewhat comparable to e.g. hyphen/minus.
Unicode distinguishes hyphen and minus (besides having a generic
hyphen/minus), because in certain circumstances one might indeed
want to distinguish them and show them differently, although these
circumstances are rare and the distinction is definitely a burden on
the general user. But one could go further: distinguish minus in the
sense of numerical subtraction and in the sense of set difference
(and in many other senses it may be used). To a mathematician, these
are clearly identifiable differences in meaning. However, it is a nice
theory, but without any practical relevance or sense. And it is an
exact parallel to the Tai2 case.
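
For reference, the hyphen/minus distinction mentioned above can be inspected
with a short sketch in Python, again using only the standard unicodedata
module. The helper name dash_info is made up for this example; the names and
general categories it prints come from the Unicode character database that
ships with Python.

```python
import unicodedata

# The generic hyphen-minus, the dedicated hyphen, and the dedicated minus.
DASHES = ["\u002D", "\u2010", "\u2212"]


def dash_info(ch):
    """Return (codepoint string, character name, general category)."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)


for ch in DASHES:
    print(*dash_info(ch))
```

The hyphens are classed as dash punctuation (Pd) and the minus sign as a
math symbol (Sm), so software can tell them apart even where a font draws
them alike.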

Regards, Martin.
