J"org Knappen wrote:
>Martin Duerst wrote:
>>Assume I show you the character Tai2 (a triangle on top of a square),
>>alone. If you can tell me whether this is Taiwan, Typhoon, or Sir,
>>I will accept that we can use three separate codepoints. But I am
>>sure you can't.
>You confuse glyphs with characters.
J"org, I don't know how much Chinese, or Japanese or Korean,
you read/write, but it's definitely not as easy as that.
There are some documents in the standardization process, notably
by John Jenkins, that give the necessary changes to the character/
glyph model for CJK ideographs.
There are some very particular problems for CJK:
- The number of characters/glyphs is huge. You cannot assume that everybody
knows all the details of their history, and you cannot require
historical expertise just to use a computer.
- For the same meaning (and history), sometimes character shapes
are very close, but sometimes they are completely different,
without many people knowing that it's actually the same meaning.
In addition, there are some general problems:
>Assume I show you in isolation something looking like `A'. Can you
>tell me from seeing it in isolation whether it is a Latin capital A, a
>Cyrillic capital A or a Greek capital Alpha? I bet you can't. It could also
>be a Latin small letter a, represented in a caps and small caps font, or
>the \forall quantifier turned 180 degrees.
First, your example of Latin/Cyrillic/Greek capital A/Alpha relies
on the current standard. It would very well have been possible
to code this as one codepoint only, if not for backwards compatibility.
Historically, it is really the same letter. There might be other examples,
like Latin C/Cyrillic C, that fit somewhat better here. Anyway, even in
this case, unification might have been possible. The definition of "character"
does not say anything about use in different scripts or different "meanings".
>Despite having similar or even identical glyphs, all these possible
>characters have correctly different codepoints. You have to gather the
>additional information to make the right choice.
In the Tai2 case the glyphs are truly identical; indeed, it is just a single glyph. This is another
difference from your A example. If you show me a representative selection
of Latin, Cyrillic, and Greek A, I will probably be able to distinguish them
(any type expert will do so immediately). However, if you give me a
representative selection of the three variants of Tai2, I have no chance of making
a distinction, because they appear in the same fonts and as the same glyph.
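
To make the contrast concrete, here is a minimal Python sketch using the standard `unicodedata` module. It assumes that the Tai2 character described above (a triangle on top of a square) is the ideograph at U+53F0; that identification and the variable names are illustrative and not part of the original exchange.

```python
import unicodedata

# Three look-alike capital letters that the current standard encodes separately.
look_alikes = ["\u0041", "\u0410", "\u0391"]
names = [unicodedata.name(ch) for ch in look_alikes]
for ch, name in zip(look_alikes, names):
    print(f"U+{ord(ch):04X} {name}")

# Assumption: Tai2 is the ideograph at U+53F0. It is one code point,
# whichever of its meanings the writer intends.
tai = "\u53f0"
tai_name = unicodedata.name(tai)
print(f"U+{ord(tai):04X} {tai_name}")
```

The three letters come back with three different character names, while the ideograph has a single code point and a single name regardless of which reading is meant.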
The whole thing is somewhat comparable to e.g. hyphen/minus.
Unicode distinguishes hyphen and minus (besides having a generic
hyphen/minus), because in certain circumstances one might indeed
want to distinguish them and show them differently, although these
circumstances are rare and the distinction is definitely a burden on
the general user. But one could go further: distinguish minus in the
sense of numerical subtraction and in the sense of set difference
(and in many other senses it may be used). To a mathematician, these
are clearly identifiable differences in meaning. It is a nice
theory, but one without any practical relevance. And it is an
exact parallel to the Tai2 case.
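
The hyphen/minus situation can be inspected the same way. The following is a small sketch, again with Python's standard `unicodedata` module; the dictionary labels are hypothetical names chosen for this illustration.

```python
import unicodedata

# The generic hyphen-minus plus the two more specific characters.
dashes = {
    "generic": "\u002d",
    "hyphen": "\u2010",
    "minus": "\u2212",
}
dash_info = {
    label: (unicodedata.name(ch), unicodedata.category(ch))
    for label, ch in dashes.items()
}
for label, (name, category) in dash_info.items():
    print(f"{label}: U+{ord(dashes[label]):04X} {name} (category {category})")
```

The hyphens fall under dash punctuation and the minus sign under math symbols, which is the kind of distinction a renderer or line breaker can act on.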