Re: Unicode, Cure-all or Kill-all?

From: John H. Jenkins (tseng@sj-coop.net)
Date: Tue Aug 13 1996 - 14:45:31 EDT


Martin J Duerst, mduerst@ifi.unizh.ch wrote:

>
>J"org Knappen wrote:
>
>>Martin Duerst wrote:
>>
>>>Assume I show you the character Tai2 (a triangle on top of a square),
>>>alone. If you can tell me whether this is Taiwan, Typhoon, or Sir,
>>>I will accept that we can use three separate codepoints. But I am
>>>sure you can't.
>>
>>You confuse glyphs with characters.
>
>J"org, I don't know how much Chinese, or Japanese or Korean,
>you read/write, but it's definitely not as easy as that.
>There are some documents in the standardization process, notably
>by John Jenkins, that give the necessary changes to the character/
>glyph model for CJK ideographs.
>

The model followed by the IRG -- based ultimately on Japanese standards
practice -- makes a three-fold distinction.

X-variants: Different semantics, different abstract shape
Y-variants: Same semantics, different abstract shape
Z-variants: Same semantics, same abstract shape but different actual
shape

"A" and "B" are examples of X-variants.

A grotesque "a" and gothic "a" are examples of Y-variants.

A Geneva "a" and Helvetica "a" are examples of Z-variants.
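
For those who think better in code, the three-fold distinction can be sketched as a tiny decision rule. This is only an illustration of the definitions above, not anything the IRG publishes; the names Variant and classify are made up for the example, and the two boolean inputs are exactly the judgment calls that people disagree about.

```python
from enum import Enum


class Variant(Enum):
    """The three kinds of variation in the XYZ model (hypothetical names)."""
    X = "different semantics, different abstract shape"
    Y = "same semantics, different abstract shape"
    Z = "same semantics, same abstract shape, different actual shape"


def classify(same_semantics: bool, same_abstract_shape: bool) -> Variant:
    """Map two yes/no judgments about a pair of forms onto X, Y, or Z.

    The combination "different semantics, same abstract shape" is not
    named by the three definitions, so it is rejected rather than guessed.
    """
    if same_semantics and same_abstract_shape:
        return Variant.Z
    if same_semantics:
        return Variant.Y
    if not same_abstract_shape:
        return Variant.X
    raise ValueError("combination not covered by the XYZ definitions")


if __name__ == "__main__":
    print(classify(False, False).name)  # "A" vs. "B"
    print(classify(True, False).name)   # grotesque "a" vs. gothic "a"
    print(classify(True, True).name)    # Geneva "a" vs. Helvetica "a"
```

The function itself is trivial; the hard part is deciding what to pass in for a given pair of forms.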

In Western practice -- heck, for everything except CJK -- Y- and
Z-variations are folded together as glyphic variations. There are few,
if any, serious suggestions that anybody treat gothic and grotesque "a"'s
as distinct characters. (Basically, if you took your average man on the
street, showed them a gothic "a" and a grotesque "a", and asked whether
the two were the same letter or different letters, the answer would be
that they're the same.)

The character/glyph (CG) model is designed around this.

OTOH, in CJK, Y-variations are *not* considered glyphic (again, basically
by the average man on the street test), and the CG model falls down.

Even the XYZ-model, however, is an idealization, since there are
instances of semantic overlap -- where one character can be used for
another in most (but not all) cases.

>There are some very particular problems for CJK:
>- The number of characters/glyphs is huge. You cannot assume everybody
> to know all the details of their history, and you cannot require
> historical expertise just to use a computer.

I've just been listening to my "Hitchhiker's Guide to the Galaxy" tapes,
so I'm going to paraphrase Douglas Adams:

The number of characters/glyphs is huge. Really, really huge. You have
no idea how mind-bogglingly huge the number of characters/glyphs is. You
may think there are a lot of characters/glyphs in a printer's dingbat
set, but that's just peanuts compared to CJK ideographs.

And so on.

Although there aren't a lot of people in the standards industry who'd
agree with Timothy Huang that there are some 75,000 distinct CJK
ideographs that should be separately encoded, I don't think he'd have any
disagreement that there are 75,000 (and more) distinct *somethings* that
dance about in the field with "character" demarked at one end and "glyph"
at the other.

>- For the same meaning (and history), sometimes character shapes
> are very close, but sometimes they are completely different,
> without many people knowing that it's actually the same meaning.
>

In fact, the whole rationale of the "average man on the street test"
fails. The number of characters is so huge that innumerable characters
are known only to experts holed up in dusty little offices. If you show
your average computer user some of these characters and ask them if
they're the same or different (or how to pronounce them or what they
mean), they'll stare at you blankly.

And it isn't difficult to find two "average" people who disagree as to
where the line is drawn between X-variant, Y-variant, and Z-variant for a
given pair of ideographs.

As a Cantonese speaker, for example, I'm aware of a number of characters
which Mandarin speakers tend to use interchangeably although in Cantonese
they have very distinct meanings and pronunciations. And, of course,
there are a number of "Cantonese" characters which Mandarin speakers
would eschew altogether as gutterisms.

Becker's Law -- for every expert there is an equal and opposite expert --
comes into play here in full force.

Nor are the histories of all known ideographs equally known. Some may be
misprints for others which have acquired lives of their own. So what do
you do?

Dictionary makers tend to be very conservative and distinguish as much as
possible. Unless you *know* that a certain author thought two different
hanzi were just variations, you'd prefer to include both.

Standards people tend to be rather more liberal and willing to unify, or
to leave out Y-variants of characters they include. After all, so long
as people can write *something* that conveys the meaning they want, it
doesn't matter particularly what it *is* -- does it? (Or so, at least,
runs the philosophy.)

>The whole thing is somewhat comparable to e.g. hyphen/minus.
>Unicode distinguishes hyphen and minus (besides having a generic
>hyphen/minus), because in certain circumstances one might indeed
>want to distinguish them and show them differently, although these
>circumstances are rare and the distinction is definitely a burden on
>the general user. But one could go further: distinguish minus in the
>sense of numerical subtraction and in the sense of set difference
>(and in many other senses it may be used). To a mathematician, these
>are clearly identifiable differences in meaning. However, it is a nice
>theory, but without any practical relevance or sense. And it is an
>exact parallel to the Tai2 case.
>

Excellent example.
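
For anyone who wants to see the three hyphen/minus characters side by side, here is a minimal sketch using Python's standard unicodedata module. The helper name describe is made up for the example; the code points and character names come from the Unicode character database.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode character name for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # The generic hyphen/minus, followed by the two distinguished characters.
    for ch in ("\u002D", "\u2010", "\u2212"):
        print(describe(ch))
```

All three usually look like a short horizontal stroke on screen, which is rather the point of the comparison.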

John H. Jenkins
tseng@sj-coop.net
jenkins@apple.com
http://www.sj-coop.net/~tseng
