Re: The Arrogants and the sillies (RE: Euros and cents)

From: Thomas Chan (tc31@cornell.edu)
Date: Wed Mar 27 2002 - 06:50:08 EST


On Tue, 26 Mar 2002, Doug Ewell wrote:

> I'm surprised nobody took Dan the Silly Man to task on this one.
> > English enjoys new words on-the-fly.
> > What a pity Kanji on-the-fly is a taboo, at least on Unicode ;)

I think these were meant as rhetorical questions, but I'll bite,
particularly #3...

 
> Can you name a character encoding standard, anywhere in the world,
> invented by anybody -- government, industry consortium, private company,
> individual, kwijibo, ANYBODY -- that can do better in this regard than
> Unicode?

Besides the giant 70K+ repertoire, which reduces the likelihood of an
unavailable character, there's always the PUA option. Some other
competitors in the Han character area don't even have that (i.e., a
"gaiji" area), instead forcing one to submit such characters for
registration.
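
To make the PUA option concrete, here is a minimal Python sketch that
tests whether a code point is private-use. It relies on the standard
unicodedata module, which reports General_Category "Co" for private-use
code points; the helper name is_private_use is made up for this example.

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    """Return True if ch is a private-use code point (General_Category Co)."""
    return unicodedata.category(ch) == "Co"

# U+E000 opens the BMP Private Use Area; U+5973 is an ordinary encoded
# Han character, so it is not available for private assignment.
print(is_private_use("\ue000"))   # True
print(is_private_use("\u5973"))   # False
```

A character placed in such an area only means something to parties who
have agreed on the assignment privately.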

 
> Can you name a font technology that will support the display of these
> "invented-on-the-fly" Kanji?
>
> For that matter, can you invent a Kanji on the fly that cannot be
> represented (perhaps in a rather cumbersome way) with Ideographic
> Description Characters?

Yes, it's possible but uncommon. Unlike some other character description
schemes, IDS can only form characters by composition. For example, there's
no way to strip out everything except the right half of U+8BD1 (yi4 'to
translate') and use that right half as a component in describing another
character (as of Unicode 2.1--I haven't checked later versions).
Such a component would need to be separately encoded for it to participate
in an IDS. Sometimes such components are not independent characters, or
they are rare independent characters that have been overlooked for
encoding. In this particular example, when U+776A occurs as part of a
character in unsimplified Chinese, then the simplified Chinese form would
have U+776A converted into the component mentioned above by application of
simplification rules (standing alone, U+776A is identical in simplified
form). Find all the characters containing U+776A as a component and
create the simplified forms by applying the rule--that'll generate plenty
of characters that IDS's can't represent. Another case is a character for
'Marxism': it is U+9A6C with the final stroke removed and replaced by
U+4E49 (again, I have only checked this example against Unicode 2.1).
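
The "composition only" limitation can be sketched in a few lines of
Python. This is an illustration, not a reference implementation: the
names IDC_ARITY and parse_ids are invented here, and the arity table
assumes just the twelve IDCs U+2FF0..U+2FFB (later Unicode versions may
define more). An IDS is read in prefix notation, and every leaf must be a
single encoded character, so a component with no code point of its own
has nowhere to go.

```python
# Arity of the twelve Ideographic Description Characters U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three operands; the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2ff2"] = 3
IDC_ARITY["\u2ff3"] = 3

def parse_ids(s: str):
    """Parse a prefix-notation IDS into a nested tuple (idc, operand, ...).

    Leaves are single characters: a component can take part only if it
    is separately encoded.
    """
    def parse(i):
        if i >= len(s):
            raise ValueError("IDS ended before all operands were supplied")
        ch = s[i]
        if ch not in IDC_ARITY:
            return ch, i + 1          # leaf: one encoded character
        node = [ch]
        i += 1
        for _ in range(IDC_ARITY[ch]):
            child, i = parse(i)       # operands follow the IDC in order
            node.append(child)
        return tuple(node), i

    tree, end = parse(0)
    if end != len(s):
        raise ValueError("trailing characters after a complete IDS")
    return tree

# U+2FF0 (left to right) applied to U+5973 and U+5B50.
print(parse_ids("\u2ff0\u5973\u5b50"))
```

There is no operator here for "this character minus that part", which is
why the examples above fall outside what an IDS can say.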

There are also an almost negligible number of cases such as U+4E52 and
U+4E53 (used to write ping1pang1 'ping pong') or U+5187 (used to write
Cantonese mou5 'to not have', among other words), which are created by
deleting single strokes from U+5175 and U+6709, respectively. A
number of Vietnamese chu+~ no^m characters are also created in this
fashion. This is at a level smaller than the components that IDS's work
on, and is really not a flaw of IDS's.

IDS's, unlike some other description schemes, also don't handle
rotation--there are also an almost negligible number of cases where a
character (or a component) is formed by rotating another 180 degrees,
e.g., U+20114, which is U+4E88 rotated 180 degrees. However, this is so
rare that a rotation IDC would not be productive even if it existed.

IDS's also don't handle cases of "ligaturing", e.g., U+21155 (xi3 'double
happiness'), which is two U+559C side-by-side in origin. Distinguish this
from U+56CD, which has the same meaning as U+21155 but where ligaturing
doesn't take place.

IDS's also don't handle cases of guwen 'ancient character', which are
characters in pre-modern form that have been converted to modern form,
e.g., U+20A30, a tortured character which is really the zhuan 'seal' form
of U+5973 (nuu3 'woman; female') modernized. IDS's might handle it, but
clumsily. Others such as U+20066 are just impossible with IDS's.
However, characters of this type are not likely to be created in this
age, except as modernizations of ancient forms.

Despite these counterexamples, IDS's do handle the majority of unencoded
Han characters, most of which are of the "left to right" or "above to
below" variety with respect to the particular IDC's used.
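
As a rough sketch of what "variety with respect to the IDC" means in
practice, the Python below labels each IDS by its outermost IDC, reading
the label from the character's Unicode name (U+2FF0 is "LEFT TO RIGHT",
U+2FF1 is "ABOVE TO BELOW"). The function name top_level_layout and the
sample strings are invented for illustration; the samples are not a
survey of real unencoded characters.

```python
import unicodedata
from collections import Counter

PREFIX = "IDEOGRAPHIC DESCRIPTION CHARACTER "

def top_level_layout(ids: str) -> str:
    """Return the layout named by the outermost IDC of an IDS."""
    if not ids:
        return "NOT AN IDS"
    name = unicodedata.name(ids[0], "")
    # IDC names all share PREFIX; the remainder describes the layout.
    return name[len(PREFIX):] if name.startswith(PREFIX) else "NOT AN IDS"

# Hypothetical samples: two left-to-right descriptions, one above-to-below.
samples = ["\u2ff0\u5973\u5b50", "\u2ff0\u559c\u559c", "\u2ff1\u5973\u5b50"]
tally = Counter(top_level_layout(s) for s in samples)
print(tally)
```

Run over a real list of IDS's for unencoded characters, a tally like this
would show how heavily the two simple layouts dominate.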

Thomas Chan
tc31@cornell.edu


