Re: Transcriptions of "Unicode"

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Dec 06 2000 - 22:40:50 EST


James Kass said:

> Although the IRG still
> considers unification relevant, it seems to me that they
> are much tighter now in their definition of 'sameness'
> than was previously the case. Not all of the approx 40000
> "new" characters in Plane 2 are the names of race horses,
> some of them, as far as I can tell, would have been unified
> before.

I'll let the IRG participants speak to that one, but...

>
> Consider the "teeth" ideograph(s). (Radical number 211, in
> some radical lists.) Because this is a radical, CJK encoders
> can select the specific desired character:
> U+2FD2 for Traditional Chinese
> U+2EED for Japanese
> U+2EEE for Simplified Chinese

Uh oh! This is one of the dangers of these dang radicals.
First of all, the radicals are *not* intended to be used as
regular ideographic characters. That is why they all have the
General Category "So" (Symbol, Other), rather than "Lo" (Letter,
Other). So if you go around recommending their
usage *instead* of the unified character for regular text, you
can end up with some strange behavior.
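
The property difference is easy to check for yourself. Here is a minimal
sketch using Python's standard unicodedata module. The dictionary labels and
the helper name general_categories are only illustrative, and the values
reported depend on the Unicode version bundled with your Python build.

```python
import unicodedata

# The "teeth" code points mentioned in this thread.
TEETH = {
    "U+2FD2 (Kangxi radical)": "\u2fd2",
    "U+2EED (radical supplement)": "\u2eed",
    "U+2EEE (radical supplement)": "\u2eee",
    "U+9F52 (unified ideograph)": "\u9f52",
    "U+9F7F (unified ideograph)": "\u9f7f",
}

def general_categories(chars):
    """Map each label to the General Category of its character."""
    return {label: unicodedata.category(ch) for label, ch in chars.items()}

for label, cat in general_categories(TEETH).items():
    print(label, cat)
```

Anything that keys off the letter categories (word breaking, identifier
rules, and so on) is likely to treat the radicals differently from the
unified ideographs.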

Note that the entire Kangxi radical set, U+2F00..U+2FD5, consists of
duplicate symbols for the radicals that *are* encoded as unified
characters in the main set. Effectively, they are all compatibility
characters.
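
To see what "compatibility characters" means in practice, the following
sketch folds each Kangxi radical with compatibility normalization (NFKC) and
looks at what comes out. It assumes Python's standard unicodedata module;
the helper name compat_target is hypothetical, chosen only for this example.

```python
import unicodedata

# Every code point in the Kangxi Radicals range U+2F00..U+2FD5.
KANGXI_RADICALS = [chr(cp) for cp in range(0x2F00, 0x2FD5 + 1)]

def compat_target(ch):
    """Return what ch folds to under compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", ch)

# The Kangxi radical for 'teeth' and the character it folds to.
tooth = compat_target("\u2fd2")
print(f"U+2FD2 -> U+{ord(tooth):04X}")

# How many radicals in the range fold to some other character.
folded = [ch for ch in KANGXI_RADICALS if compat_target(ch) != ch]
print(len(folded), "of", len(KANGXI_RADICALS), "radicals fold")
```

Any process that applies compatibility normalization will therefore replace
these radicals with the unified characters.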

The CJK radicals supplement, U+2E80..U+2EF3, are the ones that
show a number of specific forms, but those are intended for
special text purposes, as when specifying a radical index in
a dictionary.

> Since anyone encoding U+9F52 might see any of the above
> three versions, my opinion is that encoders (authors) would
> wish to explicitly encode their expected character and would
> do so whenever they have the option.

First of all, you missed the simplified version of 'teeth'
at U+9F7F. If someone explicitly wants the (Chinese) simplified
version, of course they should use that, and not U+2EEE, for
heaven's sake.

> I believe that they
> should have the option. The abundance of unassigned code
> points offered by additional Unicode planes makes this
> possible and would eliminate the need for a browser
> (or any other application) to "guess" a language in order
> to display material as its authors and users desire.

I think you are way overstating the scope of the problem.
Browsers can meet most of their users' expectations merely by
having their Unicode font set to a Japanese font or a Chinese
font, as desired. It is only for fine control of mixed
language data that you may need more, and for that, it is
not unreasonable to expect that people will require language
and font markup.

I consider it pernicious to be suggesting that things would
be better if we just gave up on unification and encoded all
the glyphs. You might make things a little easier on the
rendering end (although the fonts would keep growing), but
the resultant problems of text equivalence for searching and
other text processes would just get much worse than they already
are for Han characters.
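
A small illustration of the searching problem, again only a sketch with
Python's standard unicodedata module. The function names naive_find and
folded_find and the sample text are made up for this example. As far as I
know, U+2EEE has no decomposition mapping in the Unicode Character Database,
so normalization does not relate it to any unified character.

```python
import unicodedata

def naive_find(haystack, needle):
    """Plain code point comparison, as a simple search would do."""
    return needle in haystack

def folded_find(haystack, needle):
    """Compare after NFKC, which folds compatibility characters."""
    fold = lambda s: unicodedata.normalize("NFKC", s)
    return fold(needle) in fold(haystack)

text = "\u7259\u9f52"          # ordinary text using the unified U+9F52
query_kangxi = "\u2fd2"        # similar shape, entered as the Kangxi radical
query_c_simplified = "\u2eee"  # the form from the CJK radicals supplement

print(naive_find(text, query_kangxi))
print(folded_find(text, query_kangxi))
print(folded_find(text, query_c_simplified))
```

The Kangxi radical can at least be recovered by normalization. A separately
encoded glyph variant with no mapping cannot, and every such variant would
need its own equivalence handling in every search implementation.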

--Ken
>
> Best regards,
>
> James Kass.


