Re: TC/SC mapping

From: John H. Jenkins (jenkins@apple.com)
Date: Thu Jan 24 2002 - 15:11:18 EST


On Thursday, January 24, 2002, at 12:29 PM, John Cowan wrote:

> John H. Jenkins wrote:
>
> {TC1, SC1, SC2, TC2, TC3, SC3} constitute a "Han simplification
> class" (HSC), and are all the same when appearing in IDNs.
>
> Correct?
>

Yes.

>
>> The caveat is that this must be understood to be a first-order,
>> computer-appropriate equivalence and is not in any way to be held to be
>> a generalized solution to the lexically appropriate conversion between
>> SC and TC.
>
>
> Is there any danger that these classes will turn out to be a
> "small world", in the sense that we wind up with a few huge classes
> which include almost all the characters?
>

Nope.

>> (Maybe we should refer to *zhengguihua* instead of "Han normalization"…)
>
>
> Can you explain the joke?
>

It's just to make Ken happy. He doesn't like me talking about "Han
normalization," since "normalization" is Unicodespeak for something else.
"Zhengguihua" is Mandarin for "normalization."

>> It will also mean that we will no longer be able to accept both the TC
>> and SC form for a character as a candidate for separate encoding in the
>> future,
>
>
> I don't understand this part. Since this is neither compatibility nor
> canonical equivalence, it will not affect any of the known normalization
> forms. Nor are we defining a new normalization form here, since in
> HSCs like the above there is no particular reason to pick any of the
> six characters as *the* normalized form, although by convention we can
> pick one -- say, the one with the smallest Unicode scalar
> value, or the one which appears in the largest number of legacy
> sets -- to aid in description and implementation.
>
> It's just another of those sets of equivalence classes provided for
> special purposes, like the Arabic/Syriac shaping classes or the
> canonical combining classes.
>
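
As an illustration of the convention described in the quoted text (fold
every member of an HSC to one representative, here the member with the
smallest Unicode scalar value), a minimal Python sketch follows. The
table HSC_CLASSES and the function names are hypothetical; the single
class shown (U+53D1, U+767C, U+9AEE) is only an example, and real data
would have to come from a reviewed mapping.

```python
# Hypothetical HSC data: each set is one Han simplification class.
# Illustrative only; not an official or complete mapping.
HSC_CLASSES = [
    {"\u53d1", "\u767c", "\u9aee"},
]


def build_fold_table(classes):
    """Map every member of each class to that class's representative."""
    table = {}
    for cls in classes:
        # By convention, pick the member with the smallest scalar value.
        # min() on one-character strings compares by code point.
        rep = min(cls)
        for ch in cls:
            table[ch] = rep
    return table


FOLD = build_fold_table(HSC_CLASSES)


def fold_label(label):
    """Replace each character by its class representative, if it has one."""
    return "".join(FOLD.get(ch, ch) for ch in label)


def same_for_idn(a, b):
    """Two labels match when their folded forms are identical."""
    return fold_label(a) == fold_label(b)
```

The representative is only a key for comparison. Nothing here claims it
is the lexically correct SC or TC form of the other members.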

Well, first of all, the UTC is already on record as refusing to encode new
SC separately.

Secondly, we would break IDN equivalence. If we add a new SC which is
equivalent to two TC, then suddenly domains which could be distinguished
on the basis of the old TC pair can't any more.
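
A small sketch of why this happens, treating HSCs as disjoint sets. The
names TC_a, TC_b and SC_new are placeholders, not real characters, and
the Classes helper is a hypothetical union-find written for this
example.

```python
class Classes:
    """Minimal union-find over character names (illustrative only)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Unknown characters start out alone in their own class.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def equivalent(self, a, b):
        return self.find(a) == self.find(b)


classes = Classes()

# Before the new SC exists, the two TC are distinct in domain names.
before = classes.equivalent("TC_a", "TC_b")

# Encoding a new SC that simplifies both TC joins their classes.
classes.union("SC_new", "TC_a")
classes.union("SC_new", "TC_b")

# Now the two TC compare as equal, so names that differed only in
# TC_a versus TC_b can no longer be told apart.
after = classes.equivalent("TC_a", "TC_b")
```

Because equivalence is transitive, a single added character is enough
to merge two classes that were separate when existing names were
registered.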

> Or are you saying that this new information should be represented
> as a Unicode compatibility equivalence? If so, that would
> wreak havoc with existing NFC and NFKC code.
>

No.

>> (Actually, you could save yourself some grief right off by excluding Han
>> radicals and all compatibility ideographs.)
>
> This would be a Bad Thing in Korean, though, because the whole point
> of Korean compatibility ideographs is to preserve differences in
> reading. Or are ideographs not used in (modern) Korean names?
>

These compatibility ideographs are *not* to provide phonetic-specific
distinctions between various Korean hanja. They're for compatibility with
an older standard only, which did make that distinction. IMHO it would be
more confusing to Chinese, Japanese, *and* Korean readers to have some
domain names distinguished when the only thing different about them is
the Korean pronunciation of the hanja used to write them.

==========
John H. Jenkins
jenkins@apple.com
jenkins@mac.com
http://homepage.mac.com/jenkins/


