Re: Traditional and Simplified Han in UTS 39

From: Asmus Freytag via Unicode <unicode_at_unicode.org>
Date: Wed, 27 Dec 2017 21:24:52 -0800

The full excerpt from the UTS reads:

> Mark Chinese strings as “mixed script” if they contain both simplified
> (S) and traditional (T) Chinese characters, using the Unihan data in
> the Unicode Character Database [UCD
> <http://www.unicode.org/reports/tr39/#UCD>].
>
> 1. The criterion can only be applied if the language of the string is
> known to be Chinese. So, for example, the string “写真だけの結婚式”
> is Japanese, and should not be marked as mixed script because of a
> mixture of S and T characters.
> 2. Testing for whether a character is S or T needs to be based not on
> whether the character /has/ a S or T variant, but whether the
> character /is/ an S or T variant.
>

There are several issues with this.

First and foremost, the definition of S and T variants is not something
that is universally agreed upon. The .cn, .hk or .tw registries are
using a definition of S and T variants that does not agree with the
Unihan data in many particulars. Therefore, using the Unihan data would
result in false positives (and false negatives).
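
As a rough illustration of how the quoted check misfires, here is a
minimal Python sketch. The ST_LABEL table and the is_mixed_st function
are hypothetical: the table is hand-picked for this example and merely
stands in for labels one might derive from the Unihan kSimplifiedVariant
and kTraditionalVariant properties.

```python
# Hypothetical, hand-picked S/T labels for illustration only.
# A real check would need a full table, and the registries' own
# tables do not agree with Unihan in many particulars.
ST_LABEL = {
    "写": "S",  # simplified form of 寫, also the everyday Japanese form
    "寫": "T",
    "国": "S",
    "國": "T",
    "结": "S",
    "結": "T",  # traditional form of 结, also the everyday Japanese form
}


def is_mixed_st(text):
    """Return True if text contains both an S-labelled and a T-labelled character."""
    labels = {ST_LABEL[ch] for ch in text if ch in ST_LABEL}
    return {"S", "T"} <= labels


# The Japanese string from the UTS excerpt is flagged as "mixed",
# which is only meaningful if the language is known to be Chinese.
print(is_mixed_st("写真だけの結婚式"))
```

The Japanese example is reported as mixed because 写 carries an S label
and 結 a T label in this table, even though both are ordinary Japanese.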

Second, there are many variant characters that are acceptable in both
"S" and "T" labels. You only have to look at the published Label
Generation Rulesets (or IDN tables) for these domains to see many
examples. And, as mentioned above, you cannot reverse engineer these
tables from Unihan data.

Third, the same domains mentioned have a policy of delegating up to
three labels to the same applicant: a "traditional" label, a
"simplified" label, and a mixed label matching the spelling of the
label in the original application (for situations where a mixed label
is appropriate). In other words, certain mixed labels are seen as
appropriate.

Fourth, the Chinese ccTLDs all have a robust policy of preventing any
other mixed label that is a variant of the three from being allocated to
an unrelated party. If you "know" that the language has to be Chinese,
because the domain is a ccTLD, then at the same time the check is
superfluous. Other registries are not known to have similar policies, so
for them additional spoof detection may be useful. However, those are
precisely the cases where it is impossible to know whether a label is
intended to be in the Chinese language.

Fifth, generally the only thing that can be ascertained is that a label
is *not* in Chinese, by virtue of having Kana or Hangul characters mixed
in. The reverse does not hold: you will find labels registered under
.jp that do not contain Hiragana or Katakana.
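
The asymmetry can be shown with a small Python sketch. The function
name rules_out_chinese is made up for this example, and matching on
Unicode character names is just one simple way to spot Kana and Hangul.

```python
import unicodedata


def rules_out_chinese(label):
    """Return True if the label contains Kana or Hangul.

    True means the label is not Chinese. False means nothing at all:
    the label may still be Japanese or Korean written in Han only.
    """
    for ch in label:
        name = unicodedata.name(ch, "")
        if "HIRAGANA" in name or "KATAKANA" in name or "HANGUL" in name:
            return True
    return False


print(rules_out_chinese("写真だけの結婚式"))  # contains Hiragana
print(rules_out_chinese("東京"))  # Han only, language undetermined
```

A False result must not be read as "Chinese": a Han-only label such as
東京 is perfectly good Japanese.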

Sixth, for zones that are shared by different CJK languages, the state
of the art is to have a coordinated policy that prevents "random"
variant labels from coexisting in the registry. An example of this kind
of effort is being developed for the root zone. By definition, for the
root zone, there is no implied information about the language context,
unlike the case for the second level, where the presence of a ccTLD in
the full domain name may give a clue.

Seventh, attempting to determine whether a label is potentially valid
based on variant data (of any kind) is doomed, because actual usage is
not limited to "pure" labels. The variant mechanism is something that
works differently (in those registries that apply it): instead of
looking at a single label, the registry can implement "mutual
exclusion". Once one variant label from a given set has been delegated,
all others are excluded (or in practice, all but three, which are
limited to the same applicant). Without access to the registry data, you
cannot predict which variants in a set are the "good ones", and with
access to the data, spoof labels are rejected and cannot be registered.
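
To make "mutual exclusion" concrete, here is a toy Python sketch. The
VariantRegistry class, its method names and the two variant sets are
all invented for illustration; real registries use their own published
tables and more elaborate rules, and the limit of three labels per
applicant is taken from the policy described above.

```python
class VariantRegistry:
    """Toy registry that enforces mutual exclusion among variant labels."""

    MAX_LABELS_PER_SET = 3  # "all but three", limited to the same applicant

    def __init__(self, variant_sets):
        # Map every character to one representative of its variant set.
        self._canon = {}
        for group in variant_sets:
            representative = min(group)
            for ch in group:
                self._canon[ch] = representative
        # Canonical key -> (applicant, labels delegated so far).
        self._holders = {}

    def _key(self, label):
        # All variant labels of one set collapse to the same key.
        return "".join(self._canon.get(ch, ch) for ch in label)

    def register(self, label, applicant):
        """Return True if the label is delegated, False if it is blocked."""
        key = self._key(label)
        if key not in self._holders:
            self._holders[key] = (applicant, {label})
            return True
        owner, labels = self._holders[key]
        if owner != applicant:
            return False  # variant of a label held by someone else
        if label in labels:
            return True  # already delegated to this applicant
        if len(labels) >= self.MAX_LABELS_PER_SET:
            return False
        labels.add(label)
        return True


# Hypothetical variant sets, for illustration only.
registry = VariantRegistry([{"国", "國"}, {"际", "際"}])
print(registry.register("国际", "alice"))    # first label of the set
print(registry.register("國際", "mallory"))  # blocked: variant of alice's label
print(registry.register("国際", "alice"))    # mixed label, same applicant
```

Note that the decision depends entirely on what is already in the
registry, not on whether a label looks "pure" S or T.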

In conclusion, my recommendation would be to retract this particular
passage.

A./

On 12/27/2017 1:31 PM, Karl Williamson via Unicode wrote:
> UTS 39 says that, optionally,
>
> "Mark Chinese strings as “mixed script” if they contain both
> simplified (S) and traditional (T) Chinese characters, using the
> Unihan data in the Unicode Character Database [UCD].
>
> "The criterion can only be applied if the language of the string is
> known to be Chinese."
>
> What does it mean for the language to "be known to be Chinese"? Is
> this something algorithmically determinable, or does it come from
> information about the input text that comes from outside the UCD?
>
> The example given shows some Hiragana in the text.  That clearly
> indicates the language isn't Chinese.  So in this example we can
> algorithmically rule out that it's Chinese.
>
> And what does Chinese really mean here?
>
>
Received on Wed Dec 27 2017 - 23:25:26 CST
