Re: Hanzi trad-simp folding and z-variants from Stephan Stiller on 2013-06-09 (Unicode Mail List Archive)

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Sat, 08 Jun 2013 22:26:04 -0700

> So we both agree that Unihan is not designed to tell people how to
> convert between traditional and simplified characters.
Yep.

> Though there is some confusion as to what other questions are being
> discussed here.
I think I misused the expression "folding" at some point. But the
original query explicitly asked about "do[ing] traditional to simplified
folding for indexing and query processing (/when the mapping is
unambiguous/)" (emphasis added) so I wasn't really sure where parts of the
discussion were going :-)
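
To make the "only when the mapping is unambiguous" condition concrete, here is a minimal sketch in Python. The table is a tiny hand-picked sample and not Unihan data, and the names `TRAD_TO_SIMP` and `fold_unambiguous` are made up for illustration; a real index would need a curated mapping source.

```python
# Hypothetical sample table: traditional character -> set of possible
# simplified forms. Hand-picked for illustration; not taken from Unihan.
TRAD_TO_SIMP = {
    "廣": {"广"},
    "發": {"发"},
    "髮": {"发"},
    "乾": {"干", "乾"},  # depends on the word, so treated as ambiguous
}


def fold_unambiguous(text):
    """Fold a character only if it has exactly one simplified target."""
    out = []
    for ch in text:
        targets = TRAD_TO_SIMP.get(ch)
        if targets is not None and len(targets) == 1:
            out.append(next(iter(targets)))
        else:
            # Unknown or ambiguous characters pass through unchanged.
            out.append(ch)
    return "".join(out)
```

Applying the same function to both the indexed text and the query keeps the two comparable; ambiguous characters are simply left alone rather than guessed at.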

> Japanese has well-established traditions for simplifying CJK
> ideographs which are not identical to Chinese. If one were to use a
> folding approach to deal with simplifications, then there should be
> differences for Chinese and Japanese.
I think the kyūjitai-shinjitai mappings are not in Unihan. (Compare the
entries of 廣 (U+5EE3) and the characteristically Japanese character 広
(U+5E83).) I know that certain contexts retain older forms (KenL talks
about this somewhere too). Btw, if you know about other mappings or good
resources, I'd be curious to know.
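
As a rough illustration of why a folding approach would need separate tables for Chinese and Japanese, here is a small Python sketch. The two tables are hand-written samples, not derived from Unihan or any official list, and `FOLD_TABLES` and `fold_for_locale` are hypothetical names.

```python
# Hypothetical per-locale folding tables, hand-written for illustration.
# The same traditional character can fold to different characters
# depending on whether Chinese or Japanese conventions apply.
FOLD_TABLES = {
    "zh": {"廣": "广", "氣": "气", "國": "国"},
    "ja": {"廣": "広", "氣": "気", "國": "国"},
}


def fold_for_locale(text, locale):
    """Fold text using the sample table for the given locale."""
    table = str.maketrans(FOLD_TABLES[locale])
    return text.translate(table)


# 廣 (U+5EE3) folds to 广 under the "zh" table and to 広 (U+5E83) under "ja".
for locale in ("zh", "ja"):
    folded = fold_for_locale("廣", locale)
    print(locale, folded, f"U+{ord(folded):04X}")
```

Some characters (國 here) happen to fold the same way in both tables, which is exactly why a single shared table looks tempting and then fails on cases like 廣.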

> "quite well documented" is a relative term
I highly respect the work in Cheung & Bauer, but it makes no attempt to
tell us how easily understood the characters are. Many of them are
ad-hoc coinages that are not understood by any of my informants;
sometimes, for say 6 ways of writing a syllable-morpheme, I can get my
informants to tell me that perhaps /one/ of them is passable. This problem
isn't easily solved, but then the source doesn't help in knowing which
of the approx. 1000 characters are actually used nowadays. I won't
give you a number, as I'd have to check more carefully to be quotable.
The number of morphemes for which there truly seems to be no written
representation is /very/ low, but often the characters in existence
aren't exactly comprehensible to many native speakers either, and not
all of them are unambiguous. This will give you an idea.

> Zhuang Sawndip
Sounds exciting.

> By best choice do you mean (a) the person producing the electronic
> form was unable to use the character they wished because it is not
> yet in Unicode, (b) even though it is in Unicode, the person did not
> know how to type it and so typed another character instead, or (c) a
> less than perfect, or ambiguous, 'spelling'? All of which are found
> both for Sinitic languages and non-Sinitic languages when written in
> CJK ideographs, be it printed publications, web-pages or text
> messages between native speakers.
Nearly all of Cantonese is in Unicode and therefore typeable in theory
(though some people will not be used to such writing, but I'm sure you
know this), so it's not (a). I would say it's largely (c) (people will
often make up their own plausible thing), even though (b) is a reason too.

> Not standardized does not mean totally beyond analysis or processing,
> or even necessarily that confusing to a native speaker; they are not
> random, though admittedly more complex than a standardized locale.
Yes. And we both agree that standardization is desirable.

Stephan
Received on Sun Jun 09 2013 - 00:30:34 CDT
