Re: Hanzi trad-simp folding and z-variants from john knightley on 2013-06-08 (Unicode Mail List Archive)

From: john knightley <john.knightley_at_gmail.com>
Date: Sun, 9 Jun 2013 11:09:04 +0800

On Sat, Jun 8, 2013 at 9:00 PM, Stephan Stiller
<stephan.stiller_at_gmail.com> wrote:

> I.
>
> Which and where?
>>
> Section 3.7.1 Simplified and Traditional Chinese Variants talks about
> converting between Simplified and Traditional Chinese.
>
> You wrote this
>
> http://www.unicode.org/reports/tr38/ does a good summary of the
>>> possibilities.
>>>
>> in response to my inquiry about "examples of meaning-divergent
> z-variant words in modern Mandarin" and appropriate "algorithms and data
> structures". Also, the Unihan database doesn't provide collocational data
> for T/S conversion.
>
>
So we both agree that Unihan is not designed to tell people how to convert
between traditional and simplified characters, though there is some
confusion as to what other questions are being discussed here.

>
> II.
>
>
> simplification is also found in for example Japanese CJK ideographs
> which is documented
>
> Contextual conversion (and shifting/"transposition") is essentially not an
> issue in this context, even though you have an odd case of deviation here
> and there.
>
>
Japanese has well-established traditions for simplifying CJK ideographs,
which are not identical to those of Chinese. If one were to use a folding
approach to deal with simplifications, then the foldings should differ for
Chinese and Japanese.
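
As a minimal sketch of what locale-dependent folding could look like, the
Python below keeps one table per locale. The table contents are a few
hand-picked, well-known pairs chosen for illustration, not data taken from
Unihan, and the names FOLD_TABLES and fold are hypothetical.

```python
# Illustrative, hand-picked tables (an assumption, not Unihan data).
# Each maps a traditional form to the simplified form used in that locale.
FOLD_TABLES = {
    "zh-Hans": {"廣": "广", "氣": "气", "發": "发"},
    "ja": {"廣": "広", "氣": "気", "發": "発"},
}


def fold(text: str, locale: str) -> str:
    """Fold traditional forms in `text` to the given locale's simplified forms."""
    table = str.maketrans(FOLD_TABLES[locale])
    # Characters without an entry in the table pass through unchanged.
    return text.translate(table)
```

The same traditional input folds to different characters depending on the
locale, which is why a single shared table would not serve both languages.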

>
> Some dialects such as Cantonese are quite well documented
>
> [and]
>
> There is an increased interest in such things in recent years. One
> persons 'hand-tuned' of today can become the basis of a standard of
> tomorrow.
>
>
> 1a. I'd say I have a decent grasp of the topic of lexical variation for
> written Cantonese, based on a decent amount of fieldwork. (While we're at
> it, I also know at least one researcher with an interest in standardization
> of Cantonese spelling.) I'm certain that lexical variation in Cantonese is
> not well-documented, though there are a bunch of sources from which you can
> scrap your own thing together.
>

"quite well documented" is a relative term, after Mandarin, Cantonese is
one of the better documented of the Chinese dialects, and better documented
than the use of CJK ideographs for other languages such as, say, Zhuang
Sawndip, my primary area of research. That is not to say there is no more
work to be done in this area for Cantonese.

> 1b. Keep in mind that most materials in electronic form (originally
> written in this form or digitized) don't use the "best" character choices –
> needless to say it's gotta be even truer for other Sinitic languages.
>

By best choice do you mean (a) the person producing the electronic form
was unable to use the character they wished because it is not yet in
Unicode, (b) even though the character is in Unicode, the person did not
know how to type it and so typed another character instead, or (c) a less
than perfect, or ambiguous, 'spelling'? All of these are found for both
Sinitic and non-Sinitic languages when written in CJK ideographs, be it in
printed publications, web pages or text messages between native speakers.

> 2. This is entirely unrelated to the question of whether one can or should
> describe simplified characters as "abbreviated". There is a connection to
> your statement about things being on a sliding scale (you used the word
> "relative"), but for Cantonese it's more like this translates into a lot of
> inconsistency between using genuine C spelling, a M substitute, a C-based
> phonetic transcription, ad-hoc usage using the mouth radical or a prefixed
> roman "o", an English-based informal transcription using Latin letters, and
> avoidance. Whether this is electronically manageable in principle depends
> on whether you include entirely romanized blogs (which I wouldn't
> recommend), but – in any case – anything other than liberal QE (query
> expansion) will *not* work. (I might previously have misused the word
> "folding" to mean "conversion".)
>

The "this" here is not too clear to me. However, the features you describe
for Cantonese are also found in Zhuang texts; these were, however, not what
I meant by "abbreviated". As to variants in general, yes, the scale is wide,
and to a degree dependent upon the locale. Perhaps my email was not clear
either; however, I think we were using folding in the same way, namely as a
step to be taken before either searching based on a word list or
dictionary, conversion to a romanized script, or text to speech.
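
A rough sketch of such a pre-search step, combined with query expansion
where the mapping is one-to-many, might look like the Python below. The
candidate table and word list are tiny hand-made examples, and the names
CANDIDATES, WORD_LIST, expand and lookup are hypothetical.

```python
from itertools import product

# Illustrative one-to-many table (an assumption): simplified character to
# its possible traditional counterparts.
CANDIDATES = {"头": ["頭"], "发": ["發", "髮"]}

# A toy word list standing in for a real dictionary.
WORD_LIST = {"頭髮", "發現"}


def expand(query: str) -> set[str]:
    """Return every candidate string obtained by expanding each character."""
    options = [CANDIDATES.get(ch, [ch]) for ch in query]
    return {"".join(combo) for combo in product(*options)}


def lookup(query: str) -> list[str]:
    """Keep only the expanded candidates that occur in the word list."""
    return sorted(expand(query) & WORD_LIST)
```

Here the word list does the disambiguation that a character table alone
cannot: the query is expanded liberally and only attested words survive.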

> 3. Other Sinitic languages are essentially not at all standardized (we're
> talking Chinese characters here, not romanizations). Last time I checked it
> seemed like Taiwanese is a total mess, and Shanghainese has a (mainland-CN)
> researcher who is (still) writing a dictionary to actually find or document
> written representations of all syllable-"morphemes" to capture all of
> SHnese. The best SHnese textbook was published a couple of years ago in HK
> and uses traditional characters (!) to represent modern SHnese.
>
>
Not standardized does not mean totally beyond analysis or processing,
or even necessarily that confusing to a native speaker. Such texts are not
random, though admittedly more complex than those of a standardized locale.

John

> Stephan
>
>
Received on Sat Jun 08 2013 - 22:15:20 CDT
