Re: Hanzi trad-simp folding and z-variants

From: john knightley
Date: Sat, 8 Jun 2013 13:50:26 +0800

On Sat, Jun 8, 2013 at 11:55 AM, Stephan Stiller wrote:

> simplified [is] better thought of as abbreviated
> Part of this is a terminological argument. The historical situation is
> indeed more complicated than many people know, but the truth is also that
> irrespective of eg people's past or present usage in handwriting there have
> (in the past and esp in the present) been printing traditions which you can
> pinpoint by political region and time, occasionally by publisher.
> Regardless of what exactly happened during the pre-simplification era,
> there are fairly stable traditions now.

Merely offering an alternative translation of 简体. As you say, the historical
situation is complex; however, "Simplified", as in the standard used in
mainland China, is well defined. The situation also tends to be complex once
one steps outside of Putonghua.

> [quote approximate and adapted:]
> a ["]fully simplified["] passage of text will contain[] both simplified
> characters and those which have not been simplified [...] and therefore
> [be] tagged as traditional.
> This depends on the algorithm used for tagging. And note that tagging
> doesn't in fact have to be a *binary* classifier.†

Tautological; however, the original email was referring to using such a
binary tagging system.
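
To illustrate what a non-binary tagger could look like, here is a minimal
sketch. The two character sets are tiny hand-picked samples and the function
name script_profile is made up for this example; a real tagger would need
proper mapping data.

```python
# Illustrative samples only: a real tagger would load full mapping data.
SIMPLIFIED_ONLY = set("简汉语书读")
TRADITIONAL_ONLY = set("簡漢語書讀")

def script_profile(text):
    """Count characters by script instead of returning one binary tag."""
    simp = sum(ch in SIMPLIFIED_ONLY for ch in text)
    trad = sum(ch in TRADITIONAL_ONLY for ch in text)
    # Everything else (shared characters, punctuation) counts as neutral.
    neutral = len(text) - simp - trad
    return {"simplified": simp, "traditional": trad, "neutral": neutral}

print(script_profile("我读汉语书"))
```

A passage in mainland style will typically have a large "neutral" count,
which is exactly the information a binary tag throws away.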

> working at character level is not the best way to go for your purposes;
> larger units such as words or phrases produce much more meaningful
> results, as this mimics the way a person reads Chinese: they do not process
> one character at a time but rather word by word.
> I don't think JohnB was suggesting character-based retrieval. (I mean, who
> in his right mind would want to do letter-based (and post–case folding)
> retrieval for English documents? :-) Okay – just a joke, this analogy isn't
> any good.) But of course you're right to point out that simplification or
> the reverse operation (what's the term for that? "T-conversion" maybe?) is
> word- and context-dependent on the edges.

My point here was that folding traditional to simplified on a
character-by-character basis would make accurate word-based retrieval from
the resulting text harder, not easier.
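
A small sketch of the problem, assuming a toy character table and a toy word
table (both hypothetical and far from complete). In the mainland standard 乾
is kept in words such as 乾坤, while 乾燥 becomes 干燥, so a purely
character-based table cannot get both right.

```python
# Toy tables for illustration; real data would be much larger.
CHAR_TABLE = {"乾": "干", "幹": "干", "發": "发", "髮": "发"}
WORD_TABLE = {"乾坤": "乾坤"}  # 乾 is retained here in mainland usage

def fold_by_char(text):
    """Fold each character independently of its neighbours."""
    return "".join(CHAR_TABLE.get(ch, ch) for ch in text)

def fold_by_word(text):
    """Prefer the longest word-table match, else fall back to characters."""
    out = []
    i = 0
    max_len = max(map(len, WORD_TABLE))
    while i < len(text):
        # Try multi-character matches from longest down to length 2.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in WORD_TABLE:
                out.append(WORD_TABLE[chunk])
                i += n
                break
        else:
            out.append(CHAR_TABLE.get(text[i], text[i]))
            i += 1
    return "".join(out)

print(fold_by_char("乾坤"), fold_by_word("乾坤"), fold_by_word("乾燥"))
```

The character-based fold turns 乾坤 into 干坤, which matches no dictionary
entry, and it also merges 發 and 髮 into the same 发.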

> A different point: I'm not suggesting imprecision, but people are partly
> used to this in text they've seen converted by those horrible tools you can
> find online for that purpose, and for some characters, people won't
> actually notice.
> Whilst the kZVariant field does mean that characters can be, and
> frequently are, transposed
> What do you mean by "transposed"? Could you give an example?

By "transposed" I mean that characters can sometimes be changed when going
between different traditions and locales; it is not a one-way street.

> it does not tell you when. Also, as said above, the probability is that
> you have ordinary Chinese text written in the mainland style; folding based
> on the kZVariant field would either leave things unchanged or, if it
> changed things, would misspell words: the sounds, or in some cases the
> appearance, would probably be similar, or the words would be homophones,
> but they would not match any dictionaries.
> But if all occurrences of everything you process are folded (folding to
> lower-case is often done in NLP), this isn't a problem. Again, I'm not
> recommending this as best practice, I'm just pointing it out.
> There are Chinese compatibility characters in Unicode which, if present,
> it would probably be good to fold, but these are not in the scope of
> UniHan.
My earlier statement about UniHan and compatibility variants was not
correct: UniHan does have a kCompatibilityVariant field.
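
For what it is worth, the same mapping is available without the UniHan files,
because compatibility ideographs carry a singleton canonical decomposition in
the Unicode Character Database. A minimal sketch using Python's unicodedata
module (the function name is my own):

```python
import unicodedata

def fold_compatibility_ideographs(text):
    """Replace CJK compatibility ideographs with their unified ideograph."""
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        decomp = unicodedata.decomposition(ch)
        # Only touch compatibility ideographs with a single-code-point
        # canonical decomposition; leave everything else untouched.
        if (name.startswith("CJK COMPATIBILITY IDEOGRAPH")
                and decomp and " " not in decomp
                and not decomp.startswith("<")):
            out.append(chr(int(decomp, 16)))
        else:
            out.append(ch)
    return "".join(out)

# U+F900 is a compatibility ideograph whose unified form is U+8C48.
print(fold_compatibility_ideographs("\uF900") == "\u8C48")
```

Note that applying NFC or NFD normalization performs this fold as well, along
with everything else those forms do.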

> And you remind me that z-variation is locale-dependent (see also †
> above). Anyways, I think it's hard to find examples of meaning-divergent
> z-variant words in modern Mandarin (MSM). I'm sure you or someone else will
> be able to quickly dig out examples, but really the question is what set of
> algorithms and data structures is best to address the general situation.
> Have locale-dependent folding tables? Allow a search term prefix that
> specifies "don't normalize or fold the following term"? Have secondary
> filters in your search that use a stricter model of character identity?
> does a good summary of the
possibilities. Trying to "fold" from one locale to another, which is what
folding from traditional to simplified would be, is not a good idea; best
practice is to bear in mind the locale being used, and to do information
retrieval on a locale-by-locale basis.
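
A rough sketch of two of the possibilities listed above, per-locale folding
tables plus a "do not fold" prefix. The table entries, the "=" prefix and the
function name are all assumptions made for illustration, not taken from any
existing system.

```python
# Hypothetical per-locale tables: each locale folds towards its own
# preferred form, so the mapping runs in both directions.
FOLD_TABLES = {
    "zh-TW": {"裏": "裡"},
    "zh-HK": {"裡": "裏"},
}
NO_FOLD_PREFIX = "="  # assumed syntax for "search this term exactly"

def normalize_term(term, locale):
    """Fold a search term using the table for the given locale only."""
    if term.startswith(NO_FOLD_PREFIX):
        # The user asked for strict character identity.
        return term[len(NO_FOLD_PREFIX):]
    table = FOLD_TABLES.get(locale, {})
    return "".join(table.get(ch, ch) for ch in term)

print(normalize_term("這裏", "zh-TW"), normalize_term("這裡", "zh-HK"))
```

Each index would be built and queried with the table of its own locale; no
table attempts to convert text from one locale into another.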

John Knightley

> Stephan
Received on Sat Jun 08 2013 - 00:54:56 CDT
