Re: Hanzi trad-simp folding and z-variants

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Fri, 07 Jun 2013 20:55:43 -0700

> simplified [is] better thought of as abbreviated
Part of this is a terminological argument. The historical situation is
indeed more complicated than many people know, but it is also true that,
irrespective of e.g. people's past or present usage in handwriting,
there have been (in the past and especially in the present) printing
traditions which you can pinpoint by political region and time,
occasionally by publisher. Regardless of what exactly happened during the
pre-simplification era, there are fairly stable traditions now.

[quote approximate and adapted:]
> a ["]fully simplified["] passage of text will contain[] both
> simplified characters and those which have not been simplified [...]
> and therefore [be] tagged as traditional.
This depends on the algorithm used for tagging. And note that tagging
doesn't in fact have to be a /binary/ classifier.†
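
To illustrate what a non-binary tagger could look like, here is a minimal
sketch. The two character sets are tiny hypothetical samples made up for
this example (a real tool would derive them from Unihan data), and the
function name is my own. It returns a score between 0 and 1 instead of a
hard "simplified"/"traditional" label:

```python
# Hypothetical sample sets; real data would come from a resource like Unihan.
SIMP_ONLY = set("说语汉")   # forms used only in simplified text (assumption)
TRAD_ONLY = set("說語漢")   # forms used only in traditional text (assumption)

def script_profile(text):
    """Return the share of distinguishing characters that are simplified.

    1.0 means only simplified-specific forms were seen, 0.0 means only
    traditional-specific forms, and None means the text contains no
    distinguishing characters at all (shared characters carry no signal).
    """
    simp = sum(ch in SIMP_ONLY for ch in text)
    trad = sum(ch in TRAD_ONLY for ch in text)
    total = simp + trad
    if total == 0:
        return None
    return simp / total
```

Characters shared by both traditions are simply ignored here, so a "fully
simplified" passage full of unsimplified characters still scores 1.0.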

> working at character level is not the best way to go for your
> purposes, a larger units such as words or phrases produce much more
> meaningful results as this mimics the way a person reads Chinese, they
> do read process one character at a time rather word by word.
I don't think JohnB was suggesting character-based retrieval. (I mean,
who in his right mind would want to do letter-based (and post–case
folding) retrieval for English documents? :-) Okay – just a joke, this
analogy isn't any good.) But of course you're right to point out that
simplification or the reverse operation (what's the term for that?
"T-conversion" maybe?) is word- and context-dependent in edge cases.

A different point: I'm not recommending imprecision, but people are
partly used to it from text they've seen converted by those horrible
conversion tools you can find online, and for some characters, people
won't actually notice.

> Whilst the kZVariant field does mean that characters can, are
> frequently are transposed
What do you mean by "transposed"? Could you give an example?

> it does not tell you when, also as said above the probability is that
> you have ordinary Chinese text written in the mainland style, folding
> based on the the kZVariant field, would either leave things unchanged
> or if it changed things would misspell words, that is the sounds, or
> in some cases appearance, would probably be similar, or homophones,
> but would not match any dictionaries.
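But if every occurrence in everything you process is folded the same
way, queries and indexed text alike (just as folding to lower case is
often done in NLP), this isn't a problem: a folded form never has to
match a dictionary, only other folded forms. Again, I'm not recommending
this as best practice, I'm just pointing it out.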

> There are Chinese compatibility characters in Unicode which if present
> which it probably would good to fold in but these are not in the scope
> of UniHan.
And you remind me that z-variation is locale-dependent (see also †
above). Anyway, I think it's hard to find examples of meaning-divergent
z-variant words in modern Mandarin (MSM). I'm sure you or someone else
will be able to quickly dig out examples, but really the question is
what set of algorithms and data structures is best to address the
general situation. Have locale-dependent folding tables? Allow a search
term prefix that specifies "don't normalize or fold the following term"?
Have secondary filters in your search that use a stricter model of
character identity?
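
To make those three options concrete, here is a rough sketch that
combines them. Everything in it is an assumption for illustration: the
per-locale tables are toy data, the "=" prefix is an invented syntax, and
the `strict` flag stands in for a secondary filter with a stricter model
of character identity.

```python
# Hypothetical per-locale folding tables (toy data, not from Unihan).
FOLD_TABLES = {
    "zh-Hans": str.maketrans({"說": "说", "説": "说"}),
    "zh-Hant": str.maketrans({"说": "說", "説": "說"}),
}
EXACT_PREFIX = "="  # invented marker: "don't normalize or fold this term"

def fold(text, locale):
    """Fold text with the table chosen for the given locale."""
    return text.translate(FOLD_TABLES[locale])

def search(query, docs, locale="zh-Hans", strict=False):
    """Folded substring search with an opt-out prefix and a strict filter."""
    if query.startswith(EXACT_PREFIX):
        # Prefix given: compare raw code points, no folding at all.
        term = query[len(EXACT_PREFIX):]
        return [d for d, t in docs.items() if term in t]
    folded = fold(query, locale)
    hits = [d for d, t in docs.items() if folded in fold(t, locale)]
    if strict:
        # Secondary filter: keep only hits containing the exact query form.
        hits = [d for d in hits if query in docs[d]]
    return hits

docs = {"a": "小說", "b": "小说", "c": "小説"}
print(search("小说", docs))                    # all three documents
print(search("=小説", docs))                   # only the exact form
print(search("小说", docs, strict=True))       # folded match, then narrowed
print(search("小說", docs, locale="zh-Hant"))  # folding toward another locale
```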

Stephan
Received on Fri Jun 07 2013 - 23:00:32 CDT
