Re: Hanzi trad-simp folding and z-variants from john knightley on 2013-06-07 (Unicode Mail List Archive)

From: john knightley <john.knightley_at_gmail.com>
Date: Sat, 8 Jun 2013 08:02:11 +0800

Resending email: Originally sent by mistake just to sender and not to list.

Dear John,

Without looking at your texts I cannot say for certain, however it
should be noted that simplified, perhaps better thought of as abbreviated,
is a relative term. A fully simplified passage of text will therefore
contain both simplified characters and characters which have not been
simplified, that is abbreviated, and which are therefore tagged as traditional.

The situation regarding Chinese documents is somewhat more complicated.
Working at character level is not the best way to go for your purposes;
larger units such as words or phrases produce much more meaningful results,
as this mimics the way a person reads Chinese: they do not process one
character at a time but rather read word by word. Whilst the kZVariant field
does mean that characters can be, and frequently are, transposed, it does not
tell you when. Also, as said above, the probability is that you have ordinary
Chinese text written in the mainland style. Folding based on the
kZVariant field would either leave things unchanged or, if it changed
things, would misspell words: the results would probably be similar in
sound (or be homophones), or in some cases in appearance, but would not match
any dictionaries.

For information retrieval from Chinese documents you require, as a minimum,
a list of the words or phrases that you are looking for, and in simple terms
the longer the phrase, the more likely the match is to be correct. How
long is hard to say; it really depends on what information you are looking
for. A list of words such as 现代汉语常用词表 has over 50 thousand words in it, and a
list with phrases would be longer.
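
As a rough illustration of working with words rather than single characters,
here is a minimal Python sketch of greedy forward maximum matching against a
word list. The tiny LEXICON and the function name segment are made up for the
example; a real application would load a full word list, and other
segmentation methods may well do better.

```python
# Hypothetical mini-lexicon for illustration only; a real system would load
# a large word list instead of these five entries.
LEXICON = {"现代", "汉语", "现代汉语", "常用", "词表"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy forward maximum matching.

    At each position take the longest lexicon entry that matches; if none
    matches, fall back to emitting a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(segment("现代汉语常用词表"))
```

Indexing and querying on the resulting tokens, rather than on individual
characters, means that a longer entry such as 现代汉语 is matched as one unit
instead of as four unrelated characters.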

In short, such a folding algorithm based on kZVariant would not be a good
idea. There are Chinese compatibility characters in Unicode which, if
present, it would probably be good to fold in, but these are not in the
scope of Unihan.
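
One possible way to fold the compatibility characters, sketched here in
Python, is Unicode normalization: the CJK compatibility ideographs have
canonical mappings to unified ideographs, so NFC replaces them. The function
name fold_compat is made up for the example, and note that NFC also
normalizes other text (for instance composing combining sequences), which may
or may not be wanted.

```python
import unicodedata


def fold_compat(text):
    """Replace CJK compatibility ideographs with their unified equivalents.

    This relies on NFC normalization, which applies the canonical mappings
    defined in the Unicode Character Database.
    """
    return unicodedata.normalize("NFC", text)


# U+F900 is a CJK compatibility ideograph whose canonical equivalent
# is the unified ideograph U+8C48.
folded = fold_compat("\uF900")
print(f"U+{ord(folded):04X}")
```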

Regards
John Knightley

On Sat, Jun 8, 2013 at 4:00 AM, Stephan Stiller
<stephan.stiller_at_gmail.com> wrote:

> Hi John,
>
> This is one of those questions that I've been wondering about as well ...
> my guess would be "yes that should work (and dealing with z-variants is
> something you'll likely need to do anyways)", but there *must* be some
> published algorithm out there that specifically addresses the issue of
> differentiable and recoverable folding for indexing.
>
> This comes up in NLP all the time for case folding. My impression is that
> the folks there just fold everything into lowercase and later apply a
> so-called truecasing algorithm (aka truecaser). To someone like me this
> just seems like totally the wrong approach, but I'll be open to be
> convinced otherwise with the right empirical arguments.
>
> If you find some information on data structures and algorithms tailored to
> this problem in the area of indexing/querying, let me know.
>
> Stephan
>
>
>
> On 6/6/2013 12:54 PM, John D. Burger wrote:
>
> Hi there -
>
> I'm working on an information retrieval application for a collection of Chinese documents, which appear to use a mix of traditional and simplified characters. My intuition is that it makes sense to do traditional to simplified folding for indexing and query processing (when the mapping is unambiguous), but I'd be interested in opinions about this.
>
> Second, I just noticed the kZVariant field in the Unihan.zip file. It seems to me that it makes sense to fold these together as well, correct?
>
> Thanks for any information you care to provide.
>
> - John Burger
> MITRE
>
>
>
>
Received on Fri Jun 07 2013 - 19:08:11 CDT
