Re: Hanzi trad-simp folding and z-variants

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Sat, 08 Jun 2013 01:51:57 -0700

As far as general folding is concerned, performing conversion (whether
word-based or not, and even if locale-tailored) followed by a strict
search will make you miss the z-variation you find in the wild (whether
due to true variation or to misspellings). A more generous inclusion of
z-variation is in fact unlikely to give you false matches: normally,
different words don't differ merely on the z-axis, though I seem to
remember having seen an example involving the name of a historical term
somewhere.
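
To make the contrast between a strict search and a z-variant-tolerant
one concrete, here is a minimal sketch in Python. The variant classes
are a tiny hand-picked illustration, not an extract from any real
table (a realistic one might be derived from data such as Unihan's
variant fields), and the names VARIANT_CLASSES, fold, strict_match and
folded_match are hypothetical.

```python
# Hypothetical, hand-picked variant classes for illustration only.
# Each string lists forms treated as equivalent for search purposes;
# the first character serves as the class representative.
VARIANT_CLASSES = [
    "說説说",  # traditional form, a z-variant of it, simplified form
    "為爲为",
]

# Map every member of a class to that class's representative.
FOLD = {ch: cls[0] for cls in VARIANT_CLASSES for ch in cls}


def fold(text):
    """Replace each character by its class representative, if it has one."""
    return "".join(FOLD.get(ch, ch) for ch in text)


def strict_match(query, doc):
    """Plain substring search with no folding at all."""
    return query in doc


def folded_match(query, doc):
    """Substring search after folding both the query and the document."""
    return fold(query) in fold(doc)
```

With this toy table, a strict search for 小說 does not find a document
that spells the word 小説, whereas the folded search does; characters
outside the table pass through unchanged.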

You are right about this point:
> My point here was folding based on a character by character approach
> of traditional to simplified model would not make accurate word based
> retrieval from the resulting text easier but harder.
and the note on "transposition". But I also don't think this is the end
of the story: if you strictly convert on a word level, you will miss
(note that this point is different from the one in my first paragraph
above) those search results where your contextual conversion heuristics
were wrong. Perhaps a Classical Chinese character collocation coincides
with a modern Chinese term in simplified spelling but should be
converted "directly" instead of transposed when going from CN to TW. For
that you'd need some sort of n-way expansion of the search query. I
don't have an example off the top of my head, but I don't think this
scenario is unrealistic at all.
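
One way such an n-way expansion could look is sketched below in
Python. Instead of committing to a single conversion, the query is
expanded into every candidate form (direct character-by-character
conversion plus any word-level transposition) and a document matches
if it contains any of them. Both tables are hypothetical miniatures
made up for illustration, as are the names CHAR_S2T, WORD_CN2TW,
direct_candidates, expand and search; the entries are ordinary modern
terms, not the Classical Chinese case mentioned above, for which I
have no example.

```python
from itertools import product

# Hypothetical miniature tables, for illustration only.
# Character-level simplified -> traditional candidates (may be one-to-many).
CHAR_S2T = {
    "软": ["軟"],
    "件": ["件"],
    "头": ["頭"],
    "发": ["發", "髮"],
}
# Word-level CN -> TW "transposition" (lexical replacement).
WORD_CN2TW = {
    "软件": ["軟體"],
}


def direct_candidates(query):
    """All character-by-character conversions of the query."""
    options = [CHAR_S2T.get(ch, [ch]) for ch in query]
    return {"".join(combo) for combo in product(*options)}


def expand(query):
    """Union of direct conversions and word-level transpositions.

    Simplification: the word table is consulted for the whole query
    only; no segmentation of longer queries is attempted.
    """
    candidates = direct_candidates(query)
    candidates.update(WORD_CN2TW.get(query, []))
    return candidates


def search(query, docs):
    """Return the documents containing any expanded form of the query."""
    forms = expand(query)
    return [doc for doc in docs if any(form in doc for form in forms)]
```

Here expand("软件") yields both 軟件 (direct) and 軟體 (transposed), so
a wrong guess by the conversion heuristics no longer hides results.
The price is a larger query, since the number of direct candidates
grows multiplicatively with every one-to-many character.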

Stephan
Received on Sat Jun 08 2013 - 03:55:20 CDT
