Re: Japanese text handling problem in Unicode Collation Algorithm

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Oct 13 2009 - 05:00:46 CDT


    On 10/12/2009 6:38 PM, Deborah Goldsmith wrote:
    >> However, even
    >> for case differences, there are certainly lexical differences in
    >> English (and other languages) where uppercase versus lowercase
    >> are *not* optional, and do make systematic differences in meaning.
    >> See, for example, German, where systematic uppercasing of nouns is
    >> not optional, but a required aspect of spelling -- and where substituting
    >> one character for the other would be considered simply wrong.
    >>
    >
    > That’s true, but Japanese is the only language that uses kana, so it’s not clear why the DUCET behavior isn’t language-appropriate without tailoring. There may well be a reason, but it’s not obvious that it has to be the way it is.
    >
    >
    So far, that sounds defensible as a principle. However, it would need to
    be established that the results of a sorted listbox are incorrect.
    Instead, the complaint is based on incorrect *search* behavior. This
    points to a limitation of building a search algorithm directly on collation.

    That is a problem that can occur for other languages as well.
    Distinctions ignored in sorting bring words into proximity based on
    their spelling, not their meaning. In searching, ignoring tertiary
    differences yields not proximity but a match. As the current case
    highlights, that can lead to false matches, which are annoying,
    especially when they are patently false.
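
    To make the distinction concrete, here is a minimal sketch in Python.
    It is a toy model, not the UCA and not the real DUCET weights: the
    function toy_key and its "strength" parameter are hypothetical, there
    is no secondary level, and the only tertiary differences modelled are
    ASCII case and hiragana versus katakana (folded by the fixed code
    point offset 0x60).

```python
def toy_key(text, strength=3):
    """Build a toy multi-level key: (primary,) or (primary, tertiary).

    Assumption: this only imitates the *shape* of a collation key.
    """
    primary, tertiary = [], []
    for ch in text:
        cp = ord(ch)
        if 0x30A1 <= cp <= 0x30F6:        # katakana with a hiragana twin
            primary.append(cp - 0x60)     # same primary weight as hiragana
            tertiary.append(1)            # remembered only at level 3
        elif "A" <= ch <= "Z":            # ASCII uppercase
            primary.append(cp + 0x20)     # same primary weight as lowercase
            tertiary.append(1)
        else:
            primary.append(cp)
            tertiary.append(0)
    key = (tuple(primary),)
    if strength >= 3:
        key += (tuple(tertiary),)
    return key


words = ["ハシ", "はな", "はし"]

# Sorting: the tertiary level only breaks ties, so the two spellings
# end up next to each other but remain distinct entries.
in_order = sorted(words, key=toy_key)

# Searching below tertiary strength: the same two spellings now compare
# equal, i.e. proximity has turned into a match.
loose_match = toy_key("はし", strength=2) == toy_key("ハシ", strength=2)
strict_match = toy_key("はし", strength=3) == toy_key("ハシ", strength=3)

print(in_order, loose_match, strict_match)
```

    In the sorted list the two spellings are merely adjacent; in the
    comparison at the lower strength they are reported as the same string.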

    Collation may be a starting point for identifying large classes of
    ignorable differences for searching, but applying it without tailoring
    specific to the search task is incorrect.
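
    As a rough illustration of what "specific to the search task" could
    mean, the sketch below (Python, with hypothetical helper names
    search_fold and find_matches) lets the search decide separately which
    distinctions to ignore, instead of inheriting one strength level from
    the collation. It assumes only ASCII case and the hiragana/katakana
    offset of 0x60; a real implementation would tailor an actual collator.

```python
def search_fold(text, ignore_case=True, ignore_kana_type=False):
    """Fold away only the distinctions this search chose to ignore."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ignore_kana_type and 0x30A1 <= cp <= 0x30F6:
            ch = chr(cp - 0x60)           # katakana -> hiragana
        elif ignore_case and "A" <= ch <= "Z":
            ch = chr(cp + 0x20)           # ASCII upper -> lower
        out.append(ch)
    return "".join(out)


def find_matches(query, candidates, **options):
    """Return the candidates equal to the query under the chosen folding."""
    target = search_fold(query, **options)
    return [c for c in candidates if search_fold(c, **options) == target]


candidates = ["Kana", "kana", "かな", "カナ"]

# Default for this hypothetical search: case is ignorable, kana type is not.
case_hits = find_matches("KANA", candidates)
kana_hits = find_matches("かな", candidates)

# The caller can still opt in to the looser behaviour explicitly.
loose_kana_hits = find_matches("かな", candidates, ignore_kana_type=True)

print(case_hits, kana_hits, loose_kana_hits)
```

    The point of the design is only that each class of difference is a
    separate decision for the search task, not a side effect of the sort
    order.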

    Ken pointed that out already, but it was apparently well hidden in his
    reply.

    A./


