Re: Japanese text handling problem in Unicode Collation Algorithm

From: Satoshi Nakagawa (psychs@limechat.net)
Date: Mon Oct 12 2009 - 14:29:00 CDT


    I have checked the Unicode CLDR collation data, but it contains data
    only for the tertiary strength.

    IMHO, [っ] (U+3063) and [つ] (U+3064), for example, should be treated as
    different characters at the primary strength, because they are never
    treated as the same character in Japanese, even though they have similar
    glyphs.

    I would suggest modifying the Default Unicode Collation Element Table.

    In http://www.unicode.org/Public/UCA/latest/allkeys.txt,

    3063 ; [.27B0.0020.000D.3063] # HIRAGANA LETTER SMALL TU
    3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
    30C3 ; [.27B0.0020.000F.30C3] # KATAKANA LETTER SMALL TU
    FF6F ; [.27B0.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
    30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
    FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
    32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
    3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
    30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM

    this part specifies that [っ] (U+3063) and [つ] (U+3064) are treated as the
    same character at the primary and secondary strengths; they differ only
    in the tertiary weight (000D vs. 000E).
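    To make the point concrete, here is a minimal Python sketch that parses a
    few of the entries quoted above and reports the first level at which two
    characters differ. The helper names (parse_entries, first_difference) are
    hypothetical, and the comparison is simplified to single characters; it is
    not a full implementation of the collation algorithm.

```python
import re

# Entries copied from the allkeys.txt excerpt quoted above.
ENTRIES = """\
3063 ; [.27B0.0020.000D.3063] # HIRAGANA LETTER SMALL TU
3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
"""

# One collation element: [.primary.secondary.tertiary.fourth]
ELEMENT = re.compile(
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.[0-9A-F]{4,6}\]"
)


def parse_entries(text):
    """Map each code point to a list of (primary, secondary, tertiary) tuples."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, elements = data.split(";", 1)
        table[int(code.strip(), 16)] = [
            tuple(int(w, 16) for w in m.groups())
            for m in ELEMENT.finditer(elements)
        ]
    return table


def first_difference(a, b):
    """Return the first level (1 to 3) at which two keys differ, or None."""
    for level in range(3):
        # Zero weights are ignorable at that level and are skipped.
        wa = [ce[level] for ce in a if ce[level]]
        wb = [ce[level] for ce in b if ce[level]]
        if wa != wb:
            return level + 1
    return None


table = parse_entries(ENTRIES)
print(first_difference(table[0x3063], table[0x3064]))  # prints 3
```

    With the current weights, [っ] and [つ] only become distinguishable at
    level 3, so any comparison limited to primary or secondary strength
    treats them as equal.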

    My suggestion is to give the SMALL TU entries their own primary weight,
    like this:

    3063 ; [.3267.0020.000D.3063] # HIRAGANA LETTER SMALL TU
    3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
    30C3 ; [.3267.0020.000F.30C3] # KATAKANA LETTER SMALL TU
    FF6F ; [.3267.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
    30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
    FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
    32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
    3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
    30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM

    Then [っ] (U+3063) and [つ] (U+3064) are always treated as different characters.
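    As a rough illustration of the effect, the Python sketch below groups the
    nine code points listed above by primary weight, once with the current
    weights and once with the suggested ones. The primary weights are
    transcribed by hand from the two listings; the function name
    primary_classes is hypothetical, and whether 3267 is otherwise unused in
    the real table is an assumption that is not checked here.

```python
from collections import defaultdict

# Primary weight of the first collation element, per the current listing.
CURRENT = {
    0x3063: 0x27B0, 0x3064: 0x27B0, 0x30C3: 0x27B0,
    0xFF6F: 0x27B0, 0x30C4: 0x27B0, 0xFF82: 0x27B0,
    0x32E1: 0x27B0, 0x3065: 0x27B0, 0x30C5: 0x27B0,
}

# Suggested change: only the three SMALL TU entries get the new weight.
PROPOSED = {**CURRENT, 0x3063: 0x3267, 0x30C3: 0x3267, 0xFF6F: 0x3267}


def primary_classes(primary):
    """Group characters that compare equal at primary strength."""
    groups = defaultdict(list)
    for code_point, weight in primary.items():
        groups[weight].append(chr(code_point))
    return dict(groups)


for name, weights in (("current", CURRENT), ("proposed", PROPOSED)):
    classes = primary_classes(weights)
    print(name, len(classes), "primary class(es)")
    for weight, chars in sorted(classes.items()):
        print(f"  {weight:04X}: {' '.join(chars)}")
```

    Under the current weights all nine characters fall into one primary
    class; under the suggested weights the small forons (U+3063, U+30C3,
    U+FF6F) form a class of their own.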

    This applies not only to [っ] and [つ]: all the character pairs in my last
    mail should be modified in the same way.

    Note that, as far as I know, the JIS standard does not specify a collation
    algorithm or sorting order.

    --
    Satoshi Nakagawa
    

