Re: Japanese text handling problem in Unicode Collation Algorithm

From: Satoshi Nakagawa (psychs@limechat.net)
Date: Mon Oct 12 2009 - 14:29:00 CDT

Next message: Kenneth Whistler: "Re: Japanese text handling problem in Unicode Collation Algorithm"

Previous message: Charlie Ruland ☘: "Error in UTF #10"
In reply to: Mark Davis ☕: "Re: Japanese text handling problem in Unicode Collation Algorithm"
Next in thread: Kenneth Whistler: "Re: Japanese text handling problem in Unicode Collation Algorithm"

I have checked the Unicode CLDR collation data, but it contains data
only for the tertiary strength.

IMHO, for example, [っ] (U+3063) and [つ] (U+3064) should be treated as
different characters at the primary strength, because they are never
treated as the same character in Japanese, even though they have similar
glyphs.

I would suggest modifying the Default Unicode Collation Element Table.

In http://www.unicode.org/Public/UCA/latest/allkeys.txt,

3063 ; [.27B0.0020.000D.3063] # HIRAGANA LETTER SMALL TU
3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
30C3 ; [.27B0.0020.000F.30C3] # KATAKANA LETTER SMALL TU
FF6F ; [.27B0.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM

this part specifies that [っ] (U+3063) and [つ] (U+3064) are treated as
the same character at both the primary and the secondary strength.
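
To make that concrete, here is a minimal Python sketch that parses the
two relevant lines quoted above and builds simplified sort keys at each
strength. The names parse_elements and sort_key are my own, not part of
any library, and the key construction is deliberately reduced: it
ignores normalization, contractions and variable weighting, so it only
illustrates how the weights compare level by level.

```python
import re

# Two lines copied from the allkeys.txt excerpt quoted in this message.
ALLKEYS_EXCERPT = """\
3063 ; [.27B0.0020.000D.3063] # HIRAGANA LETTER SMALL TU
3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
"""

def parse_elements(excerpt):
    """Map each code point to its (primary, secondary, tertiary) weights."""
    table = {}
    for line in excerpt.splitlines():
        entry = line.split("#", 1)[0]          # drop the trailing comment
        if not entry.strip():
            continue
        code, _, elements = entry.partition(";")
        weights = re.findall(
            r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})", elements)
        table[int(code.strip(), 16)] = [
            tuple(int(w, 16) for w in ce) for ce in weights]
    return table

def sort_key(text, table, strength):
    """Simplified key: non-zero weights of each level, 0 as level separator."""
    elements = [ce for ch in text for ce in table[ord(ch)]]
    key = []
    for level in range(strength):
        if level:
            key.append(0)
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)

table = parse_elements(ALLKEYS_EXCERPT)
small_tu, tu = "\u3063", "\u3064"
print(sort_key(small_tu, table, 1) == sort_key(tu, table, 1))  # True
print(sort_key(small_tu, table, 2) == sort_key(tu, table, 2))  # True
print(sort_key(small_tu, table, 3) == sort_key(tu, table, 3))  # False
```

With these weights the two letters only become distinct once the
tertiary level is included in the key.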

My suggestion is as follows.

3063 ; [.3267.0020.000D.3063] # HIRAGANA LETTER SMALL TU
3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
30C3 ; [.3267.0020.000F.30C3] # KATAKANA LETTER SMALL TU
FF6F ; [.3267.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM

Then [っ] (U+3063) and [つ] (U+3064) are always treated as different characters.
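
As a rough check of the suggested weights, the following Python sketch
groups the code points of the modified table by their non-zero primary
weights. The function name primary_groups is made up for this example,
and the grouping is only an illustration of the proposal, not a full
implementation of the collation algorithm.

```python
import re
from collections import defaultdict

# The modified entries exactly as suggested in this message.
PROPOSED = """\
3063 ; [.3267.0020.000D.3063] # HIRAGANA LETTER SMALL TU
3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
30C3 ; [.3267.0020.000F.30C3] # KATAKANA LETTER SMALL TU
FF6F ; [.3267.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM
"""

def primary_groups(excerpt):
    """Group code points that share the same non-zero primary weights."""
    groups = defaultdict(list)
    for line in excerpt.splitlines():
        code, _, rest = line.partition(";")
        elements = rest.split("#", 1)[0]       # drop the trailing comment
        primaries = re.findall(r"\[[.*]([0-9A-F]{4})\.", elements)
        key = tuple(p for p in primaries if p != "0000")
        groups[key].append(code.strip())
    return dict(groups)

groups = primary_groups(PROPOSED)
for key, members in groups.items():
    print(key, members)
```

The small letters (U+3063, U+30C3, U+FF6F) end up in one primary group
and every other entry in another, so the hiragana/katakana and
halfwidth distinctions stay at the tertiary level while small TU is
separated from TU at the primary level.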

This applies not only to [っ] and [つ]: all the character pairs in my
last mail should be modified in the same way.

Note that, as far as I know, the JIS standard does not specify a
collation algorithm or a sort order.

--
Satoshi Nakagawa

