Re: Japanese text handling problem in Unicode Collation Algorithm

From: Deborah Goldsmith (goldsmit@apple.com)
Date: Mon Oct 12 2009 - 20:38:15 CDT


    > However, even
    > for case differences, there are certainly lexical differences in
    > English (and other languages) where uppercase versus lowercase
    > are *not* optional, and do make systematic differences in meaning.
    > See, for example, German, where systematic uppercasing of nouns is
    > not optional, but a required aspect of spelling -- and where substituting
    > one character for the other would be considered simply wrong.

    That’s true, but Japanese is the only language that uses kana, so it’s not clear why the DUCET behavior isn’t language-appropriate without tailoring. There may well be a reason, but it’s not obvious that it has to be the way it is.

    Debbie

    On Oct 12, 2009, at 2:16 PM, Kenneth Whistler wrote:

    > Satoshi Nakagawa said:
    >
    >> I have checked the Unicode CLDR collation data, but it contains data
    >> only for the tertiary strength.
    >>
    >> IMHO, for example, [っ] (U+3063) and [つ] (U+3064) should be treated as
    >> different characters in the primary strength. Because these are never
    >> treated as the same characters in Japanese, even if these have similar
    >> glyphs.
    >
    > Nobody is disputing that they are not treated as the same characters
    > in Japanese.
    >
    > Note that for the purposes of weighting in the DUCET table, the only
    > difference between "A" and "a" is their tertiary weights -- but it
    > is quite clear to everyone that they are not the "same" characters
    > in English or any other language. However, that distinction is not carried
    > in the collation tables by forcing them to have primary weight distinctions.
    >
    > I think your point is that U+3063 and U+3064 are not alternate
    > spellings in Japanese -- so they are lexically distinct in ways
    > that case pairs of Latin letters typically are not. However, even
    > for case differences, there are certainly lexical differences in
    > English (and other languages) where uppercase versus lowercase
    > are *not* optional, and do make systematic differences in meaning.
    > See, for example, German, where systematic uppercasing of nouns is
    > not optional, but a required aspect of spelling -- and where substituting
    > one character for the other would be considered simply wrong.
    >
    >>
    >> I would suggest modifying the Default Unicode Collation Element Table.
    >>
    >> In http://www.unicode.org/Public/UCA/latest/allkeys.txt,
    >>
    >> 3063 ; [.27B0.0020.000D.3063] # HIRAGANA LETTER SMALL TU
    >> 3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
    >> 30C3 ; [.27B0.0020.000F.30C3] # KATAKANA LETTER SMALL TU
    >> FF6F ; [.27B0.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
    >> 30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
    >> FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
    >> 32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
    >> 3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
    >> 30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM
    >>
    >> this part specifies [っ] (U+3063) and [つ] (U+3064) are treated as the
    >> same character in the primary strength and the secondary strength.
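The point about the quoted entries can be checked mechanically. Below is a minimal sketch, assuming only the bracketed `[.pppp.ssss.tttt.cccc]` layout visible in the lines quoted above; `parse_allkeys` and `first_difference` are hypothetical helper names, and the excerpt is a subset of the quoted UCA 5.1 era entries, not the full table.

```python
import re

# A subset of the allkeys.txt lines quoted above (UCA 5.1 era values).
ALLKEYS_EXCERPT = """\
3063 ; [.27B0.0020.000D.3063] # HIRAGANA LETTER SMALL TU
3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
30C3 ; [.27B0.0020.000F.30C3] # KATAKANA LETTER SMALL TU
30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
"""

# One collation element: primary, secondary, tertiary, then a fourth field.
ELEMENT = re.compile(
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.[0-9A-F]{4,6}\]"
)


def parse_allkeys(text):
    """Map each character to its list of (primary, secondary, tertiary) weights."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code_points, elements = line.split(";", 1)
        char = "".join(chr(int(cp, 16)) for cp in code_points.split())
        table[char] = [
            tuple(int(w, 16) for w in m.groups())
            for m in ELEMENT.finditer(elements)
        ]
    return table


def first_difference(table, a, b):
    """Return 1, 2 or 3 for the first level at which a and b differ, else None."""
    for level in range(3):
        if [ce[level] for ce in table[a]] != [ce[level] for ce in table[b]]:
            return level + 1
    return None


table = parse_allkeys(ALLKEYS_EXCERPT)
print(first_difference(table, "\u3063", "\u3064"))
```

With these entries, U+3063 and U+3064 share primary and secondary weights and first differ at level 3, which is the behavior being described.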
    >>
    >> My suggestion would be like this.
    >>
    >> 3063 ; [.3267.0020.000D.3063] # HIRAGANA LETTER SMALL TU
    >> 3064 ; [.27B0.0020.000E.3064] # HIRAGANA LETTER TU
    >> 30C3 ; [.3267.0020.000F.30C3] # KATAKANA LETTER SMALL TU
    >> FF6F ; [.3267.0020.0010.FF6F] # HALFWIDTH KATAKANA LETTER SMALL TU; QQK
    >> 30C4 ; [.27B0.0020.0011.30C4] # KATAKANA LETTER TU
    >> FF82 ; [.27B0.0020.0012.FF82] # HALFWIDTH KATAKANA LETTER TU; QQK
    >> 32E1 ; [.27B0.0020.0013.32E1] # CIRCLED KATAKANA TU; QQK
    >> 3065 ; [.27B0.0020.000E.3064][.0000.018B.0002.3099] # HIRAGANA LETTER DU; QQCM
    >> 30C5 ; [.27B0.0020.0011.30C4][.0000.018B.0002.3099] # KATAKANA LETTER DU; QQCM
    >>
    >> Then [っ] (U+3063) and [つ] (U+3064) are always treated as different characters.
    >
    > Those particular weights would end up with a collation order completely
    > unlike what you show there -- as a primary weight of "3267" for the
    > small tu letters (for that UCA 5.1 version of the table) would result in
    > small tu sorting after *every* other Japanese syllable (after te, after to,
    > after na, .... after wa... -- indeed after every other character in
    > every script except Han by default).
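A rough way to see the effect of the proposed primary weight, using only the tu-family weights from the two tables quoted above. This is a simplified sketch: `sort_key` and `order` are hypothetical helpers, each character is assumed to map to a single collation element, and real UCA sort keys are byte strings with level separators rather than nested tuples.

```python
# (primary, secondary, tertiary) weights copied from the tables quoted above.
DUCET = {
    "\u3063": (0x27B0, 0x0020, 0x000D),  # HIRAGANA LETTER SMALL TU
    "\u3064": (0x27B0, 0x0020, 0x000E),  # HIRAGANA LETTER TU
    "\u30C3": (0x27B0, 0x0020, 0x000F),  # KATAKANA LETTER SMALL TU
    "\u30C4": (0x27B0, 0x0020, 0x0011),  # KATAKANA LETTER TU
}

# The suggested change: small tu gets primary weight 3267.
PROPOSED = dict(DUCET)
PROPOSED["\u3063"] = (0x3267, 0x0020, 0x000D)
PROPOSED["\u30C3"] = (0x3267, 0x0020, 0x000F)


def sort_key(text, table):
    """Compare all primaries first, then all secondaries, then all tertiaries."""
    elements = [table[ch] for ch in text]
    return tuple(tuple(ce[level] for ce in elements) for level in range(3))


def order(table):
    """Return the table's characters in collation order."""
    return sorted(table, key=lambda ch: sort_key(ch, table))


print(order(DUCET))
print(order(PROPOSED))
```

Under the quoted DUCET weights each small tu sorts just before its normal-sized counterpart. Under the proposed weights both small forms move after both normal forms, because the primary difference outranks everything else.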
    >
    >> And not only [っ] and [つ], all character pairs in my last mail should
    >> be also modified as well.
    >>
    >> Note that, as far as I know, the JIS standard does not specify a
    >> collation algorithm or sorting order.
    >
    > Mark is not talking about the JIS X 0208 (or JIS X 0212 or JIS X 0213)
    > character encoding standard. He is talking about the JIS X 4061-1996 Japanese
    > sorting standard.
    >
    > And that standard does specify the distinction of small kana versus their
    > large kana forms as a *third level* distinction in the sorting.
    >
    > More precisely:
    >
    > Level 1: The basic syllabic ordering:
    >
    > a i u e o ka ki ku ke ko ...
    >
    > Level 2: Diacritic ordering.
    >
    > A voiceless kana < voiced kana < semi-voiced (if exists)
    >
    > i.e. ka << ga, ha << ba << pa
    >
    > Level 3:
    >
    > Small kana <<< normal kana
    >
    > Level 4:
    >
    > Hiragana <<<< Katakana
    >
    > There is more to it than that, of course, including handling of the
    > prolonged sound mark, and the iteration marks.
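The four levels listed above can be sketched as a multi-level comparison. This is only an illustration of the level structure as summarized here: the `KANA` table, its rank numbers, and `jis_like_key` are hypothetical, cover a handful of characters, and are not taken from the JIS X 4061 text. The prolonged sound mark and iteration marks are not handled.

```python
# Hypothetical attributes per kana, for illustration only:
# (level-1 syllable rank, level-2 voicing, level-3 size, level-4 script)
# voicing: 0 voiceless, 1 voiced, 2 semi-voiced
# size:    0 small, 1 normal
# script:  0 hiragana, 1 katakana
KANA = {
    "か": (0, 0, 1, 0),  # ka
    "が": (0, 1, 1, 0),  # ga
    "っ": (1, 0, 0, 0),  # small tu
    "つ": (1, 0, 1, 0),  # tu
    "ッ": (1, 0, 0, 1),  # small tu, katakana
    "ツ": (1, 0, 1, 1),  # tu, katakana
    "は": (2, 0, 1, 0),  # ha
    "ば": (2, 1, 1, 0),  # ba
    "ぱ": (2, 2, 1, 0),  # pa
}


def jis_like_key(word):
    """Level 1 for the whole word first, then level 2, then 3, then 4."""
    attrs = [KANA[ch] for ch in word]
    return tuple(tuple(a[level] for a in attrs) for level in range(4))


print(sorted(["ぱ", "ば", "は"], key=jis_like_key))
print(sorted(["がっ", "かつ"], key=jis_like_key))
```

In the second call, "かつ" sorts before "がっ" even though the latter contains a small kana: the level-2 voicing difference is decided before the level-3 size difference is consulted.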
    >
    > But a pretty serious effort was made to get the DUCET table to
    > match the JIS X 4061 specification as closely as is feasible, given
    > the architectural constraints of UCA.
    >
    > So I'm going to disagree with the premise that the DUCET table per se
    > is at fault here.
    >
    > The issue, instead, seems to be that since ICU collation is built directly
    > on UCA and since certain open source (and proprietary) applications are
    > then built directly on ICU, they surface behavior that may not be optimal
    > for searching (or sorting) for all languages.
    >
    > And that actually is not too surprising, either, because DUCET is not
    > designed to provide optimal behavior for any given language without
    > tailoring.
    >
    > But in the case of Japanese, the issue for you seems to boil down to
    > the fact that a search on a Japanese string in Safari doesn't
    > distinguish between small and large kana. That amounts to a mistaken
    > assumption (IMO) that a tertiary distinction is not important in
    > distinguishing search terms for Japanese. In other words, because
    > the search is built on an ICU collator set to ignore tertiary distinctions
    > (i.e. it is effectively "case-folding" for matches), it is giving
    > false positive matches where you think it shouldn't.
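To make the false-positive behavior concrete, here is a toy matcher that compares collation weights only down to a chosen strength. It is a sketch under loud assumptions: `key` and `find_all` are hypothetical helpers, one character maps to one collation element, and nothing here reflects how ICU's string search or Safari is actually implemented. The weights are the DUCET values quoted earlier in the thread.

```python
# (primary, secondary, tertiary) weights from the quoted DUCET excerpt.
WEIGHTS = {
    "\u3063": (0x27B0, 0x0020, 0x000D),  # HIRAGANA LETTER SMALL TU
    "\u3064": (0x27B0, 0x0020, 0x000E),  # HIRAGANA LETTER TU
    "\u30C3": (0x27B0, 0x0020, 0x000F),  # KATAKANA LETTER SMALL TU
    "\u30C4": (0x27B0, 0x0020, 0x0011),  # KATAKANA LETTER TU
}


def key(text, strength):
    """Collation key truncated to the first `strength` levels (1 to 3)."""
    elements = [WEIGHTS[ch] for ch in text]
    return tuple(tuple(ce[level] for ce in elements) for level in range(strength))


def find_all(haystack, needle, strength):
    """Start offsets where needle matches haystack at the given strength."""
    n = len(needle)
    target = key(needle, strength)
    return [
        i
        for i in range(len(haystack) - n + 1)
        if key(haystack[i:i + n], strength) == target
    ]


text = "\u3064\u3063\u30C4"  # tu, small tu, katakana tu
print(find_all(text, "\u3063", strength=2))  # tertiary differences ignored
print(find_all(text, "\u3063", strength=3))  # tertiary differences respected
```

At strength 2 the small tu matches every position, which is the kind of false positive described above. At strength 3 only the real small tu matches, but hiragana and katakana are then distinguished as well, since in these entries that difference also lives at the tertiary level.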
    >
    > There are various ways to handle this, including tailoring and
    > language or script-specific differences in handling tertiary distinctions
    > for the purposes of search terms. But it seems clear to me that
    > it should *not* be "fixed" by changing the DUCET table at this point --
    > as that would be guaranteed to upset actual collation and sorting
    > by other applications, as well as disrupting the basis for any
    > Japanese tailorings for UCA that may already exist.
    >
    > --Ken


