Re: Japanese text handling problem in Unicode Collation Algorithm

From: Satoshi Nakagawa
Date: Tue Oct 13 2009 - 09:48:24 CDT


    On Tue, Oct 13, 2009 at 6:16 AM, Kenneth Whistler wrote:

    > Nobody is disputing that they are not treated as the same characters
    > in Japanese.

    I will :)

    > Note that for the purposes of weighting in the DUCET table, the only
    > difference between "A" and "a" is their tertiary weights -- but it
    > is quite clear to everyone that they are not the "same" characters
    > in English or any other language. However, that distinction is not carried
    > in the collation tables by forcing them to have primary weight distinctions.
    > I think your point is that U+3063 and U+3064 are not alternate
    > spellings in Japanese -- so they are lexically distinct in ways
    > that case pairs of Latin letters typically are not. However, even
    > for case differences, there are certainly lexical differences in
    > English (and other languages) where uppercase versus lowercase
    > are *not* optional, and do make systematic differences in meaning.
    > See, for example, German, where systematic uppercasing of nouns is
    > not optional, but a required aspect of spelling -- and where substituting
    > one character for the other would be considered simply wrong.

    I know the situation.

    My point is that the difference between small kana letters and big kana
    letters is weaker than the difference between uppercase and lowercase
    letters in the Latin alphabet.

    You can see this in Google search results.



    These two queries show completely different results, while [konig] and
    [König] return the same results.

    But maybe it's good to handle it with tailoring.

    > Mark is not talking about the JIS X 0208 (or JIS X 0212 or JIS X 0213)
    > character encoding standard. He is talking about the JIS X 4061-1996 Japanese
    > sorting standard.
    > And that standard does specify the distinction of small kana versus their
    > large kana forms as a *third level* distinction in the sorting.
    > More precisely:
    > Level 1: The basic syllabic ordering:
    > a i u e o ka ki ku ke ko ...
    > Level 2: Diacritic ordering.
    > A voiceless kana < voiced kana < semi-voiced (if exists)
    > i.e. ka << ga, ha << ba << pa
    > Level 3:
    > Small kana <<< normal kana
    > Level 4:
    > Hiragana <<<< Katakana
    > There is more to it than that, of course, including handling of the
    > prolonged sound mark, and the iteration marks.
    > But a pretty serious effort was made to get the DUCET table to
    > match the JIS X 4061 specification as closely as is feasible, given
    > the architectural constraints of UCA.
    > So I'm going to disagree with the premise that the DUCET table per se
    > is at fault here.

    I have read JIS X 4061, and I found that the current UCA does not
    conform to it correctly.

    In the spec, the 4 levels you wrote above are *not* levels but 4
    collation attributes. They are a set of rules for comparing 2 kana
    strings in 2 steps.

    The spec says:

        If multiple collation attributes are given,
        evaluate one by one in the specified order
        until you have a result.
        (4.1 (2)(a) in p.4)

    This means the attributes should not be applied separately, one per
    collation strength.
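
    A minimal sketch of why the two readings can disagree. The weights
    below are toy values chosen for illustration (not DUCET or JIS X 4061
    table data), and the names KANA, level_by_level and per_character are
    hypothetical. level_by_level compares one attribute across the whole
    string before moving to the next one; per_character compares the base
    strings first and then applies all attributes to each character in turn.

```python
# Toy weights for four hiragana; illustrative only, not DUCET or JIS values.
# Each entry: (base syllable index, diacritic, size, script)
#   diacritic: 0 = voiceless, 1 = voiced
#   size:      1 = small kana, 3 = big kana
#   script:    0 = hiragana
KANA = {
    "つ": (18, 0, 3, 0),
    "っ": (18, 0, 1, 0),
    "は": (26, 0, 3, 0),
    "ば": (26, 1, 3, 0),
}


def cmp(a, b):
    """Three-way comparison: -1, 0 or 1."""
    return (a > b) - (a < b)


def level_by_level(s, t):
    """Compare the whole strings on one attribute at a time."""
    for level in range(4):
        result = cmp([KANA[c][level] for c in s], [KANA[c][level] for c in t])
        if result:
            return result
    return 0


def per_character(s, t):
    """Compare base strings, then all attributes of each character in turn."""
    result = cmp([KANA[c][0] for c in s], [KANA[c][0] for c in t])
    if result:
        return result
    for x, y in zip(s, t):
        # Tuple comparison applies diacritic, then size, then script.
        result = cmp(KANA[x][1:], KANA[y][1:])
        if result:
            return result
    return 0


print(level_by_level("っば", "つは"), per_character("っば", "つは"))
```

    With these toy weights the two functions order "っば" and "つは" in
    opposite directions: level by level, the voiced ば decides first, while
    per character, the small っ is reached first.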

    I have translated the relevant parts of the spec below.

    JIS X 4061 p.3-4
    4.1 Fundamental collation rule

    This rule collates 2 strings in the following steps.

    (1) Make a base string and collate it

      (a) Make a base string from a string

          Convert a character to a corresponding base character
          to form a base string.

      (b) Collate each character by the character class order
          defined in 4.3 and the order in a character class
          defined in 4.4. If the collation result is not equal,
          return it as a result here.

    (2) If a result in (1) is equal, evaluate collation attributes.

      (a) For each character in the source strings, evaluate
          collation attributes as below.

          If the character classes are different, evaluate
          the character class order defined in 4.3 and return
          it as a result.

          If collation attributes are defined for the characters,
          evaluate using the collation attributes and return it
          as a result. If multiple collation attributes are given,
          evaluate one by one in the specified order
          until you have a result.

    JIS X 4061 p.10
    4.4.10 Kana

           ... Evaluate collation attributes

        Evaluate collation attributes of kana in order as below.

    (1) The first collation attribute (Diacritic marks)

        voiceless kana < voiced kana < semi-voiced kana

    (2) The second collation attribute

        dash < small kana < repeat mark < big kana

    (3) The third collation attribute (Kana types)

        hiragana < katakana
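
    As an illustration only, the three attributes above can be derived for
    a single kana from Unicode character data. This sketch assumes Python's
    standard unicodedata module; the function name kana_attributes and the
    numeric values are hypothetical, and nothing here is taken from the JIS
    tables themselves.

```python
import unicodedata

VOICED_MARK = "\u3099"       # combining voiced sound mark
SEMI_VOICED_MARK = "\u309A"  # combining semi-voiced sound mark


def kana_attributes(ch):
    """Return (diacritic, mark_or_size, kana_type) for one kana character."""
    decomposed = unicodedata.normalize("NFD", ch)
    name = unicodedata.name(decomposed[0])

    # (1) voiceless < voiced < semi-voiced
    if VOICED_MARK in decomposed:
        diacritic = 1
    elif SEMI_VOICED_MARK in decomposed:
        diacritic = 2
    else:
        diacritic = 0

    # (2) dash < small kana < repeat mark < big kana
    if "PROLONGED SOUND MARK" in name:
        mark_or_size = 0
    elif "SMALL" in name:
        mark_or_size = 1
    elif "ITERATION MARK" in name:
        mark_or_size = 2
    else:
        mark_or_size = 3

    # (3) hiragana < katakana; anything not named HIRAGANA counts as katakana here.
    kana_type = 0 if name.startswith("HIRAGANA") else 1

    return (diacritic, mark_or_size, kana_type)


for ch in "っつばパ":
    print(ch, kana_attributes(ch))
```

    Because the result is a tuple, comparing two results applies the
    attributes one by one in the specified order until one of them differs.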


    To conform to JIS X 4061 correctly, I would suggest:

    (1) Apply only 4.1-(1) for the primary strength.
    (2) Apply both 4.1-(1) and 4.1-(2) for the secondary strength, that is,
    also apply the collation attributes.
    (3) Treat halfwidth katakana letters and normal katakana letters as
    the same at the primary, secondary and tertiary strengths.
    (I'm not sure, but there should be a JIS X spec for this case.)

    This assumes these strengths would be mapped onto the current collation
    strengths.

    But if so, we cannot use strength (2) for Japanese text together with
    the primary collation strength for other languages. This is
    the same case as the Google search described above.
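
    For suggestion (3) above, one possible way to treat halfwidth and
    normal katakana as the same is to fold both to a common form before
    comparing. The sketch below assumes that NFKC normalization is an
    acceptable stand-in for that folding; NFKC also folds other
    compatibility characters (fullwidth Latin letters, for example), so it
    is broader than the suggestion strictly needs. The function names are
    hypothetical.

```python
import unicodedata


def fold_width(text):
    """Fold halfwidth katakana (and other compatibility forms) via NFKC."""
    return unicodedata.normalize("NFKC", text)


def same_ignoring_width(a, b):
    """True if two strings differ only by forms that NFKC folds together."""
    return fold_width(a) == fold_width(b)


print(fold_width("ｶﾞｲﾄﾞ"))
print(same_ignoring_width("ｶﾀｶﾅ", "カタカナ"))
```

    Note that the halfwidth voiced sound mark is composed with the
    preceding letter, so halfwidth "ｶﾞ" folds to the single letter "ガ"
    and stays distinct from "カ".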

    IMHO, to make UCA more useful for real use cases in Japan, these
    strengths should be separated from the current system and made into a
    Japanese collation strength system, so that we can specify the Japanese
    collation strength and the current UCA collation strength separately.

    Satoshi Nakagawa
