Re: Japanese text handling problem in Unicode Collation Algorithm

From: Satoshi Nakagawa (psychs@limechat.net)
Date: Tue Oct 13 2009 - 09:48:24 CDT

Next message: karl williamson: "Default values for Bidi_Mirroring_Glyph"

Previous message: Erkki I. Kolehmainen: "[MEEK] Review of the CWA Draft till 9 November 2009 and the Final Meeting of the WS on 17 November 2009."
In reply to: Kenneth Whistler: "Re: Japanese text handling problem in Unicode Collation Algorithm"
Next in thread: Kent Karlsson: "Re: Japanese text handling problem in Unicode Collation Algorithm"
Reply: Kent Karlsson: "Re: Japanese text handling problem in Unicode Collation Algorithm"

On Tue, Oct 13, 2009 at 6:16 AM, Kenneth Whistler <kenw@sybase.com> wrote:

> Nobody is disputing that they are not treated as the same characters
> in Japanese.

I will :)

> Note that for the purposes of weighting in the DUCET table, the only
> difference between "A" and "a" is their tertiary weights -- but it
> is quite clear to everyone that they are not the "same" characters
> in English or any other language. However, that distinction is not carried
> in the collation tables by forcing them to have primary weight distinctions.
>
> I think your point is that U+3063 and U+3064 are not alternate
> spellings in Japanese -- so they are lexically distinct in ways
> that case pairs of Latin letters typically are not. However, even
> for case differences, there are certainly lexical differences in
> English (and other languages) where uppercase versus lowercase
> are *not* optional, and do make systematic differences in meaning.
> See, for example, German, where systematic uppercasing of nouns is
> not optional, but a required aspect of spelling -- and where substituting
> one character for the other would be considered simply wrong.

I know the situation.

My point is that small kana letters and big kana letters are more
distinct from each other than uppercase and lowercase letters are
in the Latin alphabet.

You can see this in Google search results.

[あつた]
http://www.google.com/search?q=%E3%81%82%E3%81%A4%E3%81%9F

[あった]
http://www.google.com/search?q=%E3%81%82%E3%81%A3%E3%81%9F

These two queries show completely different results, while [konig] and
[König] return the same results.
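
As a small check on the two query URLs above, here is a minimal Python
sketch that decodes them and reports where the strings differ. The
variable names are only for illustration. The single difference it
reports is U+3064 (HIRAGANA LETTER TU) versus U+3063 (HIRAGANA LETTER
SMALL TU), the pair discussed in the quoted message.

```python
from urllib.parse import unquote
import unicodedata

# Percent-encoded queries copied from the two search URLs above.
q_big = unquote("%E3%81%82%E3%81%A4%E3%81%9F")
q_small = unquote("%E3%81%82%E3%81%A3%E3%81%9F")

# Collect the positions where the two decoded strings differ.
diffs = [(a, b) for a, b in zip(q_big, q_small) if a != b]

for a, b in diffs:
    print(f"U+{ord(a):04X} {unicodedata.name(a)}  vs  "
          f"U+{ord(b):04X} {unicodedata.name(b)}")
```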

But maybe it's good to handle it with tailoring.

> Mark is not talking about the JIS X 0208 (or JIS X 0212 or JIS X 0213)
> character encoding standard. He is talking about the JIS X 4061-1996 Japanese
> sorting standard.
>
> And that standard does specify the distinction of small kana versus their
> large kana forms as a *third level* distinction in the sorting.
>
> More precisely:
>
> Level 1: The basic syllabic ordering:
>
> a i u e o ka ki ku ke ko ...
>
> Level 2: Diacritic ordering.
>
> A voiceless kana < voiced kana < semi-voiced (if exists)
>
> i.e. ka << ga, ha << ba << pa
>
> Level 3:
>
> Small kana <<< normal kana
>
> Level 4:
>
> Hiragana <<<< Katakana
>
> There is more to it than that, of course, including handling of the
> prolonged sound mark, and the iteration marks.
>
> But a pretty serious effort was made to get the DUCET table to
> match the JIS X 4061 specification as closely as is feasible, given
> the architectural constraints of UCA.
>
> So I'm going to disagree with the premise that the DUCET table per se
> is at fault here.

I have read JIS X 4061, and I found that the current UCA does not
conform to it correctly.

In the spec, the 4 levels you wrote above are *not* levels, but
collation attributes. They are part of a set of rules for kana that
compares 2 strings in 2 steps.

The spec says:

    If multiple collation attributes are given,
    evaluate one by one in the specified order
    until you have a result.
    (4.1 (2)(a) in p.4)

This means the attributes should not be applied separately, one per
collation strength.
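
To illustrate why this matters, here is a minimal Python sketch that
contrasts the two evaluation orders. It is a simplified model, not real
DUCET or JIS X 4061 data: each character is a hypothetical tuple
(base, voicing, size, kana type), and both function names are made up
for this example. The first function finishes one level over the whole
string before looking at the next, as UCA levels do. The second
compares the base strings first and then goes character by character,
trying the attributes in order.

```python
def cmp(a, b):
    """Three-way comparison: -1, 0 or 1."""
    return (a > b) - (a < b)

def compare_by_level(x, y):
    """Level-wise: exhaust each level over the whole string in turn."""
    for level in range(4):
        r = cmp([c[level] for c in x], [c[level] for c in y])
        if r:
            return r
    return 0

def compare_by_character(x, y):
    """Base string first, then per character, attributes in order."""
    r = cmp([c[0] for c in x], [c[0] for c in y])
    if r:
        return r
    # Equal base strings have equal length, so zip() drops nothing.
    for cx, cy in zip(x, y):
        r = cmp(cx[1:], cy[1:])  # tuple order = attributes in order
        if r:
            return r
    return 0

# Illustrative weights: voicing 0 = voiceless, 1 = voiced;
# size 0 = small, 1 = big; kana type 0 = hiragana.
x = [("tu", 0, 0, 0), ("ka", 1, 1, 0)]  # small tu + ga
y = [("tu", 0, 1, 0), ("ka", 0, 1, 0)]  # big tu + ka

print(compare_by_level(x, y), compare_by_character(x, y))
```

With these weights the level-wise comparison puts x after y, because
the voicing of the second character is seen first, while the
character-wise comparison puts x before y, because the small kana in
the first character decides.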

I have translated the relevant parts of the spec below.

----------------------------------------
JIS X 4061 p.3-4
--------------------
4.1 Fundamental collation rule

This rule collates 2 strings in the following steps.

(1) Make a base string and collate it

  (a) Make a base string from a string

      Convert a character to a corresponding base character
      to form a base string.

  (b) Collate each character by the character class order
      defined in 4.3 and the order in a character class
      defined in 4.4. If the collation result is not equal,
      return it as a result here.
      ...

(2) If a result in (1) is equal, evaluate collation attributes.

  (a) For each character in the source strings, evaluate
      collation attributes as below.

      If the character classes are different, evaluate
      the character class order defined in 4.3 and return
      it as a result.

      If collation attributes are defined for the characters,
      evaluate using the collation attributes and return it
      as a result. If multiple collation attributes are given,
      evaluate one by one in the specified order
      until you have a result.
      ...

----------------------------------------
JIS X 4061 p.10
--------------------
4.4.10 Kana

...

4.4.10.2 Evaluate collation attributes

Evaluate collation attributes of kana in order as below.

(1) The first collation attribute (Diacritic marks)

voiceless kana < voiced kana < semi-voiced kana

(2) The second collation attribute

dash < small kana < repeat mark < big kana

(3) The third collation attribute (Kana types)

hiragana < katakana

----------------------------------------
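
The translated rule can be sketched in Python as follows. This is a
minimal sketch under stated assumptions, not an implementation of the
standard: the ATTRS table is a hypothetical hand-made excerpt covering
a few kana only, base characters are ordered by code point instead of
the class order of 4.3 and 4.4, and the dash and repeat mark are left
out.

```python
from functools import cmp_to_key

# Hypothetical excerpt: kana -> (base kana, voicing, size, kana type).
# voicing: 0 voiceless, 1 voiced, 2 semi-voiced
# size:    0 small, 1 big (dash and repeat mark omitted)
# type:    0 hiragana, 1 katakana
ATTRS = {
    "あ": ("あ", 0, 1, 0),
    "つ": ("つ", 0, 1, 0),
    "っ": ("つ", 0, 0, 0),
    "ツ": ("つ", 0, 1, 1),
    "た": ("た", 0, 1, 0),
    "だ": ("た", 1, 1, 0),
}

def jis_compare(s, t):
    # Step (1): compare the base strings.
    base_s = [ATTRS[c][0] for c in s]
    base_t = [ATTRS[c][0] for c in t]
    if base_s != base_t:
        return -1 if base_s < base_t else 1
    # Step (2): per character, try the attributes in order
    # until one of them gives a result.
    for cs, ct in zip(s, t):
        a, b = ATTRS[cs][1:], ATTRS[ct][1:]
        if a != b:
            return -1 if a < b else 1
    return 0

words = ["あつだ", "あツた", "あつた", "あっだ", "あった"]
ordered = sorted(words, key=cmp_to_key(jis_compare))
print(ordered)
```

All five words share the base string あつた, so step (2) decides the
order. The two words with small っ come first, even though one of them
ends in voiced だ.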

To conform to JIS X 4061 correctly, I would suggest:

(1) Apply only 4.1-(1) for the primary strength.
(2) Apply both 4.1-(1) and 4.1-(2) for the secondary strength. That
means also applying the collation attributes.
(3) Treat halfwidth katakana letters and normal katakana letters as
the same at the primary, the secondary and the tertiary strength.
(I'm not sure, but there should be a JIS X spec for this case.)
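
For point (3), one possible way to treat halfwidth and normal katakana
as the same is to fold the width before comparing. The sketch below
uses Unicode NFKC normalization for that. This is only an assumption
for illustration, not something JIS X 4061 prescribes, and fold_width
is a made-up helper name. Note that NFKC also folds other compatibility
characters (fullwidth Latin letters, for example), which may or may not
be wanted.

```python
import unicodedata

def fold_width(s):
    """Map halfwidth katakana (and other compatibility forms) to
    their normal counterparts via NFKC normalization."""
    return unicodedata.normalize("NFKC", s)

# Halfwidth KA followed by the halfwidth voiced sound mark
# composes into the single normal katakana GA.
print(fold_width("ｶﾞ") == "ガ")

# Halfwidth small TU maps to the normal small TU.
print(fold_width("ｱｯﾀ") == "アッタ")
```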

This assumes these strengths would be mapped to the current collation strengths.

But if so, we cannot use the secondary strength described in (2) for
Japanese text together with the primary collation strength for the
other languages. This is the same situation as the Google search
described above.

IMHO, to make UCA more useful for real use cases in Japan, these
strengths should be separated from the current system and made into a
Japanese collation strength system, so that we can specify the Japanese
collation strength and the current UCA collation strength separately.

--
Satoshi Nakagawa


This archive was generated by hypermail 2.1.5 : Tue Oct 13 2009 - 09:52:59 CDT