Re: Contradiction in casing information in Unicode official sources.

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Apr 09 2008 - 13:34:52 CDT


    Jim Allan said:

    > According to Unicode specifications from Unicode version 1.0 up to the
    > current version of Unicode, the character U+026A LATIN LETTER SMALL
    > CAPITAL I (ɪ) capitalizes as U+0197 LATIN CAPITAL LETTER I WITH STROKE (Ɨ).

    That is an incorrect statement of the Unicode specification.

    "Capitalizes as" is defined by the Simple_Uppercase_Mapping field
    (field #12, counting the code point as field 0) in UnicodeData.txt.

    For Unicode 5.1, we have:

    0197;LATIN CAPITAL LETTER I WITH STROKE;Lu;...;;0268;
    0268;LATIN SMALL LETTER I WITH STROKE;Ll;...;0197;;0197
    026A;LATIN LETTER SMALL CAPITAL I;Ll;...;;;

    What that says is that U+0268 (*not* U+026A) has a Simple_Uppercase_Mapping
    to U+0197.

    And those case mapping values have remained unchanged since Unicode 2.0,
    when they were first made available in machine-readable form in
    UnicodeData.txt.
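For anyone who wants to check this mechanically, here is a minimal sketch of reading the case fields out of a UnicodeData.txt record. It assumes the usual 15-field, semicolon-separated layout with fields counted from 0. The names `case_fields` and `make_record` are hypothetical, and `make_record` deliberately leaves fields 3 through 11 empty, because the records quoted above elide them with "...".

```python
# Zero-based field positions in a UnicodeData.txt record.
SIMPLE_UPPERCASE = 12
SIMPLE_LOWERCASE = 13
SIMPLE_TITLECASE = 14


def case_fields(record: str) -> dict:
    """Split one full UnicodeData.txt record and return its simple case mappings."""
    fields = record.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return {
        "code": fields[0],
        "upper": fields[SIMPLE_UPPERCASE] or None,
        "lower": fields[SIMPLE_LOWERCASE] or None,
        "title": fields[SIMPLE_TITLECASE] or None,
    }


def make_record(code, name, gc, upper="", lower="", title=""):
    """Hypothetical helper: build a 15-field record with fields 3-11 left empty.

    The empty middle fields are placeholders, not real UnicodeData.txt values.
    """
    return ";".join([code, name, gc] + [""] * 9 + [upper, lower, title])


# Case fields copied from the three records quoted in this message.
records = [
    make_record("0197", "LATIN CAPITAL LETTER I WITH STROKE", "Lu", lower="0268"),
    make_record("0268", "LATIN SMALL LETTER I WITH STROKE", "Ll",
                upper="0197", title="0197"),
    make_record("026A", "LATIN LETTER SMALL CAPITAL I", "Ll"),
]

for rec in records:
    print(case_fields(rec))
```

Run against the real file, the same `case_fields` function should work on each line unchanged, since it only relies on the field count and positions.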

    > See the official Unicode charts for the IPA Extension at
    > http://www.unicode.org/charts/PDF/U0250.pdf .
    >
    > Under U+026A ɪ LATIN SMALL LETTER CAPITAL I the charts state:
    > “→ 0197 Ɨ Latin capital letter i with stroke”.

    A cross-reference annotation in the Unicode names list does not
    define a case mapping, and never has. It may refer to a
    character which is in a case mapping relation to the character
    which is annotated, but that is only one of many possible meanings
    of a cross-reference. Most commonly the cross-reference simply
    means: "may appear confusingly similar to character XXXX". See
    "Cross References" on p. 566 of TUS 5.0.

    >
    > Under U+0268 ɨ LATIN SMALL LETTER I WITH STROKE the charts state:
    > “• ISO 6438 gives lowercase of 0197 Ɨ as 026A ɪ not 0268 ɨ”.

    That was done to recognize the fact that ISO 6438 specifies a
    different case mapping than the Unicode Standard does.

    > But the Unicode case folding table at
    > http://www.unicode.org/Public/UNIDATA/CaseFolding.txt has long disagreed.

    No, actually. It has long agreed, and is completely consistent
    with the Simple_Uppercase_Mapping value for U+0268 (and the
    Simple_Lowercase_Mapping for U+0197).
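The agreement between the case mappings and the case folding can also be spot-checked from a scripting language. The sketch below uses Python's built-in `str.upper`, `str.lower` and `str.casefold`, which reflect whatever version of the Unicode data the interpreter ships with and apply full rather than simple mappings (for these three characters that should make no difference). The helper names `cps` and `describe` are made up for illustration.

```python
def cps(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


def describe(ch: str) -> dict:
    """Report the upper, lower and case-folded forms of one character."""
    return {
        "char": cps(ch),
        "upper": cps(ch.upper()),
        "lower": cps(ch.lower()),
        "fold": cps(ch.casefold()),
    }


for ch in ("\u0197", "\u0268", "\u026A"):
    print(describe(ch))

# U+0268 and U+0197 form the case pair; folding U+0197 gives U+0268.
# U+026A is not expected to map to U+0197 under any of these operations.
# (Its uppercase may differ between Unicode versions, so the sketch only
# checks that it is not U+0197.)
```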

    > To summarize, the position in the casefolding table is:
    > U+026A ɪ LATIN SMALL LETTER CAPITAL I does not case
    > U+0268 ɨ LATIN SMALL LETTER I WITH STROKE uppercases to U+0197 LATIN
    > CAPITAL LETTER I WITH STROKE (Ɨ).

    That is correct.

    >
    > The position in the Unicode printed material is:
    > U+0268 ɨ LATIN SMALL LETTER I WITH STROKE does not appear in the table
    > so therefore does not case.
    > U+026A ɪ LATIN SMALL LETTER CAPITAL I uppercases to U+0197 LATIN CAPITAL
    > LETTER I WITH STROKE (Ɨ).

    And that is a misinterpretation of the Unicode names list.

    --Ken


