RE: Case mapping of dotless lowercase letters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 15 2003 - 19:12:55 EST


    Markus Scherer wrote:
    > It still comes back to what Doug said: The default rules make sense for
    > most languages, while in order to make sense for Turkic languages, you
    > must use special rules for them. There is no way around it - it comes
    > from the fact that they use the same letters in a different way.

    You have misread me: I'm not interested in the Turkic case, but in NON-Turkic
    languages, i.e. exactly in the default rule, which:
    - does not differentiate the dotted uppercase I and the undotted uppercase I
    when case folding them to the SAME soft-dotted lowercase i,
    - but DOES differentiate the soft-dotted lowercase i and the dotless
    lowercase i, even though the uppercase mapping drops that difference!

    This means, for non-Turkic languages or in a locale-neutral environment,
    that although the two characters stay distinct when case folded, this
    difference is not kept when converting to uppercase.
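
    As a rough illustration of this asymmetry: Python's built-in str methods
    apply the default, locale-neutral Unicode mappings, so they can serve as a
    stand-in for the locale-neutral environment. The constant names below are
    only illustrative and are not taken from any Unicode data file.

```python
# Python's str methods use the default (locale-neutral) Unicode case mappings.
SOFT_DOTTED_I = "\u0069"  # LATIN SMALL LETTER I
DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I

# Case folding keeps the two lowercase letters distinct...
folded_equal = SOFT_DOTTED_I.casefold() == DOTLESS_I.casefold()

# ...but uppercasing maps both of them to U+0049 (LATIN CAPITAL LETTER I).
upper_equal = SOFT_DOTTED_I.upper() == DOTLESS_I.upper()

print(folded_equal, upper_equal)  # prints: False True
```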

    This problem does not occur when using the Turkic case folding rules, so
    Turkish and Azeri names don't have this problem with the lowercase dotless
    i!

    So it's a consistency problem; even for German we already have:
            LocaleNeutralFullCaseFolding(<Ess-Tsett>) =
                    <Small S, Small S>;
            LowerCase(UpperCase(<Ess-Tsett>)) =
                    LowerCase(<Capital S, Capital S>) =
                    <Small S, Small S>;
            Both results are equal as expected.
    and:
            TurkicFullCaseFolding(<Small Dotless I>) =
                    <Small Dotless I>
            LowerCase(UpperCase(<Small Dotless I>)) =
                    LowerCase(<Capital (dotless) I>) =
                    <Small Dotless I>
            Both results are equal as expected.
    but:
            LocaleNeutralFullCaseFolding(<Small Dotless I>) =
                    <Small Dotless I>
            LowerCase(UpperCase(<Small Dotless I>)) =
                    LowerCase(<Capital (dotless) I>) =
                    <Small (soft-dotted) I>
            Both results are unexpectedly different.
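
    The locale-neutral comparisons above can be reproduced with a short Python
    sketch, again assuming that Python's str.casefold(), str.upper() and
    str.lower() are an acceptable stand-in for the default full mappings. The
    helper round_trip() is a hypothetical name introduced here for
    LowerCase(UpperCase(x)); the Turkic variant is not covered, since these
    methods are not locale-aware.

```python
def round_trip(s: str) -> str:
    """LowerCase(UpperCase(s)) using the default, locale-neutral mappings."""
    return s.upper().lower()

SHARP_S = "\u00df"    # LATIN SMALL LETTER SHARP S (Ess-Tsett)
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Ess-Tsett: folding and the round trip both give <Small S, Small S>.
sharp_s_consistent = SHARP_S.casefold() == round_trip(SHARP_S)

# Dotless i: folding keeps U+0131, the round trip gives U+0069.
dotless_i_consistent = DOTLESS_I.casefold() == round_trip(DOTLESS_I)

print(sharp_s_consistent, dotless_i_consistent)  # prints: True False
```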

    The last two results would be identical as expected, if we had a rule in
    CaseFolding.txt so that:
            LocaleNeutralFullCaseFolding(<Small Dotless I>) =
                    <Small (soft-dotted) I>
    And this rule is possible without even breaking the current rules for Turkic
    languages, by just adding these two lines to CaseFolding.txt:

    0131; F; 0069; # LATIN SMALL LETTER DOTLESS I
    0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
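
    A minimal sketch of how the proposed rule for LATIN SMALL LETTER DOTLESS I
    (U+0131) would behave, layered on top of Python's default str.casefold().
    The function fold() and both override tables are hypothetical names made
    up for this illustration. Only the U+0131 entries reflect the proposal;
    the U+0049 and U+0130 entries in the Turkic table are assumed to match the
    existing T-status lines of CaseFolding.txt.

```python
# Hypothetical override tables, applied before the default case folding.
PROPOSED_DEFAULT = {"\u0131": "\u0069"}  # proposed: dotless i -> soft-dotted i
TURKIC = {
    "\u0049": "\u0131",  # assumed existing T rule: capital I -> dotless i
    "\u0130": "\u0069",  # assumed existing T rule: capital dotted I -> i
    "\u0131": "\u0131",  # proposed: dotless i stays distinct for Turkic
}

def fold(s: str, turkic: bool = False) -> str:
    """Case fold s, letting the override table win over the default rule."""
    overrides = TURKIC if turkic else PROPOSED_DEFAULT
    return "".join(overrides.get(ch, ch.casefold()) for ch in s)

DOTLESS_I = "\u0131"

# Locale-neutral: folding now agrees with LowerCase(UpperCase(x)).
neutral_consistent = fold(DOTLESS_I) == DOTLESS_I.upper().lower()

# Turkic: the dotless i is preserved, as before.
turkic_preserved = fold(DOTLESS_I, turkic=True) == DOTLESS_I

print(neutral_consistent, turkic_preserved)  # prints: True True
```

    With the override in place, the locale-neutral folding of the dotless i
    and its uppercase/lowercase round trip both yield U+0069, while the Turkic
    path is left unchanged.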

    This archive was generated by hypermail 2.1.5 : Mon Dec 15 2003 - 20:29:48 EST