RE: Case mapping of dotless lowercase letters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 16 2003 - 05:03:40 EST

    Doug Ewell <dewell@adelphia.net> writes:
    > > Wrong here: I have found occurrences of dotless lowercase i, used
    > > instead of soft-dotted lowercase i, as base letters for diacritics
    > > added above it (it was an acute accent...)
    >
    > Don't do that.

    What? Text coded like this is VALID UNICODE. The proposed change for
    soft-dotted/dotless letters used with diacritics is still not in the
    standard, and it only gives rendering hints so that both base letters
    render the same way, requiring an explicit dot when the soft dot must be
    kept with the diacritic.

    > > There were two sequences which looked identical when rendered,
    > > and which were distinct under a case-folded comparison:
    > >
    > > (1) LATIN SMALL LETTER I, COMBINING ACUTE ACCENT
    > > (2) LATIN SMALL LETTER DOTLESS I, COMBINING ACUTE ACCENT
    > >
    > > but were no longer distinct when converted to uppercase in a
    > > locale-neutral environment not using the Turkic rules:
    > >
    > > (1') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT
    > > (2') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT
    >
    > OK, so you want the default, locale-neutral case mapping tables to equate
    > U+0069 with U+0131, right?

    Yes. And I have good reasons for that, coming from the fact that the default
    locale-neutral mapping tables already equate their uppercase versions U+0049
    with U+0130, by returning U+0069 for both of them.
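
    The asymmetry discussed above can be checked with a short sketch. It
    assumes Python 3, whose str.upper(), str.lower() and str.casefold() apply
    the default (non-Turkic) full case mappings; the variable names are only
    illustrative. Note that Python's full lowercase mapping of U+0130 yields
    U+0069 followed by U+0307, so the result starts with U+0069 rather than
    being exactly U+0069.

```python
# Sequences (1) and (2) from the quoted message.
seq1 = "\u0069\u0301"  # LATIN SMALL LETTER I + COMBINING ACUTE ACCENT
seq2 = "\u0131\u0301"  # LATIN SMALL LETTER DOTLESS I + COMBINING ACUTE ACCENT

# Default case folding keeps U+0131 as it is, so the two stay distinct.
folded_equal = seq1.casefold() == seq2.casefold()

# Default uppercasing maps both U+0069 and U+0131 to U+0049.
upper_equal = seq1.upper() == seq2.upper()

# Lowercasing the two uppercase letters with the default mappings.
lower_0049 = "\u0049".lower()
lower_0130 = "\u0130".lower()

print("equal after case folding:", folded_equal)
print("equal after uppercasing: ", upper_equal)
print("lower(U+0049):", [f"U+{ord(c):04X}" for c in lower_0049])
print("lower(U+0130):", [f"U+{ord(c):04X}" for c in lower_0130])
```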

    > This is close to being a spoofing problem, though. See TUS 4.0, page
    > 141.

    If you think this is a spoofing problem, then the existing locale-neutral
    full case mapping of U+0130 is bogus and should not be U+0069...

    > > The string (2) may have been produced to avoid displaying the dot
    > > with some fonts that don't apply the soft-dotted rule when there's
    > > an additional diacritic above...
    >
    > Don't do that. That's misusing the standard. The font should be fixed
    > instead.

    For whatever reason, encoded texts exist before correct fonts are used to
    render them. So there do exist texts which use dotless lowercase i before
    a diacritic above, simply because the author of the text did not want it to
    be rendered with a superimposed dot. These texts are clearly not Turkic (in
    Turkish or Azeri, the dot of the soft-dotted i should have been displayed
    with the diacritic above it, and the dotless i should have been used to
    avoid it explicitly).

    But this is not the only reason: I can give other examples which also have
    security and filesystem impacts.

    Suppose you have a database of user names or file names allowing
    internationalized names coded along the recommended Unicode principles. But
    these names are used in a way that makes it impossible to track the language
    in which these names are entered (file names, user names or address fields
    in an entry form are such cases).

    Now provide a facility that identifies and rejects duplicate
    case-equivalents, using full mappings. Because you can't track the language,
    you'll need to use the default locale-neutral full case mappings.

    Now a Turkish user enters a name or address in an entry form, or creates
    files with dotless lowercase i in it, and later attempts to re-enter its
    case-equivalent (dotless) uppercase I. The system will not identify both as
    being case equivalents, so it will accept both as if they were distinct.
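
    A minimal sketch of such a duplicate check, assuming Python 3 and using
    str.casefold() as a stand-in for the default locale-neutral full case
    folding. The register() helper, the registry dictionary and the sample
    names are hypothetical and only serve to illustrate the scenario:

```python
def register(name, registry):
    """Store name unless a case-equivalent entry already exists."""
    key = name.casefold()  # default folding, no Turkic tailoring
    if key in registry:
        return False       # rejected as a case-insensitive duplicate
    registry[key] = name
    return True

registry = {}

# Lowercase form typed with dotless i (U+0131).
first = register("\u0131\u015f\u0131k", registry)

# Its Turkish uppercase form, typed with plain capital I (U+0049).
second = register("I\u015eIK", registry)

# A form with dotted i (U+0069), which does collide with the second entry.
third = register("i\u015fik", registry)

print(first, second, third)
```

    Under these assumptions the first two entries are both accepted, because
    U+0131 folds to itself while U+0049 folds to U+0069, and only the third
    entry is rejected as a duplicate.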

    The Turkish user or the system then attempts to list files or database table
    fields matching some regular expression like "i*" with the case-insensitive
    option, to count the number of occurrences of the names containing a
    (soft-)dotted i (or I). He will get all files containing one of three of the
    four codes (U+0069, U+0049 and U+0130), and not the fourth one (U+0131).
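
    The lookup described above can be sketched as follows, again assuming
    Python 3 and its default case folding. The matches_prefix_caseless()
    helper and the sample names are hypothetical, and a simple prefix test
    stands in for the "i*" pattern. A real regular expression engine may
    treat these letters differently in its case-insensitive mode, so this
    only models matching by default case folding.

```python
def matches_prefix_caseless(name, prefix):
    """Case-insensitive prefix test using the default case folding."""
    return name.casefold().startswith(prefix.casefold())

names = [
    "ilk",        # starts with U+0069 LATIN SMALL LETTER I
    "ILK",        # starts with U+0049 LATIN CAPITAL LETTER I
    "\u0130LK",   # starts with U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131lk",   # starts with U+0131 LATIN SMALL LETTER DOTLESS I
]

matched = [n for n in names if matches_prefix_caseless(n, "i")]

for n in names:
    status = "match" if n in matched else "no match"
    print(f"U+{ord(n[0]):04X}", status)
```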





