RE: Case mapping of dotless lowercase letters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 16 2003 - 10:48:58 EST


    Michael Everson wrote:
    > At 11:03 +0100 2003-12-16, Philippe Verdy wrote:
    > >Doug Ewell <dewell@adelphia.net> writes:
    > > > > Wrong here: I have found occurrences of dotless lowercase i, used
    > > > > instead of soft-dotted lowercase i, as base letters for diacritics
    > > > > added above it (it was an acute accent...)
    > > >
    > > > Don't do that.
    > >
    > >What? This is VALID UNICODE to have texts coded like this.
    >
    > In Irish, it is INCORRECT to spell "físeán"
    > 'video' with a DOTLESS I + COMBINING ACUTE. It is
    > a spelling error, and will fail in
    > spell-checking. The correct spelling is either I
    > + COMBINING ACUTE or precomposed I WITH ACUTE.

    Spelling was not the issue there. Only Unicode validity.

    > >For whatever reason, encoded texts exist before correct fonts are used to
    > >render them. So there do exist texts which use dotless lowercase i
    > >before a diacritic above, simply because the author of the text did not
    > >want it to be rendered with a superposed dot.
    >
    > Texts which contain spelling errors. Or old IPA
    > texts using any number of ad-hoc IPA font
    > solutions. Those texts have to be transcoded to
    > proper Unicode at some stage. What you suggest is
    > Not Recommended.

    Not recommended, but still valid (and actually used in Turkish as well!). It
    is also used on some occasions to work around fonts that have no
    precomposed glyph for the letter i with the diacritic but do have separate
    glyphs for the combining diacritic and for the dotted and dotless letters i,
    or to work around renderers unable to remove the soft dot. The IPA-93 font
    is one such font: it allows good typesetting, but it needs glyph processing
    to select the appropriate base letter.

    My main issue, however, is with Turkish names found in environments where
    language identification is not possible (for example a simple filename, a
    locale-neutral database field, or an international HTML form which requests
    user names and uses them as case-insensitive identifiers); lowercase dotless
    i does not work appropriately there.

    I think it is completely illogical for locale-neutral case-insensitive
    compares to match together the three letters:
            LATIN SMALL LETTER I (dotted)
            LATIN CAPITAL LETTER I (dotless)
            LATIN CAPITAL LETTER I WITH DOT ABOVE
    but not:
            LATIN SMALL LETTER DOTLESS I
    given that the normative uppercase mapping of this fourth letter is the
    second letter above.
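
    To make the four letters concrete, here is a minimal sketch that tabulates
    their default, locale-neutral mappings. It assumes Python 3 and its
    built-in str.upper(), str.lower() and str.casefold(); the helper name
    describe() and the table variable are only illustrative. Note that Python
    applies the full (not simple) mappings, so the results for U+0130 may
    contain U+0307 COMBINING DOT ABOVE, and other implementations may differ.

```python
import unicodedata

# The four letters discussed above, in the same order.
LETTERS = ["\u0069", "\u0049", "\u0130", "\u0131"]


def describe(ch):
    """Return the default (locale-neutral) case mappings of one character."""
    return {
        "name": unicodedata.name(ch),
        "upper": ch.upper(),
        "lower": ch.lower(),
        "casefold": ch.casefold(),
    }


table = {ch: describe(ch) for ch in LETTERS}

for ch, row in table.items():
    # ascii() escapes non-ASCII results so the code points stay visible.
    print(
        f"U+{ord(ch):04X} {row['name']}: "
        f"upper={ascii(row['upper'])} "
        f"lower={ascii(row['lower'])} "
        f"casefold={ascii(row['casefold'])}"
    )
```

    In this sketch, dotless i uppercases to plain capital I, yet its case
    folding leaves it unchanged, while capital I folds to dotted small i.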

    I'm sorry that nobody wants to admit it, but this is a security issue: it
    causes problems for applications that expect that, when two strings differ
    in a case-insensitive compare, converting them to lowercase, uppercase, or
    titlecase will preserve that difference.
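
    As a rough illustration of that failure mode, the sketch below compares
    two identifiers in two ways. It assumes Python 3; the function names
    fold_equal() and upper_equal() and the sample names are hypothetical, and
    str.casefold() merely stands in for whatever locale-neutral
    case-insensitive compare an application really uses.

```python
def fold_equal(a, b):
    """Case-insensitive compare using default case folding."""
    return a.casefold() == b.casefold()


def upper_equal(a, b):
    """Compare after converting both strings to uppercase."""
    return a.upper() == b.upper()


# Hypothetical user names: one with dotless i (U+0131), one with dotted i.
name_dotless = "k\u0131r"
name_dotted = "kir"

# The case-insensitive compare treats the names as distinct identifiers...
distinct_when_folded = not fold_equal(name_dotless, name_dotted)

# ...but uppercasing both maps them to the same string.
collide_when_uppercased = upper_equal(name_dotless, name_dotted)

print("distinct under case folding:", distinct_when_folded)
print("identical after uppercasing:", collide_when_uppercased)
```

    An application that registers both names as different users and later
    uppercases them (for display, or for a lookup key) would no longer be able
    to tell them apart.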

    This archive was generated by hypermail 2.1.5 : Tue Dec 16 2003 - 11:35:15 EST