Re: Case folding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Thu Jun 08 2006 - 19:48:10 CDT


    Mike wrote on Friday, June 09, 2006 at 12:02 AM

    > Some of the recent discussions have led me to
    > question my implementation of upper/lowercasing
    > and case folding. Currently I simply iterate
    > through a string exchanging characters with
    > their replacements. I don't first normalize
    > to any form, or do any reordering of combining
    > marks afterward.

    > My question is, should I be doing these things?

    Like a lot of things, it depends on why you are doing them. If your clients
    are dumb, Unicode-non-compliant processes that are only going to do binary
    comparison on the outputs, the only normalisation you should do is to make
    sure that when U+0345 COMBINING GREEK YPOGEGRAMMENI becomes an iota, it moves
    to the end of the sequence of non-zero combining class characters (and two
    Tibetan nasties) following. (The Unicode Standard is quite frankly unclear
    on this - it tells you what to do in uppercasing if you want the
    linguistically correct outcome, but leaves matters irritatingly vague for
    someone trying to implement conversions in strict compliance with the
    standard. Perhaps you should provide the option of a jobsworth
    interpretation and a linguistically correct one. There is a third option -
    not to convert the subscript to an iota, but that would be tailoring for a
    specific variety of Greek. This is not what you want to hear, and I had
    hoped to get some guidance from inner Unicode counsels before publicising
    the problem.) This assumes that your clients wish to make a distinction
    between one-character e-acute U+00E9 and the two-character e-acute <U+0065,
    U+0301>. Note however that case folding necessarily does some partial
    decomposition of composites.
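
    To make the reordering concrete, here is a minimal sketch in Python. The
    function name upper_iota_last is hypothetical, and the sketch makes
    simplifying assumptions: it only handles an explicit U+0345 in the input
    (not one hidden inside a precomposed letter such as U+1FB3), it ignores
    the Tibetan special cases, and it relies on Python's built-in str.upper()
    and str.casefold() for the actual mappings. It is one reading of the
    linguistically motivated behaviour, not something the standard spells out.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI


def upper_iota_last(text: str) -> str:
    """Uppercase text, first moving each explicit U+0345 behind the
    following run of characters with non-zero combining class."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == YPOGEGRAMMENI:
            # Find the end of the run of combining marks after U+0345.
            j = i + 1
            while j < len(text) and unicodedata.combining(text[j]) != 0:
                j += 1
            out.append(text[i + 1:j])   # the marks that followed U+0345
            out.append(YPOGEGRAMMENI)   # U+0345 now comes last
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out).upper()


# alpha, ypogegrammeni, acute (not in canonical order, but it can occur
# when the input has not been normalised).
naive = "\u03b1\u0345\u0301".upper()            # acute lands on the iota
moved = upper_iota_last("\u03b1\u0345\u0301")   # acute stays on the alpha

# Case folding decomposes some composites: U+0390 folds to three characters.
folded = "\u0390".casefold()
```

    With the naive conversion the acute ends up after the capital iota; with
    the reordering it stays attached to the capital alpha. Note also that
    one-character and two-character e-acute remain distinct after folding.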

    However, if your clients are going to interpret the sequences as text, it is
    helpful if you can provide the output in NFC or NFD as required - and
    occasionally NFKC and NFKD may be wanted. Too many processes demand NFC -
    it's being proposed as an extension of ASCII for some Internet
    applications - I presume for things like e-mail headers. For doing
    user-customised collation, NFD may be better, for the collation weightings
    are defined in terms of NFD.
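
    As a rough illustration of offering the normalisation form as an option,
    the sketch below uppercases and then normalises with Python's standard
    unicodedata module. The helper name upper_normalized and its form
    parameter are assumptions made for the example, not part of any existing
    API.

```python
import unicodedata


def upper_normalized(text: str, form: str = "NFC") -> str:
    """Uppercase text, then normalise to the requested form
    ("NFC", "NFD", "NFKC" or "NFKD")."""
    return unicodedata.normalize(form, text.upper())


# Two-character e-acute in, one-character capital E-acute out under NFC.
as_nfc = upper_normalized("e\u0301", "NFC")
# One-character e-acute in, two-character sequence out under NFD.
as_nfd = upper_normalized("\u00e9", "NFD")
# Compatibility forms additionally flatten characters such as the fi ligature.
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")
```

    A client that compares text rather than bytes can then be handed
    whichever form it expects.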

    One nasty practicality is that some fonts display canonically equivalent
    sequences differently. In such cases, the dumb approach may be best. If
    someone has painfully worked out the best way of expressing a
    diacritic-laden grapheme cluster for the fonts at hand, your best bet on
    changing its case would be to make as little change to its composition as
    possible. I rather suspect the best-displayed form of capital A with
    circumflex and dot below will be obtained with <U+00C2, U+0323>, even though
    that is neither NFC nor NFD - the NFC form is <U+1EAC>, but a font may very
    well not support it.
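
    The three spellings of that example can be checked with a short Python
    sketch using the standard unicodedata module. The variable names are
    only illustrative, and the "dumb" conversion shown is simply Python's
    str.lower() applied without any normalisation.

```python
import unicodedata

# Capital A with circumflex, followed by combining dot below.
hand_tuned = "\u00c2\u0323"

nfc = unicodedata.normalize("NFC", hand_tuned)  # single precomposed character
nfd = unicodedata.normalize("NFD", hand_tuned)  # base letter plus two marks

# The hand-tuned spelling is in neither normalisation form.
is_nfc = hand_tuned == nfc
is_nfd = hand_tuned == nfd

# Character-by-character case change keeps the original composition:
# small a with circumflex, followed by combining dot below.
lowered = hand_tuned.lower()
```

    Lowercasing this way leaves the author's choice of composition intact,
    whereas normalising first would replace it with U+1EAC (or its full
    decomposition) before the case change.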

    Richard.


