CLDR 1.5 beta/Unicode 5.0: character fallback substitutions

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 31 2007 - 18:18:37 CDT


    I see a strange sentence in the specification of the new "explicit"
    character fallback substitutions, defined in the "characters.xml"
    supplemental file of CLDR 1.5 beta. It says:

    "The recommended usage is that when a character value is not in the desired
    repertoire, the explicit substitutes from characters.xml are tested one by
    one against the repertoire, with the first substitute wholly in the
    repertoire being substituted for the value in the output. If no explicit
    substitute is found, then toNFC(value) is tried; if that fails then
    toNFKC(value) is tried."

    This definition seems to violate the current Unicode 5.0 rules, because
    explicit fallbacks (which are not canonically equivalent to the original
    character) would take precedence over NFC equivalents...

    Such a definition would mean that renderers need to be changed to try
    fallbacks BEFORE converting the string to NFC, which significantly
    complicates the implementation.

    I've looked at the current list of fallbacks, and in fact there is currently
    NO case where a character has both an explicit fallback and an NFC
    fallback.

    The only significant change in those fallbacks is that there are now better
    fallbacks than the NFKC compatibility equivalents. For example, numerical
    fractions have an explicit fallback with a SPACE before the NFKC
    equivalent, which works better for texts like "3<VULGAR FRACTION ONE HALF>":
    this would fall back to "31/2" using NFKC, instead of the better "3 1/2"
    with the explicit fallback.
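
    To make the fraction example concrete, here is a minimal Python sketch.
    The EXPLICIT table is a hypothetical one-entry excerpt that only assumes
    what is described above (a SPACE followed by "1/2"); it is not copied from
    characters.xml. Note that NFKC actually yields U+2044 FRACTION SLASH, which
    is shown here replaced by "/" for an ASCII display.

```python
import unicodedata

# "3" followed by U+00BD VULGAR FRACTION ONE HALF
text = "3\u00BD"

# NFKC maps U+00BD to "1", U+2044 FRACTION SLASH, "2"
nfkc = unicodedata.normalize("NFKC", text)
nfkc_ascii = nfkc.replace("\u2044", "/")  # illustrative ASCII rendering

# Hypothetical excerpt of the explicit fallback data (assumed, not the real file)
EXPLICIT = {"\u00BD": " 1/2"}
explicit = "".join(EXPLICIT.get(ch, ch) for ch in text)

print(repr(nfkc_ascii))  # the digits run together
print(repr(explicit))    # the space keeps "3" and "1/2" apart
```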

    So shouldn't this definition read as:

    "The recommended usage is that when a character value is not in the desired
    repertoire, then toNFC(value) is tried. If no NFC substitute is found, then
    the explicit substitutes from characters.xml are tested one by one against
    the repertoire, with the first substitute wholly in the repertoire being
    substituted for the value in the output; if that fails then toNFKC(value) is
    tried."
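
    The difference between the two orders can be sketched in Python as
    follows. This is only an illustration under stated assumptions: the
    function names, the repertoires and the EXPLICIT table are hypothetical,
    and the "A" entry for ANGSTROM SIGN is invented purely to show what would
    happen if a character ever had both an explicit and an NFC fallback (the
    current data has no such case).

```python
import unicodedata

def candidates(value, explicit, nfc_first):
    """List fallback candidates for `value` in the order they are tried."""
    nfc = [unicodedata.normalize("NFC", value)]
    subs = list(explicit.get(value, []))
    nfkc = [unicodedata.normalize("NFKC", value)]
    # nfc_first=True: order proposed in this message (NFC, explicit, NFKC)
    # nfc_first=False: order in the CLDR 1.5 beta text (explicit, NFC, NFKC)
    first = nfc + subs if nfc_first else subs + nfc
    return first + nfkc

def substitute(value, repertoire, explicit, nfc_first=True):
    """Return `value` or its first fallback wholly in `repertoire`, else None."""
    if all(ch in repertoire for ch in value):
        return value
    for cand in candidates(value, explicit, nfc_first):
        if all(ch in repertoire for ch in cand):
            return cand
    return None

ASCII = {chr(i) for i in range(0x80)}
LATIN1 = {chr(i) for i in range(0x100)}

# Hypothetical data: the U+212B entry is invented for illustration only.
EXPLICIT = {"\u00BD": [" 1/2"], "\u212B": ["A"]}

# U+212B ANGSTROM SIGN is canonically equivalent to U+00C5 (its NFC form).
print(repr(substitute("\u212B", LATIN1, EXPLICIT, nfc_first=True)))
print(repr(substitute("\u212B", LATIN1, EXPLICIT, nfc_first=False)))
```

    With the NFC-first order the canonically equivalent U+00C5 is kept; with
    the explicit-first order the hypothetical "A" would win and the canonical
    equivalence would be lost. For U+00BD against an ASCII repertoire both
    orders give the same result, since NFC leaves that character unchanged.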

    Are you defining this order to allow for possible future fallbacks, where
    it would be better to use a newer fallback than the current NFC
    substitutes (which can't be changed due to NFC stability)? If so, there's a
    need to change some of the requirements for Unicode 5.0 conformance (because
    this affects the character identity and the semantics), or the proposed new
    order should be just optional.

    For now, I see no justification (after looking at the proposed list) to
    change the order of resolution in a way that prefers breaking the canonical
    equivalence...

    ---
    I also see that the data currently proposes the string "PHP" (the ISO
    currency code for the Philippine Peso) as an explicit fallback for the PESO
    SIGN, but I'm not sure that the PESO SIGN is restricted to the Philippines.
    I think that the "Ps" fallback would be more appropriate.
    The same goes for the WON SIGN, which uses the explicit fallback "KRW" and
    so assumes the South Korean currency, when the WON SIGN is also used for
    the North Korean Won (KPW)... Here also, I think that the "W/" fallback
    would be more appropriate...
    Such currency symbol substitutes are locale-dependent, so if they must be
    kept, maybe they could be made localizable through CLDR locale data.
    Philippe.
    

