Re: Implementing NFC

From: Eric Muller (emuller@adobe.com)
Date: Sat Mar 17 2007 - 10:20:20 CST


    Daniel Ehrenberg wrote:
    > I'm just wondering, are there any other programming languages that
    > handle Unicode by storing strings in a consistently normalized form?
    I don't know of any, but you should realize that this comes at a
    functional cost.

    Consider writing a text editor and consider the Windows Vietnamese
    keyboard. Because of the layout of this keyboard, data entered with it
    is not in a normalized form; for example, ễ is entered by hitting two
    keystrokes, the first generating U+00EA ê LATIN SMALL LETTER E WITH
    CIRCUMFLEX, the second generating U+0303 ◌̃ COMBINING TILDE. Your
    approach means that the stored text is either <U+1EC5 ễ LATIN SMALL
    LETTER E WITH CIRCUMFLEX AND TILDE> (if you choose NFC) or <U+0065 e
    LATIN SMALL LETTER E, U+0302 ◌̂ COMBINING CIRCUMFLEX ACCENT, U+0303 ◌̃
    COMBINING TILDE> (if you choose NFD). In either case, the number of
    characters seen by the editor and the number of keystrokes do not match.
    If you want to build your editor so that <any key, delete> is a no-op,
    then you need to compensate for this mismatch, and in fact you need to
    have a detailed knowledge of the keyboard in your editor. This sounds a
    bit much to me.
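
    To make the mismatch concrete, here is a minimal sketch using Python's
    standard unicodedata module. The variable names are illustrative only,
    and treating "delete" as "drop the last code point" is an assumption
    about how a naive editor would behave.

```python
import unicodedata

# What the Windows Vietnamese keyboard produces for the letter:
# two keystrokes, two code points (U+00EA, then U+0303).
typed = "\u00ea\u0303"

# What a runtime that always normalizes would store instead.
nfc = unicodedata.normalize("NFC", typed)  # U+1EC5
nfd = unicodedata.normalize("NFD", typed)  # U+0065 U+0302 U+0303

print(len(typed), len(nfc), len(nfd))  # 2 1 3

# A naive "delete last code point" after the second keystroke:
after_delete_raw = typed[:-1]  # U+00EA: only the tilde keystroke is undone
after_delete_nfc = nfc[:-1]    # empty: the whole letter is gone
after_delete_nfd = nfd[:-1]    # U+0065 U+0302: right here, but only by luck
```

    With the NFC form, one press of delete undoes both keystrokes. With the
    NFD form it happens to work for this letter, but the stored length
    still differs from the keystroke count.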

    Normalization is also painful if you intend to support character sets
    other than Unicode and achieve that by using the
    round-trip capabilities of Unicode (the tenth design principle:
    "Accurate convertibility is guaranteed between the Unicode Standard and
    other widely accepted standards"). These round-trip capabilities are
    guaranteed only if data is not normalized on the way. The most obvious
    case is the CJK compatibility ideographs, which were encoded
    precisely for the purpose of round-tripping, yet disappear if
    normalization is applied.
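
    As an illustration, the sketch below round-trips one compatibility
    ideograph through a legacy encoding. The choice of EUC-KR (KS X 1001)
    and of U+F900 is an assumption made for the example; it relies on
    Python's built-in euc_kr codec and the unicodedata module.

```python
import unicodedata

# U+F900 is a CJK compatibility ideograph; U+8C48 is its unified
# counterpart. The legacy character set encodes them at different positions.
legacy_bytes = "\uf900".encode("euc_kr")
unified_bytes = "\u8c48".encode("euc_kr")

# Without normalization, the round trip is exact.
decoded = legacy_bytes.decode("euc_kr")
print(decoded.encode("euc_kr") == legacy_bytes)  # True

# Normalizing on the way replaces U+F900 with U+8C48 (NFD does the same),
# so converting back yields different bytes from the ones we started with.
normalized = unicodedata.normalize("NFC", decoded)
round_tripped = normalized.encode("euc_kr")
print(round_tripped == legacy_bytes)  # False
```

    After normalization the original legacy byte sequence can no longer be
    recovered from the Unicode text alone.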

    Personally, my rule of thumb (when building software) is to not
    normalize until explicitly asked by the user, or unless I know that the
    resulting data will have limited uses with which normalization does not
    interfere. The lower in the food chain my software is (and a general
    purpose programming language runtime is about as low as one can get) the
    more I follow this rule.
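
    One way to follow this rule, sketched here under the assumption that
    the only thing needed is an equivalence test: keep the stored text
    exactly as entered and normalize temporary copies at comparison time.
    The helper name canonically_equal is hypothetical.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence without altering them."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

stored = "\u00ea\u0303"  # kept exactly as the keyboard produced it
query = "\u1ec5"         # the precomposed form of the same letter

print(stored == query)                   # False: different code points
print(canonically_equal(stored, query))  # True: canonically equivalent
```

    The stored string keeps its original code points, so keystroke counts
    and legacy round-trips are unaffected.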

    Eric.
