Re: PRI#86 Update

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed May 10 2006 - 02:40:16 CDT

    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    > Philippe Verdy wrote on Tuesday, May 09, 2006 3:48 PM
    >
    >> A conforming application should then be free to reject texts containing
    >> codepoints that they still don't support in their builtin version of the
    >> UCD. If an application tolerates those texts, then they should not assume
    >> the stability of normalized forms, and so should better not apply any
    >> normalization, to keep the texts intact (this is a conforming behavior, as
    >> normalization of texts is not mandatory in conforming applications).
    >
    > Please give an example of how normalising text with an undefined character
    > can corrupt the text.

    I did not use the words "corrupt the text"; only you did, assuming that normalization instability means text corruption. That is wrong, except for editorial errors found in an older version of the Unicode standard itself, such as the few compatibility Han ideographs whose canonical mappings were corrected later in a corrigendum. That correction can be considered an instability, which explains why Unicode lists those characters explicitly.

    I just spoke about *normalization stability* (which is what the proposed PRI#86 actually explains more precisely): the normalization produced on a text that contains undefined characters is not stable until those characters are defined, simply because their combining class is still undefined, as are any canonical or compatibility equivalences they may later be part of.

    As long as the character is not defined, the normalizer cannot alter it, and needs to treat it as a base character with cc=0 and no compatibility or canonical equivalence. Nor is this character referenced within any equivalence mapping of another character (something that may happen later).

    So the normalizer will only work on the spans of defined characters before or after any undefined character, and will keep the undefined character in place between those spans.
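
    As a rough illustration of that behaviour, here is a minimal Python sketch. It assumes that "undefined" can be approximated by general category Cn in whatever UCD version the running interpreter was built with; the helper names is_unassigned and normalize_defined_spans are hypothetical and not taken from PRI#86 or from any library.

```python
import unicodedata


def is_unassigned(ch):
    # Assumption: "undefined" means general category Cn in the UCD
    # version bundled with this Python build.
    return unicodedata.category(ch) == "Cn"


def normalize_defined_spans(text, form="NFC"):
    """Normalize each run of defined characters separately and copy
    every undefined character through unchanged, in place."""
    out = []
    span = []
    for ch in text:
        if is_unassigned(ch):
            # Close the current span of defined characters.
            out.append(unicodedata.normalize(form, "".join(span)))
            span = []
            # The undefined character acts as a fixed boundary (cc=0).
            out.append(ch)
        else:
            span.append(ch)
    # Normalize whatever defined characters remain after the last boundary.
    out.append(unicodedata.normalize(form, "".join(span)))
    return "".join(out)


# U+0378 is assumed to be unassigned in the bundled UCD.
sample = "e\u0301" + "\u0378" + "a\u0300"
result = normalize_defined_spans(sample)
```

    With a UCD in which U+0378 is unassigned, the two spans around it compose independently and U+0378 itself stays where it was. Once a later UCD version assigns the character a non-zero combining class or a decomposition, the same input could normalize differently, which is the instability discussed above.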


