Re: Is it roundtripping or transfer-encoding

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 21 2004 - 09:48:46 CST


    From: Lars Kristan
    > OK, so it introduces a multiple representation of the same codepoints.
    > Every escaping technique does that. And it is not a problem. All you need
    > to do is define the normalization procedure. And use it where it applies.
    > In many cases its use is not even necessary. Specifically, a Unicode
    > system does not need to (and should not) normalize the escape codepoints.
    > The need for normalization only needs to be determined for an application
    > that uses the TES itself, and applies only in few cases.

    Please don't use the term "normalize" in this context. Normalization in
    Unicode transforms a stream of *code points* and is independent of their
    encoding form or encoding scheme. It is defined in terms of combining
    sequences, relying mostly on the "combining class" property of characters
    and on the character composition mapping property (plus some values of the
    "general category" property, so that control characters are taken into
    account when delimiting combining sequences).
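
    To illustrate the point about encodings, here is a minimal Python sketch
    (the helper name nfc_via and the sample string are mine, chosen only for
    illustration). It round-trips the same two code points through three
    encoding schemes and normalizes after decoding; the byte lengths differ,
    but the normalized result should not.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT (U+0301): two code points.
decomposed = "e\u0301"


def nfc_via(encoding: str, text: str) -> str:
    """Round-trip text through an encoding scheme, then normalize to NFC.

    Hypothetical helper: normalization is applied to the decoded code
    points, never to the encoded bytes.
    """
    raw = text.encode(encoding)
    return unicodedata.normalize("NFC", raw.decode(encoding))


encodings = ("utf-8", "utf-16-le", "utf-32-be")

# Same code points, different byte serializations.
byte_lengths = {enc: len(decomposed.encode(enc)) for enc in encodings}

# The NFC result is the single code point U+00E9 in every case.
results = {enc: nfc_via(enc, decomposed) for enc in encodings}
```

    The encoding scheme only changes how the code points are serialized;
    the normalization step never sees the bytes.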

    Unicode defines only four *standard* normalization forms (NFC, NFD, NFKC,
    NFKD), but other *non-standard* normalization forms are possible.

    Normalization transforms strings of abstract characters that should be
    considered "equivalent" for text processing into a single form. It matters
    most for the input text of such processes; applying it to their output text
    is optional and less important.

    Unicode defines two kinds of equivalence for encoded texts, each with
    "composed" and "decomposed" forms. "Canonical" equivalence (NFC or NFD, or
    the non-standard special decomposition form used on MacOS for HFS+ volumes)
    is important for some other important standards that depend on Unicode.
    "Compatibility" equivalence (NFKC, NFKD) is important only for fallback
    mechanisms, and compatibility mappings can lose some information present
    in the source text.
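
    As a rough sketch of the two equivalences, again in Python (the function
    names are hypothetical, and comparing decomposed forms is just one simple
    way to test equivalence): two strings are treated as canonically
    equivalent if their NFD forms match, and as compatibility equivalent if
    their NFKD forms match.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare the canonical decompositions (NFD) of both strings."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """Compare the compatibility decompositions (NFKD) of both strings."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# Composed and decomposed "e acute" are canonically equivalent.
same_accent = canonically_equivalent(composed, decomposed)

# The "fi" ligature matches plain "fi" only under compatibility equivalence.
ligature_canonical = canonically_equivalent(ligature, "fi")
ligature_compat = compatibility_equivalent(ligature, "fi")

# NFKC replaces the ligature with two letters; the distinction is lost.
folded = unicodedata.normalize("NFKC", ligature)
kept = unicodedata.normalize("NFC", ligature)
```

    The last two lines show the information loss mentioned above: after NFKC
    nothing records that the source contained a ligature, whereas NFC leaves
    it untouched.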


