Re: Comment on PRI 98: IVD Adobe-Japan1 (pt.2)

From: Richard Wordingham
Date: Mon Mar 26 2007 - 17:48:30 CST

    Doug Ewell wrote on Monday, March 26, 2007 2:55 PM

    > Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

    >> It would be wrong for an application implicitly claiming not to change
    >> the text to strip variation selectors out of ideographic variation
    >> sequences without so much as a by-your-leave. (By contrast, normalisation
    >> does not change the text for Unicode-compliant processes - some
    >> round-tripping is inherently not Unicode-compliant.)

    > This doesn't sound right to me. Normalization is all about changing one
    > character or sequence to another.

    It boils down to the interpretation of conformance clauses C6 and C7:

    'C6: A process shall not assume that the interpretations of two
    canonical-equivalent character sequences are distinct.'

    'C7: When a process purports not to modify the interpretation of a valid
    coded character sequence, it shall make no change to that character sequence
    other than the possible replacement of character sequences by their
    canonical-equivalent sequences or the deletion of noncharacter code points.'

    There was an inconclusive discussion about it in late 2003, referred to in
    UTN 14, back when the clauses were C9 and C10, on the topic of whether
    compressing text by converting it to NFC constituted a change to the text.
    A significant argument was that Unicode-encoded text would often be used by
    processes that were not 'Unicode-compliant' - more precisely C6-compliant.
    (And Unicode-compliant default upper-casing - Clause C20 - is not quite
    compliant with Clause C6, though the default upper-casing seems to be wrong
    anyway for all the cases of discrepancy I can assign a plausible meaning to.)

    > -- especially if compatibility normalization (NFKC or NFKD) is involved.

    A red herring. The explanation of C7 states, 'Replacement of a character
    sequence by a compatibility-equivalent sequence _does_ modify the
    interpretation of the text.'
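    The distinction between the two kinds of equivalence can be checked
    mechanically. The following is a minimal sketch using Python's standard
    unicodedata module; the helper names canonically_equivalent and
    compatibility_equivalent are made up for illustration and are not part of
    any standard API.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
ligature = "\uFB01"      # LATIN SMALL LIGATURE FI
plain = "fi"

# C7 permits swapping composed for decomposed: they are canonical equivalents.
print(canonically_equivalent(composed, decomposed))
# The ligature and "fi" are only compatibility equivalents, so replacing one
# by the other counts as modifying the interpretation of the text.
print(canonically_equivalent(ligature, plain))
print(compatibility_equivalent(ligature, plain))
```

    Only the first pair may be interchanged by a process that purports not to
    modify the interpretation of the text.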

    A key point is that C6-compliant processes cannot care whether the data has
    been transformed in a manner preserving canonical equivalence with the
    original. Round-trip conversion is not a C6-compliant process if it relies
    on compatibility characters with canonical decompositions - so nor is a
    renderer that respects the differences between CJK compatibility ideographs
    and their singleton decompositions. CJK compatibility ideographs serve no
    useful purpose if they are only interpreted by Unicode-compliant processes!
    This immediately and unfortunately implies that if:

    1) Round-trip conversion from a 'legacy' character set required CJK
    compatibility ideographs before the advent of IVS;
    2) One does not use mark-up to preserve the distinctions lost in normalised
    Unicode; and
    3) One intends to display the text using Unicode-compliant processes,

    then IVS is the only way to preserve the graphic distinctions.
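    As a rough illustration of the argument above, the sketch below uses
    Python's unicodedata module to show a CJK compatibility ideograph being
    replaced by its singleton decomposition under normalisation, while a
    sequence of unified ideograph plus variation selector passes through
    unchanged. U+FA10 and its decomposition U+585A come from the Unicode
    Character Database; pairing U+585A with U+E0100 is chosen purely for
    illustration, and nothing here asserts that this particular sequence is
    registered in the IVD.

```python
import unicodedata

COMPAT = "\uFA10"            # CJK COMPATIBILITY IDEOGRAPH-FA10
UNIFIED = "\u585A"           # its canonical (singleton) decomposition
VS17 = "\U000E0100"          # VARIATION SELECTOR-17
IVS_EXAMPLE = UNIFIED + VS17  # illustrative sequence only (assumption)

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Every normalisation form folds the compatibility ideograph away ...
compat_results = {f: unicodedata.normalize(f, COMPAT) for f in FORMS}
# ... but leaves the base + variation selector sequence intact.
ivs_results = {f: unicodedata.normalize(f, IVS_EXAMPLE) for f in FORMS}

for form in FORMS:
    print(form,
          [f"U+{ord(c):04X}" for c in compat_results[form]],
          [f"U+{ord(c):04X}" for c in ivs_results[form]])
```

    The compatibility ideograph does not survive any of the four forms, so a
    distinction carried only by it is lost to any process that normalises.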

    >> For a file consisting mostly of CJK text, appending U+E0100 to every
    >> unified ideograph would bloat the UTF-16 storage requirement from
    >> typically one code unit per character to typically three code units per
    >> character! Doug Ewell's survey of Unicode compression (
    >> ) rather suggests that many standard
    >> compression techniques would not counteract such bloat effectively.
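    The arithmetic behind the quoted 'one code unit to three' figure can be
    sketched as follows. U+E0100 lies outside the BMP, so it costs a surrogate
    pair in UTF-16. The helper names are hypothetical, and the ideograph test
    covers only the basic CJK Unified Ideographs block (U+4E00..U+9FFF), which
    is a simplifying assumption.

```python
VS17 = "\U000E0100"  # VARIATION SELECTOR-17, a supplementary-plane character


def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def is_basic_unified_ideograph(ch: str) -> bool:
    # Simplification: only the basic block, not the extensions.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def append_selector(text: str, selector: str = VS17) -> str:
    """Append selector after every (basic-block) unified ideograph."""
    return "".join(ch + selector if is_basic_unified_ideograph(ch) else ch
                   for ch in text)


plain = "\u6F22\u5B57"            # two unified ideographs
bloated = append_selector(plain)

print(utf16_units(plain), utf16_units(bloated))
```

    Each ideograph takes one code unit on its own and three once the selector
    is appended: one for the base and two for the surrogate pair.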

    > This is true for compression techniques that operate on one code point at
    > a time, such as SCSU and BOCU and Huffman coding. It may not be true for
    > dictionary-based techniques like LZ.

    LZ77 performs about 20% better on SCSU-compressed text from small alphabets
    than on the text in UTF-16. I will agree that compressors using the
    Burrows-Wheeler algorithm will probably counteract the bloat very effectively.
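    For anyone wanting to measure this rather than argue about it, here is a
    rough experiment using only Python's standard library: zlib (DEFLATE, which
    is LZ77-based) and bz2 (which is Burrows-Wheeler-based). The sample text is
    synthetic, drawn pseudo-randomly from an arbitrary 500-ideograph
    vocabulary, so it is an assumption standing in for real CJK text and the
    figures it yields should not be taken as representative. SCSU is not in
    the standard library and is not measured here.

```python
import bz2
import random
import zlib

VS17 = "\U000E0100"  # VARIATION SELECTOR-17


def sample_text(n: int = 5000, seed: int = 0) -> str:
    """Synthetic stand-in for CJK text (assumption, not real prose)."""
    rng = random.Random(seed)
    vocab = [chr(cp) for cp in range(0x4E00, 0x4E00 + 500)]
    return "".join(rng.choice(vocab) for _ in range(n))


def sizes(text: str) -> dict:
    """Byte counts for raw UTF-16 and for two general-purpose compressors."""
    raw = text.encode("utf-16-le")
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 9)),  # LZ77 + Huffman
        "bz2": len(bz2.compress(raw, 9)),    # Burrows-Wheeler based
    }


plain = sample_text()
bloated = "".join(ch + VS17 for ch in plain)  # selector after every ideograph

report = {"plain": sizes(plain), "bloated": sizes(bloated)}
for name, row in report.items():
    print(name, row)
```

    Comparing the 'plain' and 'bloated' rows for each compressor shows how much
    of the threefold raw increase each one claws back on this particular input.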

    > The question of how desirable it is to append a variation selector to
    > every character in the first place is perhaps more generally interesting.

    Which is why I chose the evaluative term 'bloat'.

