Re: Comment on PRI 98: IVD Adobe-Japan1 (pt.2)

From: Doug Ewell
Date: Mon Mar 26 2007 - 07:55:50 CST

    Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

    > It would be wrong for an application implicitly claiming not to change
    > the text to strip variation selectors out of ideographic selectors
    > without any by your leave. (By contrast, normalisation does not
    > change the text for Unicode-compliant processes - some round-tripping
    > is inherently not Unicode-compliant.)

    This doesn't sound right to me. Normalization is all about changing one
    character or sequence to another. A Unicode-compliant process is not
    supposed to assume that two canonical-equivalent sequences will be
    treated differently, but that is not the same as saying the text has not
    changed -- especially if compatibility normalization (NFKC or NFKD) is
    involved.
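
    To make the distinction concrete, here is a minimal Python sketch using
    the standard unicodedata module. The sample strings and the helper name
    code_points are illustrative choices, not anything taken from the thread.
    It shows that canonical normalization can replace one code point sequence
    with a different one, and that compatibility normalization can go further
    and replace a character with ones that are not canonically equivalent.

```python
import unicodedata


def code_points(s):
    """Return the code points of s in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in s]


# Canonical normalization: "e" followed by COMBINING ACUTE ACCENT composes to
# the single code point U+00E9. The two forms are canonically equivalent, but
# the stored sequence of code points is no longer the same.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# Compatibility normalization: LATIN SMALL LIGATURE FI is left alone by NFC
# but is replaced by the two letters "f" and "i" under NFKC.
ligature = "\uFB01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(code_points(decomposed), "->", code_points(composed))
print(code_points(ligature), "->", code_points(nfkc_ligature))
```

    A process that treats canonically equivalent strings alike will not care
    about the first change, but a byte-for-byte or code-point-for-code-point
    comparison will see a difference in both cases.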

    > On the other hand, it might not be unreasonable for an application to
    > compress such text by transferring the information in the variation
    > selectors to a 'higher level protocol'. For a file consisting mostly
    > of CJK text, appending U+E0100 to every unified ideograph would bloat
    > the UTF-16 storage requirement from typically one code unit per
    > character to typically three code units per character! Doug Ewell's
    > survey of Unicode compression ( )
    > rather suggests that many standard compression techniques would not
    > counteract such bloat effectively.

    This is true for compression techniques that operate on one code point
    at a time, such as SCSU and BOCU and Huffman coding. It may not be true
    for dictionary-based techniques like LZ. The question of how desirable
    it is to append a variation selector to every character in the first
    place is perhaps more generally interesting.
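
    The storage arithmetic in the quoted paragraph can be checked with a
    short Python sketch. The sample text is synthetic (1000 distinct BMP
    ideographs starting at U+4E00), and the helper names utf16_units and
    deflate_size are made up for illustration. zlib's DEFLATE stands in for
    "a dictionary-based technique like LZ" only because it ships with
    Python; it is not SCSU or BOCU-1, and the sizes it prints for this
    artificial sample say nothing definite about real CJK text.

```python
import zlib

VS17 = "\U000E0100"  # VARIATION SELECTOR-17, a supplementary-plane character


def utf16_units(s):
    """Number of UTF-16 code units needed to store s (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


def deflate_size(s):
    """Size in bytes of the UTF-16LE form of s after zlib compression."""
    return len(zlib.compress(s.encode("utf-16-le"), 9))


# Synthetic sample: 1000 distinct unified ideographs, all in the BMP.
text = "".join(chr(0x4E00 + (i * 37) % 2000) for i in range(1000))

# The same text with U+E0100 appended to every ideograph.
text_with_vs = "".join(c + VS17 for c in text)

# Each ideograph takes one code unit; each selector adds a surrogate pair.
print("UTF-16 code units, plain:  ", utf16_units(text))
print("UTF-16 code units, with VS:", utf16_units(text_with_vs))

# Sizes after DEFLATE, printed for comparison rather than asserted.
print("zlib bytes, plain:  ", deflate_size(text))
print("zlib bytes, with VS:", deflate_size(text_with_vs))
```

    The code unit counts confirm the one-to-three ratio. How much of that a
    given compressor wins back depends on the scheme and on the text, which
    is why the sketch prints the compressed sizes instead of claiming them.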

    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14