Date: Fri Mar 23 2007 - 20:36:48 CST
Thank you, Richard, for stating things very clearly and for clarifying for
me what "default ignorable" means. My apologies for using "normalisation"
in a very broad sense.
Many processes do change text whilst seeking to preserve what is
considered important. The examples that spring to my mind first are some
e-mail sending set-ups, or GET in an HTML form.
Quoting Richard Wordingham <firstname.lastname@example.org>:
> John Knightley <email@example.com> wrote on Wednesday, March 21,
> 2007 3:29 PM
>> Quoting Eric Muller <firstname.lastname@example.org>:
>>> Not "normalization" proper, but rather "removal of default ignorable".
>>> That second operation is vastly more unlikely than normalization. For
>>> example, the W3C recommends the (early) normalization of XML documents
>>> but they certainly don't advocate that default ignorable be removed.
>> Since these are only recommendations, this could happen in either
>> case and still be 100% Unicode compliant. Which means one still
>> cannot have one's cake and eat it.
> Blanket removal of default ignorable characters is a transformation of
> the text, as it would strip out CGJ, ZWJ, ZWNJ, WJ, ZWSP and bidi
> controls, and is 'Unicode compliant' in the same way as case folding
> can be. (Normalising to NFD and then replacing every base character by
> 'x' and removing the rest is also a Unicode-compliant process.) Being
> 'default ignorable' means that in rendering the character can be
> ignored if the application does not support it; it does not mean that
> it can be dropped when text is transformed. It would be wrong for an
> application implicitly claiming not to change the text to strip
> variation selectors out of ideographic variation sequences without any
> by-your-leave. (By contrast, normalisation does not change the text for
> Unicode-compliant processes - some round-tripping is inherently not
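To make the distinction above concrete, here is a minimal Python sketch. It assumes a hand-picked subset of default ignorable code points (the name DEFAULT_IGNORABLE_SAMPLE and the helper strip_sample_ignorables are hypothetical, and the set is not the full Default_Ignorable_Code_Point property). It only illustrates that stripping such characters yields a different string, whereas NFC normalisation leaves a variation sequence intact.

```python
import unicodedata

# Illustrative subset only; NOT the complete Default_Ignorable_Code_Point set.
DEFAULT_IGNORABLE_SAMPLE = (
    {0x034F, 0x200B, 0x200C, 0x200D, 0x2060}  # CGJ, ZWSP, ZWNJ, ZWJ, WJ
    | {0x200E, 0x200F}                        # LRM, RLM
    | set(range(0x202A, 0x202F))              # bidi embeddings/overrides
    | set(range(0x2066, 0x206A))              # bidi isolates
    | set(range(0xFE00, 0xFE10))              # VS1..VS16
    | set(range(0xE0100, 0xE01F0))            # VS17..VS256
)


def strip_sample_ignorables(text):
    """Drop every character whose code point is in the sample set above."""
    return "".join(ch for ch in text if ord(ch) not in DEFAULT_IGNORABLE_SAMPLE)


# An ideograph followed by VARIATION SELECTOR-17 (U+E0100).
ivs = "\u8FBB\U000E0100"
# A Persian word that contains ZWNJ (U+200C).
persian = "\u0646\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645"

stripped_ivs = strip_sample_ignorables(ivs)          # selector is gone
stripped_persian = strip_sample_ignorables(persian)  # ZWNJ is gone

# Normalisation keeps the variation sequence as it is.
nfc_ivs = unicodedata.normalize("NFC", ivs)

# Canonically equivalent spellings normalise to the same string.
same_after_nfd = (
    unicodedata.normalize("NFD", "\u00E9")
    == unicodedata.normalize("NFD", "e\u0301")
)
```

Under these assumptions, stripped_ivs no longer equals ivs, while nfc_ivs does, which is the sense in which stripping is a transformation of the text and normalisation is not.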
> On the other hand, it might not be unreasonable for an application to
> compress such text by transferring the information in the variation
> selectors to a 'higher level protocol'. For a file consisting mostly
> of CJK text, appending U+E0100 to every unified ideograph would bloat
> the UTF-16 storage requirement from typically one code unit per
> character to typically three code units per character! Doug Ewell's
> survey of Unicode compression ( http://www.unicode.org/notes/tn14/ )
> rather suggests that many standard compression techniques would not
> counteract such bloat effectively.
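The storage arithmetic in the quoted paragraph can be checked with a short Python sketch. The sample string and the helper name utf16_code_units are assumptions made for illustration; the sketch appends U+E0100 to every character of a string that consists only of BMP unified ideographs and counts UTF-16 code units before and after.

```python
def utf16_code_units(text):
    """Count UTF-16 code units (2 bytes each, no BOM) needed for text."""
    return len(text.encode("utf-16-le")) // 2


VS17 = "\U000E0100"  # VARIATION SELECTOR-17, a supplementary-plane character

# Hypothetical sample: four BMP unified ideographs, one code unit each.
plain = "\u6F22\u5B57\u6587\u7AE0"

# Append the selector to every ideograph, as in the scenario described.
with_selectors = "".join(ch + VS17 for ch in plain)

units_plain = utf16_code_units(plain)
units_with_selectors = utf16_code_units(with_selectors)
ratio = units_with_selectors / units_plain
```

Each ideograph takes one code unit and U+E0100 takes a surrogate pair, so each character grows from one code unit to three, matching the figure quoted.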