Re: Comment on PRI 98: IVD Adobe-Japan1 (pt.2)

From: [email protected]
Date: Fri Mar 23 2007 - 20:36:48 CST

Next message: [email protected]: "Re: Encoding Pronunciation (was: Comment on PRI 98: IVD Adobe-Japan1 (pt.2))"

Previous message: Richard Wordingham: "Re: Comment on PRI 98: IVD Adobe-Japan1 (pt.2)"
In reply to: Richard Wordingham: "Re: Comment on PRI 98: IVD Adobe-Japan1 (pt.2)"
Next in thread: Doug Ewell: "Re: Comment on PRI 98: IVD Adobe-Japan1 (pt.2)"

Thank you, Richard, for stating things very clearly and for clarifying
for me what "default ignorable" means. My apologies for using
"normalisation" in a very broad sense.

Many processes do change text whilst seeking to preserve what is
considered important; the examples that spring to my mind first are
some e-mail sending set-ups, or GET in an HTML form.

John Knightley

Quoting Richard Wordingham <[email protected]>:

> John Knightley <[email protected]> wrote on Wednesday, March 21,
> 2007 3:29 PM
>
>> Quoting Eric Muller <[email protected]>:
>
>>> Not "normalization" proper, but rather "removal of default ignorable".
>>> That second operation is vastly more unlikely than normalization. For
>>> example, the W3C recommends the (early) normalization of XML documents
>>> but they certainly don't advocate that default ignorable be removed.
>
>> Since these are only recommendations, this could happen in either
>> case and still be 100% Unicode compliant, which means one still
>> cannot have one's cake and eat it.
>
> Blanket removal of default ignorable characters is a transformation of
> the text, as it would strip out CGJ, ZWJ, ZWNJ, WJ, ZWSP and bidi
> controls, and is 'Unicode compliant' in the same way as case folding
> can be. (Normalising to NFD and then replacing every base character by
> 'x' and removing the rest is also a Unicode-compliant process.) Being
> 'default ignorable' means that in rendering the character can be
> ignored if the application does not support it; it does not mean that
> it can be dropped when text is transformed. It would be wrong for an
> application implicitly claiming not to change the text to strip
> variation selectors out of ideographic variation sequences without so
> much as a by-your-leave. (By contrast, normalisation does not change the text for
> Unicode-compliant processes - some round-tripping is inherently not
> Unicode-compliant.)
>
> On the other hand, it might not be unreasonable for an application to
> compress such text by transferring the information in the variation
> selectors to a 'higher level protocol'. For a file consisting mostly
> of CJK text, appending U+E0100 to every unified ideograph would bloat
> the UTF-16 storage requirement from typically one code unit per
> character to typically three code units per character! Doug Ewell's
> survey of Unicode compression ( http://www.unicode.org/notes/tn14/ )
> rather suggests that many standard compression techniques would not
> counteract such bloat effectively.
>
> Richard.
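
The "three code units" arithmetic, and the point that stripping variation
selectors changes the text, can be checked with a short Python sketch. The
sample string, the helper names, and the restriction to the U+4E00..U+9FFF
block are assumptions made for illustration; none of them come from the
thread, and this is not a full test for "unified ideograph" or for
"default ignorable".

```python
# Illustrative sketch only: sample text and helper names are assumptions.

VS17 = "\U000E0100"  # VARIATION SELECTOR-17, first selector in the supplementary range


def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units (two bytes each, no byte order mark)."""
    return len(text.encode("utf-16-le")) // 2


def append_selector(text: str, selector: str = VS17) -> str:
    """Append a variation selector after every character in U+4E00..U+9FFF.

    Assumption: that single block stands in for "unified ideograph" here.
    """
    return "".join(
        ch + selector if "\u4e00" <= ch <= "\u9fff" else ch for ch in text
    )


def strip_variation_selectors(text: str) -> str:
    """Drop U+FE00..U+FE0F and U+E0100..U+E01EF. This changes the text."""
    return "".join(
        ch
        for ch in text
        if not ("\ufe00" <= ch <= "\ufe0f" or "\U000e0100" <= ch <= "\U000e01ef")
    )


plain = "葛飾区"                  # three BMP ideographs, one code unit each
tagged = append_selector(plain)   # each ideograph now followed by U+E0100

print(utf16_code_units(plain))    # code units without selectors
print(utf16_code_units(tagged))   # code units with a selector per ideograph
print(strip_variation_selectors(tagged) == plain)  # stripping loses the distinction
```

Each BMP ideograph takes one UTF-16 code unit and U+E0100 takes a surrogate
pair, so every tagged ideograph costs three code units. The last line shows
that the stripped string compares equal to the untagged one, which is why
such stripping is a transformation rather than a no-op.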


This archive was generated by hypermail 2.1.5 : Fri Mar 23 2007 - 20:40:29 CST