From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon Mar 26 2007 - 17:48:30 CST
Doug Ewell wrote on Monday, March 26, 2007 2:55 PM
> Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
>> It would be wrong for an application implicitly claiming not to change 
>> the text to strip variation selectors out of ideographic variation sequences 
>> without any by your leave.  (By contrast, normalisation does not change 
>> the text for Unicode-compliant processes - some round-tripping is 
>> inherently not Unicode-compliant.)
> This doesn't sound right to me.  Normalization is all about changing one 
> character or sequence to another.
It boils down to the interpretation of conformance clauses C6 and C7:
'C6: A process shall not assume that the interpretations of two 
canonical-equivalent character sequences are distinct.'
'C7: When a process purports not to modify the interpretation of a valid 
coded character sequence, it shall make no change to that character sequence 
other than the possible replacement of character sequences by their 
canonical-equivalent sequences or the deletion of noncharacter code points.'
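As a rough illustration of what C6 asks of a process, here is a minimal Python sketch that compares strings by canonical equivalence rather than by code point. The function name canonically_equivalent is my own invention, not anything from the standard; it just normalises both sides to NFD with the standard unicodedata module.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b are canonically equivalent."""
    # Two strings are canonically equivalent exactly when their
    # canonical decompositions (NFD) are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                        # False
print(canonically_equivalent(precomposed, decomposed))  # True
```

A process that branches on the first comparison is, on my reading, assuming the two interpretations are distinct; one that uses only the second is not.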
There was an inconclusive discussion about it in late 2003, referred to in 
UTN 14, back when the clauses were C9 and C10, on the topic of whether 
compressing text by converting it to NFC constituted a change to the text. 
A significant argument was that Unicode-encoded text would often be used by 
processes that were not 'Unicode-compliant' - more precisely C6-compliant. 
(And Unicode-compliant default upper-casing - Clause C20 - is not quite 
compliant with Clause C6, though the default upper-casing seems to be wrong 
anyway for all the cases of discrepancy I can assign a plausible meaning 
to.)
> -- especially if compatibility normalization (NFKC or NFKD) is involved.
A red herring.  The explanation of C7 states, 'Replacement of a character 
sequence by a compatibility-equivalent sequence _does_ modify the 
interpretation of the text.'
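To make the canonical/compatibility distinction concrete, here is a small Python sketch using the standard unicodedata module. The choice of U+FB01 as the example character is mine; any compatibility character without a canonical decomposition would do.

```python
import unicodedata

ligature = "\uFB01"  # LATIN SMALL LIGATURE FI: compatibility decomposition only

nfc = unicodedata.normalize("NFC", ligature)
nfkc = unicodedata.normalize("NFKC", ligature)

# Canonical normalisation leaves the ligature alone, so under C7 it is a
# permitted replacement (here, a no-op).
print(nfc == ligature)   # True

# Compatibility normalisation replaces it with two plain letters, which
# the explanation of C7 counts as modifying the interpretation.
print(nfkc == "fi")      # True
```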
A key point is that C6-compliant processes cannot care whether the data has 
been transformed in a manner preserving canonical equivalence with the 
original.  Round-trip conversion is not a C6-compliant process if it relies 
on compatibility characters with canonical decompositions - so nor is a 
renderer that respects the differences between CJK compatibility ideographs 
and their singleton decompositions.  CJK compatibility ideographs serve no 
useful purpose if they are only interpreted by Unicode-compliant processes! 
This immediately and unfortunately implies that if:
1) Round-trip conversion from a 'legacy' character set required CJK 
compatibility ideographs before the advent of IVS;
2) One does not use mark-up to preserve the distinctions lost in normalised 
Unicode; and
3) One intends to display the text using Unicode-compliant processes,
then IVS is the only way to preserve the graphic distinctions.
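The effect can be sketched in Python with the standard unicodedata module. U+FA10 is a CJK compatibility ideograph whose singleton canonical decomposition is U+585A. The sequence <U+585A, U+E0100> is used purely for illustration; I have not checked whether that particular sequence is registered, only that variation selectors survive normalisation.

```python
import unicodedata

COMPAT = "\uFA10"      # CJK COMPATIBILITY IDEOGRAPH-FA10
UNIFIED = "\u585A"     # its singleton canonical decomposition
VS17 = "\U000E0100"    # VARIATION SELECTOR-17 (illustrative choice)
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def survives_normalisation(s: str) -> bool:
    """Hypothetical helper: True if s is unchanged by all four forms."""
    return all(unicodedata.normalize(form, s) == s for form in FORMS)

# The compatibility ideograph is folded to the unified ideograph, even
# by NFC, because singleton decompositions are never recomposed.
print(survives_normalisation(COMPAT))                   # False
print(unicodedata.normalize("NFC", COMPAT) == UNIFIED)  # True

# A unified ideograph followed by a variation selector passes through.
print(survives_normalisation(UNIFIED + VS17))           # True
```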
>> For a file consisting mostly of CJK text, appending U+E0100 to every 
>> unified ideograph would bloat the UTF-16 storage requirement from 
>> typically one code unit per character to typically three code units per 
>> character!  Doug Ewell's survey of Unicode compression ( 
>> http://www.unicode.org/notes/tn14/ ) rather suggests that many standard 
>> compression techniques would not counteract such bloat effectively.
> This is true for compression techniques that operate on one code point at 
> a time, such as SCSU and BOCU and Huffman coding.  It may not be true for 
> dictionary-based techniques like LZ.
LZ77 performs about 20% better on SCSU-compressed text from small alphabets 
than on the text in UTF-16.  I will agree that compressors using the 
Burrows-Wheeler algorithm will probably counteract the bloat very 
effectively.
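For the record, the factor of three quoted above can be checked with a few lines of Python. This is only a sketch: the four-character sample string is an arbitrary choice of mine, and the helper name utf16_code_units is hypothetical. It counts code units only and says nothing about how well any compressor copes.

```python
def utf16_code_units(s: str) -> int:
    """Hypothetical helper: number of 16-bit code units in UTF-16."""
    # utf-16-le emits no byte-order mark, so bytes / 2 = code units.
    return len(s.encode("utf-16-le")) // 2

VS17 = "\U000E0100"                # supplementary plane: a surrogate pair
text = "\u6F22\u5B57\u6587\u672C"  # four BMP unified ideographs (sample)

# Append the variation selector to every character; the sample contains
# unified ideographs only, so no filtering is needed here.
tagged = "".join(ch + VS17 for ch in text)

print(utf16_code_units(text))    # 4
print(utf16_code_units(tagged))  # 12
```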
> The question of how desirable it is to append a variation selector to 
> every character in the first place is perhaps more generally interesting.
Which is why I chose the evaluative term 'bloat'.
Richard. 