From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Apr 29 2004 - 19:56:31 EDT
Peter Kirk wrote, in response to Ernest Cline:
> >... It simply is impossible
> >to simulate non-zero canonical combining class characters in Unicode
> >with anything other than a character with the appropriate canonical
> >combining class. ...
> >
>
> True. But fortunately Unicode doesn't really need to worry about
> normalisation of PUA data, as this is surely out of its scope.
Not quite. PUA code points are subject to the Unicode normalization
algorithm, just like any other code points. Their behavior in NFC
or NFD, for example, is rigidly defined, if trivial: a PUA code point
normalizes to itself.
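
To make the "normalizes to itself" point concrete, here is a minimal
sketch using Python's standard unicodedata module. The helper name
pua_is_stable is made up for this illustration; U+E000 is simply the
first code point of the BMP Private Use Area.

```python
import unicodedata

def pua_is_stable(text):
    """Return True if every Unicode normalization form leaves text unchanged."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, text) == text for form in forms)

# U+E000 has general category Co (private use) and no decomposition,
# so normalization has nothing to do with it.
print(unicodedata.category("\ue000"))
print(pua_is_stable("\ue000"))
```

A non-PUA string such as "\u00e9" would fail the same check, since NFD
decomposes it into two code points.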
That doesn't prevent a user of PUA code points, defined as
whatever characters, from also defining equivalences among
PUA characters (or sequences) or between PUA characters
(or sequences) and standard Unicode characters (or sequences),
and then "normalizing" data based on those equivalences.
Any such normalization would *not*, however, be Unicode
normalization, as defined in UAX #15. Such equivalences, which
depend on interpretations of PUA code points that are the
result of private agreement, are themselves subject to
private agreement.
So, to specify it in terms of actual code points, and using
my previous example, if I define:
U+E000 = SNOREFRED LOGO CHARACTER (glyph: an elmtree with two chipmunks)
U+E001 = ELMTREE SYMBOL
U+E002 = COMBINING CHIPMUNK
There is nothing that would prevent me from asserting that
<E000> is equivalent to <E001, E002, E002>, under my interpretation
of what those characters mean.
On the other hand, I could not expect any software doing
Unicode normalization to pay any attention to *my* interpretation
of those equivalences, and if I really wanted to process data
using such equivalences, it would be up to me to write the
software to do so.
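
As a rough sketch of what "writing the software" might look like, the
following Python applies the private equivalence from the example above
on top of (not as part of) Unicode normalization. The table and the
function names private_decompose and private_compose are hypothetical,
and the code assumes the private agreement consists of that single
equivalence.

```python
import unicodedata

# Private-agreement assignments taken from the example above.
SNOREFRED_LOGO = "\ue000"
ELMTREE_SYMBOL = "\ue001"
COMBINING_CHIPMUNK = "\ue002"

# Hypothetical private equivalence table: <E000> == <E001, E002, E002>.
PRIVATE_DECOMPOSITIONS = {
    SNOREFRED_LOGO: ELMTREE_SYMBOL + COMBINING_CHIPMUNK * 2,
}

def private_decompose(text):
    """Expand each privately defined composite into its agreed sequence."""
    return "".join(PRIVATE_DECOMPOSITIONS.get(ch, ch) for ch in text)

def private_compose(text):
    """Collapse each agreed sequence back into its private composite."""
    for composite, sequence in PRIVATE_DECOMPOSITIONS.items():
        text = text.replace(sequence, composite)
    return text

# Unicode normalization itself knows nothing about the private agreement.
assert unicodedata.normalize("NFD", SNOREFRED_LOGO) == SNOREFRED_LOGO
assert private_decompose(SNOREFRED_LOGO) == "\ue001\ue002\ue002"
```

Two strings can then be compared under the private agreement by running
both through private_decompose first; any standard normalizer in the
pipeline will pass the PUA code points through untouched.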
> So it
> would not be fatal to use class 0 combining characters in PUA scripts,
> and leave to the user any possible burden of ensuring that multiple
> combining marks are correctly ordered. It would be enough to indicate
> that a particular character is, or is currently being used as, a
> combining mark.
Correct. And to ensure that any font that you designed handled
it correctly.
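
The ordering burden can be seen in a small Python sketch. Standard
combining marks with different non-zero combining classes are reordered
by NFD, while PUA code points have combining class 0 and stay exactly
where the user put them. U+E003 is used here as a second, purely
hypothetical private combining mark; it is not part of the example above.

```python
import unicodedata

def nfd(text):
    """Apply Unicode canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", text)

# Standard marks: COMBINING ACUTE ACCENT (class 230) typed before
# COMBINING DOT BELOW (class 220). NFD puts the lower class first.
standard = "a\u0301\u0323"
print(unicodedata.combining("\u0301"), unicodedata.combining("\u0323"))
print(nfd(standard) == "a\u0323\u0301")

# PUA "marks": class 0, so two different orders are never unified.
order_one = "\ue001\ue002\ue003"
order_two = "\ue001\ue003\ue002"
print(unicodedata.combining("\ue002"))
print(nfd(order_one) == nfd(order_two))
```

Under these assumptions the two PUA orderings remain distinct after
normalization, so whoever produces the data has to settle on one order.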
--Ken