Re: An attempt to focus the PUA discussion [long]

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 01 2004 - 04:04:22 CST


----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: <ernestcline@mindspring.com>
Cc: <unicode@unicode.org>; <kenw@sybase.com>
Sent: Saturday, May 01, 2004 4:16 AM
Subject: Re: An attempt to focus the PUA discussion [long]

> > Providing
> > private use characters with a default ccc other than 0 would
> > open combining classes for private use in a manner that
> > could be consistently normalized regardless of whether
> > the implementation was a party to the private use or not.
>
> Note that these could *not* be any existing PUA code points,
> for the following reason.
>
> Currently:
>
> code point: 0061 100001 0323
> ccc:           0      0  220
>
> Normalized to NFD: <0061, 100001, 0323>
>
> Say we decided that some block of PUA characters, including
> U+100001, would have ccc=230, as a kind of private-use
> combining mark above, at least as a "default" property.
>
> Future:
>
> code point: 0061 100001 0323
> ccc:           0    230  220
>
> Normalized to NFD: <0061, 0323, 100001>
>
> Clearly this is disallowed by normalization stability guarantees.
>
> So if you or anybody else is proposing such a change, make
> sure that it is in the context of defining a *new*
> block of private use characters, off the BMP and not
> Planes 15 or 16.

I have to disagree with this point of view: an application that does not know
what the private code point 100001 means will need to (and MUST) treat it as
having combining class 0 to guarantee the stability of the encoded text, simply
because it does not know whether it is a symbol, a base letter, a combining
character or a format control. It will preserve the order in any case.
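
To illustrate the point about an application that is not a party to the private
convention, here is a minimal sketch using Python's standard unicodedata module
(my choice of tool for illustration only, nothing in this thread depends on it).
It treats 100001 as an ordinary private-use code point with default properties:

```python
import unicodedata

# The sequence from the example: a, private-use U+100001, combining dot below.
PUA = "\U00100001"
seq = "\u0061" + PUA + "\u0323"

# A generic implementation sees the PUA code point as a starter (ccc = 0) ...
pua_ccc = unicodedata.combining(PUA)

# ... so it acts as a barrier and the sequence is left in its original order.
nfd = unicodedata.normalize("NFD", seq)
nfc = unicodedata.normalize("NFC", seq)

print(pua_ccc, nfd == seq, nfc == seq)
```

Both normalization forms leave the text untouched here, which is the
order-preserving behaviour described above.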

An application that _knows_ what 100001 means (because it knows which private
convention its user follows and intends) may assign its own properties,
including changing the combining class from 0 to 230, and thus allow
reordering of the sequence above if that matches the private convention. This
means that the sequence above _will_ be reordered to 0061 0323 100001, and
100001 no longer blocks the composition of 0061 0323 (if such a composition
exists; I did not check what these code points mean, but it does not matter
here).
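
The private reordering described above could be sketched as follows. This is
only an illustration under my own assumptions: the names PRIVATE_CCC and
canonical_reorder are hypothetical, the function performs the canonical
reordering step only (no decomposition), and the override table stands in for
whatever private agreement the application is a party to.

```python
import unicodedata

# Hypothetical private convention: U+100001 is a combining mark above.
PRIVATE_CCC = {"\U00100001": 230}

def ccc(ch, overrides):
    """Combining class of ch, with private overrides taking precedence."""
    return overrides.get(ch, unicodedata.combining(ch))

def canonical_reorder(text, overrides=None):
    """Stable-sort each run of non-starters (ccc != 0) by combining class."""
    overrides = overrides or {}
    out = list(text)
    i = 0
    while i < len(out):
        if ccc(out[i], overrides) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and ccc(out[j], overrides) != 0:
            j += 1
        # sorted() is stable, so marks with equal ccc keep their order.
        out[i:j] = sorted(out[i:j], key=lambda c: ccc(c, overrides))
        i = j
    return "".join(out)

seq = "\u0061\U00100001\u0323"
generic = canonical_reorder(seq)                # no private knowledge
private = canonical_reorder(seq, PRIVATE_CCC)   # party to the convention

# Once 100001 is out of the way, standard NFC can compose 0061 0323.
composed = unicodedata.normalize("NFC", private)
```

Without the override the sequence is unchanged; with it the result is
0061 0323 100001. For what it is worth, the composition of 0061 0323 does
exist: it is 1EA1 (a with dot below), so the last step yields 1EA1 100001.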

---
In fact the private code point 100001 could itself have its own private
decomposition, to 0062 0312 for example (representing a b with turned comma
above), as if this character encoded a composition missing from Unicode. So a
normalizer that knows that private convention could PRIVATELY turn it into:

code point:            0061 100001 0323
ccc:                      0      0  220

_Private_ decomposition:
code point:            0061 0062 0312 0323
ccc:                      0    0  230  220

Normalized to NFD:    <0061, 0062, 0323, 0312>
ccc:                      0     0   220   230

Normalized to NFC:    <0061, 1E05, 0312>
ccc:                      0     0   230

The PUA cannot be limited to encoding only characters that have absolutely no
other way to be encoded in Unicode. Private compositions are also valid for
local private usage (in fact, the glyph substitution rules of many
Unicode-compliant fonts already perform such private compositions, by mapping
the composed glyph to a PUA...).
The example above shows that this can help produce text that no longer
includes PUA code points. The PUA may be a useful tool for local usage, when
private precomposed characters are wanted but missing from the Unicode standard.
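
The private decomposition example above can be reproduced with a short sketch.
The mapping PRIVATE_DECOMP and the function private_normalize are hypothetical
names of mine; the idea is simply to expand the private code point first and
then hand the result to a standard normalizer (here Python's unicodedata).

```python
import unicodedata

# Hypothetical private convention: U+100001 decomposes to b + turned comma above.
PRIVATE_DECOMP = {"\U00100001": "\u0062\u0312"}

def private_normalize(text, form="NFC"):
    """Expand private decompositions, then apply standard normalization."""
    expanded = "".join(PRIVATE_DECOMP.get(ch, ch) for ch in text)
    return unicodedata.normalize(form, expanded)

seq = "\u0061\U00100001\u0323"
nfd = private_normalize(seq, "NFD")
nfc = private_normalize(seq, "NFC")

# Show the results as hex code points.
print([f"{ord(c):04X}" for c in nfd])
print([f"{ord(c):04X}" for c in nfc])
```

The output no longer contains any PUA code point, and a normalizer that does
not know the private convention will handle it consistently from then on.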


This archive was generated by hypermail 2.1.5 : Fri May 07 2004 - 18:45:25 CDT