Comments on DUTR #15

From: John Cowan (cowan@locke.ccil.org)
Date: Wed Dec 23 1998 - 22:34:45 EST


Problems with DUTR #15, Unicode Normalization, revision 10

1. The definition of primary canonical combination (PCC) is flawed,
as is plain from checking it against the examples and design goals. As written,
D + dot_above + dot_below can be resolved to either D-dot_above +
dot_below or D-dot_below + dot_above, but the examples make the
latter the only possible resolution.

A cure would be to add a proviso to definition 2, something like this:
"provided that none of C_1, ... C_n can be primary canonically
combined with C_0."
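
The ambiguity can be shown with a minimal sketch. It assumes Python's
unicodedata module as a stand-in for the character database; the helper
names pair_compose and resolutions are hypothetical, and the code
illustrates the problem rather than reproducing the Draft's definition.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW


def pair_compose(base, mark):
    """Return the precomposed character for base + mark, or None if there is none."""
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else None


def resolutions(base, marks):
    """List every result of combining the base with exactly one of the marks."""
    results = []
    for k, mark in enumerate(marks):
        composed = pair_compose(base, mark)
        if composed is not None:
            # The marks that were not absorbed stay behind, in their original order.
            rest = marks[:k] + marks[k + 1:]
            results.append(composed + "".join(rest))
    return results
```

Here resolutions("D", [DOT_ABOVE, DOT_BELOW]) yields two candidates,
U+1E0A + dot_below and U+1E0C + dot_above, while the normalization that
Python implements produces only the second. The proviso above is meant
to rule out the first.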

2. In my opinion, the concept of PCC currently used in the Draft is
too difficult to implement and not really useful. The point of
PCC, as I take it, is to exploit the easy roundtrippability of
composed forms into legacy character sets (which is why we have
composed forms at all) and to achieve a small measure of compression.
However, cases like D + dot_above + dot_below will not roundtrip
into character sets without combining characters (i.e. most of them),
and character sets with combining characters (the bibliographic ones)
will want the decomposed form anyhow. A system that cannot handle
D + dot_below + dot_above cannot handle D-dot_below + dot_above
either.

Therefore, I propose that the rule be as follows: either the entire
complex character (base plus all combining characters) can be
canonically composed, or none of it can. So D + dot_above + dot_below
is left alone, but S + dot_below + dot_above is correctly composed
into U+1E68.

This has the advantage of conceptual as well as implementation
simplicity: each complex character can just be compared with a
fixed mapping of strings -> characters, which either succeeds or
fails.
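
A minimal sketch of this all-or-nothing rule follows. It assumes
Python's unicodedata module as the source of the fixed mapping, and it
assumes that the combining characters are first put into canonical
order, which the proposal above does not spell out. The names
build_full_map and compose_all_or_nothing are hypothetical.

```python
import unicodedata


def build_full_map():
    """Map each fully decomposed string to the single character it composes to."""
    table = {}
    for cp in range(0x110000):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # Skip characters with no decomposition or only a compatibility one.
        if not decomp or decomp.startswith("<"):
            continue
        nfd = unicodedata.normalize("NFD", ch)
        # Keep only mappings that recompose to the same character; this
        # drops singletons and characters excluded from composition.
        if unicodedata.normalize("NFC", nfd) == ch:
            table[nfd] = ch
    return table


def compose_all_or_nothing(text, table):
    """Compose each complex character only if the whole of it is in the table."""
    text = unicodedata.normalize("NFD", text)  # decompose and canonically order
    out = []
    i = 0
    while i < len(text):
        # A complex character is one character plus all following combining marks.
        j = i + 1
        while j < len(text) and unicodedata.combining(text[j]) != 0:
            j += 1
        cluster = text[i:j]
        out.append(table.get(cluster, cluster))  # succeeds or fails as a whole
        i = j
    return "".join(out)
```

With table = build_full_map(), the input S + dot_below + dot_above
comes back as the single character U+1E68, while D + dot_above +
dot_below comes back uncomposed (though with its two marks in
canonical order).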

3. As a matter of developer convenience, it would be useful to
provide canonical composition mappings in the UTR, rather than
defining them by reference to a particular character database
(which will inevitably become obsolete), modified by the removal
of various characters (the Hebrew composites at least). Such a fixed
list would remain attached to the UTR and provide the desirable
stability property, without making it necessary to eventually
keep two character databases around. There are fewer than 1000
combined characters, each with 2-4 characters in its canonical equivalent.

-- 
John Cowan					cowan@ccil.org
		e'osai ko sarji la lojban.


