Re: Comments on DUTR #15

From: Mark Davis (
Date: Mon Jan 04 1999 - 12:40:44 EST

Thanks for your feedback. My comments are interspersed below.


John Cowan wrote:

> Problems with DUTR #15, Unicode Normalization, revision 10
> 1. The definition of primary canonical combination (PCC) is flawed,
> as is plain by checking the examples and design goals. As written,
> D + dot_above + dot_below can be resolved to either D-dot_above +
> dot_below or D-dot_below + dot_above, but the examples make the
> latter the only possible resolution.
> A cure would be to add a proviso to definition 2, something like this:
> "provided that none of C_1, ... C_n can be primary canonically
> combined with C_0."

Your analysis is taking the definition in isolation. That definition is
used in the specification further down in the document. If you look at
step 2 of each of the specifications, there is no ambiguity since the
first match is found.
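
As a rough illustration of why the order of the marks does not matter in
practice, here is a small Python sketch. It relies on Python's unicodedata
module, which implements the normalization forms as later published, so it
is an assumption that it behaves like revision 10 of the draft for this
case. The helper name show is invented for the example. Canonical ordering
puts dot_below (class 220) before dot_above (class 230), so the first
match against D is always D-dot_below.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

def show(s):
    """Render a string as a list of code points (hypothetical helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# The marks are typed in the "wrong" order on purpose.
typed = "D" + DOT_ABOVE + DOT_BELOW

# Decomposition plus canonical ordering sorts the marks by combining class.
ordered = unicodedata.normalize("NFD", typed)

# Composition then tries the marks in that order, so D + dot_below wins.
composed = unicodedata.normalize("NFC", typed)

print(show(ordered))   # U+0044 U+0323 U+0307
print(show(composed))  # U+1E0C U+0307
```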

> 2. In my opinion, the concept of PCC currently used in the Draft is
> too difficult to implement and not really useful. The point of
> PCC, as I take it, is to exploit the easy roundtrippability of
> composed forms into legacy character sets (which is why we have
> composed forms at all) and to achieve a small measure of compression.
> However, cases like D + dot_above + dot_below will not roundtrip
> into character sets without combining characters (i.e. most of them),
> and character sets with combining characters (the bibliographic ones)
> will want the decomposed form anyhow. A system that cannot handle
> D + dot_below + dot_above cannot handle D-dot_below + dot_above
> either.
> Therefore, I propose that the rule be as follows: either the entire
> complex character (base plus all combining characters) can be
> canonically composed, or none of it can. So D + dot_above + dot_below
> is left alone, but S + dot_below + dot_above is correctly composed
> into U+1E68.
> This has the advantage of conceptual as well as implementation
> simplicity: each complex character can just be compared with a
> fixed mapping of strings -> characters, which either succeeds or
> fails.

I may not have explained it as well as I could, but there is no big
difference in complexity between the two approaches. I wrote a short demo applet (found
on that illustrates the
difference--source for the composition routines is attached if you want
to look at it. (comments welcome!)

There are a few reasons for having a more fine-grained approach. First,
with a coarse-grained approach like the one you are recommending, adding
an irrelevant mark, such as a macron_below, will cause a composite
character such as a-ring to decompose.

Second, the fine-grained approach composes sequences like:
<zero-class> <non-zero-class>* <zero-class>?

The coarse-grained approach composes sequences like:
<base> <combining-mark>*

The fine-grained approach will recompose items such as some of the Indic
split vowels, which the coarse-grained approach does not handle.
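
To make the contrast concrete, here is a minimal Python sketch of the two
approaches. It is not the attached source. The names PAIRS, compose_fine
and compose_coarse are invented for the example, the pair table holds only
the handful of mappings needed here, and the input is assumed to be
already decomposed and canonically ordered. Combining classes come from
Python's unicodedata module.

```python
import unicodedata

# Tiny hand-picked pair table (an assumption for this sketch, not the full data).
PAIRS = {
    ("a", "\u030A"): "\u00E5",       # a + ring above -> a-ring
    ("S", "\u0323"): "\u1E62",       # S + dot below
    ("\u1E62", "\u0307"): "\u1E68",  # S-dot_below + dot above
    ("D", "\u0323"): "\u1E0C",       # D + dot below
    ("D", "\u0307"): "\u1E0A",       # D + dot above
    ("\u09C7", "\u09BE"): "\u09CB",  # Bengali split vowel O
}

def compose_fine(s):
    """Fine-grained: combine each character with the last starter if not blocked."""
    out = []
    starter_pos = None
    last_class = -1  # class of the last uncomposed mark after the starter
    for ch in s:
        cc = unicodedata.combining(ch)
        if starter_pos is not None and last_class < cc:
            composite = PAIRS.get((out[starter_pos], ch))
            if composite is not None:
                out[starter_pos] = composite
                continue
        if cc == 0:
            starter_pos = len(out)
            last_class = -1
        else:
            last_class = cc
        out.append(ch)
    return "".join(out)

def compose_coarse(s):
    """Coarse-grained: a base plus its marks composes entirely or not at all."""
    out = []
    i = 0
    while i < len(s):
        j = i + 1
        while j < len(s) and unicodedata.combining(s[j]) != 0:
            j += 1
        cluster = s[i:j]
        composed = compose_fine(cluster)
        out.append(composed if len(composed) == 1 else cluster)
        i = j
    return "".join(out)

a_ring_macron = "a\u0331\u030A"      # a + macron below + ring above
bengali_o = "\u0995\u09C7\u09BE"     # KA + vowel sign E + vowel sign AA

print(compose_fine(a_ring_macron) == "\u00E5\u0331")   # a-ring survives
print(compose_coarse(a_ring_macron) == a_ring_macron)  # left decomposed
print(compose_fine(bengali_o) == "\u0995\u09CB")       # split vowel recomposed
print(compose_coarse(bengali_o) == bengali_o)          # not handled
```

Both routines are a single pass over the text, which is the sense in which
there is no big difference in complexity between them.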

> 3. As a matter of developer convenience, it would be useful to
> provide canonical composition mappings in the UTR, rather than
> defining them by reference to a particular character database
> (which will inevitably become obsolete) as modified by various
> characters removed (the Hebrew composites at least). Such a fixed
> list would remain attached to the UTR and provide the desirable
> stability property, without making it necessary to eventually
> keep two character databases around. There are less than 1000
> combined characters, with 2-4 characters in the canonical equivalent.

I think this is a very good suggestion. Until the proposal is final, I'd
rather leave it as it is, to make sure that we clearly maintain the
synchronization with the database, but once done, we can build a single
table that is attached to the UTR.

> --
> John Cowan
> e'osai ko sarji la lojban.


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:43 EDT