Re: *Why* are precomposed characters required for "backward compatibility"?

From: Doug Ewell
Date: Thu Jul 11 2002 - 22:14:53 EDT

David Hopwood <david dot hopwood at zetnet dot co dot uk> wrote:

> OTOH, there can be more than one way to represent composites that
> include two or more diacritics in different combining classes (e.g.
> <e with circumflex and dot below>). Technically, that would mean that
> strict byte-for-byte round-tripping of X -> NFD -> X would not be
> guaranteed in every case (unless X also requires that all data is
> normalised). This doesn't apply to T.61, but it does apply to other
> standards such as TIS620 (ISO-Latin-11 / Thai), which have combining
> marks in more than one class.

As you mentioned, this does not apply to T.61 or ISO 6937, because they
do not permit multiple diacritics to be applied to a single base
character.
> Users have basically ignored (if they are even aware of) any
> admonitions from standards institutions to treat U+005E, U+0060 or
> U+007E as spacing accents, and continued to use them for the purposes
> listed below:

Programming languages, notably C and its offspring, have appropriated
these characters for their own purposes. You can't really blame "users"
for that.

> So, there would have been no practical problem with disunifying
> spacing circumflex, grave, and tilde from the above US-ASCII
> characters, so that the preferred representation of all spacing
> diacritics would have been the combining diacritic applied to U+0020.

Except, of course, for any additional user confusion that might have
arisen from encoding three more lookalike "spoof buddies." Unicode is
already taking a lot of heat on the IDN list for not unifying all
"lookalike" pairs.
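
For comparison, a small sketch (again assuming Python's unicodedata
module, with illustrative variable names) of how the character database
treats spacing accents today. The Latin-1 spacing accents carry a
compatibility decomposition to U+0020 plus a combining mark, which is
the representation David describes, while the three ASCII characters
carry no decomposition at all.

```python
import unicodedata

# ASCII characters under discussion: U+005E, U+0060, U+007E.
ascii_accents = {
    f"U+{ord(c):04X}": unicodedata.decomposition(c) for c in "^`~"
}

# Latin-1 spacing accents: DIAERESIS, ACUTE ACCENT, CEDILLA.
latin1_accents = {
    f"U+{ord(c):04X}": unicodedata.decomposition(c)
    for c in "\u00a8\u00b4\u00b8"
}

# An empty string means the character has no decomposition mapping.
for code_point, mapping in {**ascii_accents, **latin1_accents}.items():
    print(code_point, repr(mapping))
```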

-Doug Ewell
 Fullerton, California
