Re: *Why* are precomposed characters required for "backward compatibility"?

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Thu Jul 11 2002 - 21:23:20 EDT


-----BEGIN PGP SIGNED MESSAGE-----

Dan Oscarsson wrote:
> From David Hopwood <david.hopwood@zetnet.co.uk>
>
> >The only difficulty would have been if a pre-existing standard had supported
> >both precomposed and decomposed encodings of the same combining mark. I don't
> >think there are any such standards (other than Unicode as it is now), are
> >there?
>
> Yes. T.61 is still in use.

See <ftp://dkuug.dk/i18n/charmaps/ISO_6937> for a mapping table for ISO 6937,
which is apparently a superset of T.61 (according to
<http://216.239.37.100/search?q=cache:zhcOUX1S73YC:www.terena.nl/projects/multiling/euroml/section09.html>).

> It uses combining accents.

As John Cowan pointed out, neither T.61 nor the other suggested counterexamples
(ISO 5426-1980 and ANSEL) has more than one way to represent a composite with a
single diacritic.

OTOH, there can be more than one way to represent composites that include two or
more diacritics in different combining classes (e.g. <e with circumflex and
dot below>). Technically, that would mean that strict byte-for-byte
round-tripping of X -> NFD -> X would not be guaranteed in every case (unless X
also requires that all data is normalised). This doesn't apply to T.61, but it
does apply to other standards such as TIS-620 (ISO-8859-11, Latin/Thai), which
have combining marks in more than one class.
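
To make the round-tripping point concrete, here is a minimal sketch in Python,
assuming the standard unicodedata module and Python's "tis_620" codec. The
helper names (nfd, roundtrip) and the particular Thai byte sequence are
illustrative choices, not taken from any of the standards discussed.

```python
import unicodedata

def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)

def roundtrip(data: bytes, codec: str) -> bytes:
    """X -> NFD -> X: decode legacy bytes, normalise, re-encode."""
    return nfd(data.decode(codec)).encode(codec)

# <e with circumflex and dot below>, with the two marks in either order.
# COMBINING CIRCUMFLEX ACCENT is class 230, COMBINING DOT BELOW is class 220.
E_CIRCUMFLEX_FIRST = "e\u0302\u0323"
E_DOT_FIRST = "e\u0323\u0302"

# Both spellings share one NFD form (the lower class sorts first).
same_nfd = nfd(E_CIRCUMFLEX_FIRST) == nfd(E_DOT_FIRST) == E_DOT_FIRST

# TIS-620: KO KAI (0xA1), MAI EK (0xE8, class 107), SARA U (0xD8, class 103).
THAI_TONE_FIRST = bytes([0xA1, 0xE8, 0xD8])
THAI_VOWEL_FIRST = bytes([0xA1, 0xD8, 0xE8])

# The tone-first spelling does not survive byte-for-byte, because NFD
# reorders the marks; the vowel-first spelling is already in NFD order.
tone_first_survives = roundtrip(THAI_TONE_FIRST, "tis_620") == THAI_TONE_FIRST
vowel_first_survives = roundtrip(THAI_VOWEL_FIRST, "tis_620") == THAI_VOWEL_FIRST
```

Under these assumptions the tone-first bytes come back reordered, but both
spellings decode to canonically equivalent text, so nothing semantic is lost.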

I very much doubt, though, that this would be considered a significant practical
problem; there would still be no semantic information lost by round-tripping.
Also some of the languages that it might otherwise affect aren't typically
encoded using legacy charsets with combining characters (e.g. the most common
legacy encoding for Vietnamese is VISCII, which has only precomposed characters).

> One place where it is used is in X.500.

Incidentally, AFAICT most recent implementations of X.509 etc. don't bother to
convert correctly to and from T.61 :-(

> It also have the nice way where the combining accent
> comes before the base character making it easier to parse.

That's debatable. I don't think it matters one way or the other as long as only
one ordering is used.
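
As a rough illustration of why the ordering is mostly a matter of convention,
here is a sketch of converting an accent-before-base byte stream to Unicode's
base-before-accent order. The accent table is a small illustrative subset that
I am assuming matches the ISO 6937 non-spacing accent positions (check it
against the charmap above before relying on it); the function name is
hypothetical, and base characters are assumed to be plain ASCII.

```python
import unicodedata

# Assumed subset of ISO 6937 / T.61 non-spacing accent bytes (illustrative).
PREFIX_ACCENTS = {
    0xC1: "\u0300",  # grave
    0xC2: "\u0301",  # acute
    0xC3: "\u0302",  # circumflex
    0xC4: "\u0303",  # tilde
    0xC8: "\u0308",  # diaeresis
}

def prefix_to_unicode(data: bytes) -> str:
    """Move each accent from before its base character to after it."""
    out = []
    pending = []  # combining marks waiting for their base character
    for byte in data:
        if byte in PREFIX_ACCENTS:
            pending.append(PREFIX_ACCENTS[byte])
        elif byte < 0x80:
            out.append(chr(byte))   # base character (ASCII only in this sketch)
            out.extend(pending)     # accents now follow the base
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} not handled by this sketch")
    if pending:
        raise ValueError("accent with no following base character")
    return "".join(out)

decomposed = prefix_to_unicode(b"caf\xc2e")        # accent byte precedes 'e'
composed = unicodedata.normalize("NFC", decomposed)
```

Either ordering needs one item of state in the parser: a pending accent here,
or a pending base character when the accent comes second.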

> >(Obviously, an NFD-only Unicode would not have been an extension of ISO-8859-1.
> >That wouldn't have been much of a loss; it would still have been an extension
> >of US-ASCII.)
>
> NFD should not be an extension of ASCII. There are several spacing
> accents in ASCII that should be decomposed just like the spacing accents in
> ISO 8859-1 are decomposed.
> All or none spacing accents should be decomposed.

Users have basically ignored (if they are even aware of them) any admonitions
from standards institutions to treat U+005E, U+0060 or U+007E as spacing
accents, and continued to use them for the purposes listed below:

Character                   Common uses

U+005E CIRCUMFLEX ACCENT    to indicate superscripts; 'to the power of';
                            'exclusive-or' in some programming languages

U+0060 GRAVE ACCENT         opening single quote; 'backtick' in some programming
                            & shell languages

U+007E TILDE                as prefix: 'not' in programming languages; home
                              directories in Unix; symbol for 'approximately'
                            as suffix: backup filenames in Unix
                            (preferred glyph is middle tilde, which is not the
                             same as a spacing tilde accent anyway)

For all of these characters, use as a spacing diacritic is actually much
less common than any of the other uses listed above. Even when they are used
to represent accents, it is usually as a fallback representation of a combining
accent, not as a true spacing accent.

So, there would have been no practical problem with disunifying spacing
circumflex, grave, and tilde from the above US-ASCII characters, so that the
preferred representation of all spacing diacritics would have been the
combining diacritic applied to U+0020.
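
For comparison with Unicode as it actually stands, here is a small sketch using
Python's unicodedata module (the helper name is hypothetical). As far as I can
tell, the ISO 8859-1 spacing accents carry only compatibility decompositions to
<U+0020, combining mark>, which NFKD applies and NFD does not, while the three
US-ASCII characters have no decomposition at all.

```python
import unicodedata

def decomposition_report(ch: str) -> tuple:
    """Return (decomposition field, NFD form, NFKD form) for one character."""
    return (
        unicodedata.decomposition(ch),
        unicodedata.normalize("NFD", ch),
        unicodedata.normalize("NFKD", ch),
    )

# Spacing accents inherited from ISO 8859-1.
LATIN1_SPACING = ["\u00a8", "\u00af", "\u00b4", "\u00b8"]  # diaeresis, macron, acute, cedilla
# The US-ASCII characters discussed above.
ASCII_ACCENT_LIKE = ["^", "`", "~"]

for ch in LATIN1_SPACING + ASCII_ACCENT_LIKE:
    field, nfd_form, nfkd_form = decomposition_report(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"decomposition={field!r} NFD={nfd_form!r} NFKD={nfkd_form!r}")
```

The disunification suggested above would amount to giving the spacing
circumflex, grave and tilde the same <space, combining mark> treatment, under
separate code points from U+005E, U+0060 and U+007E.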

> I could ask why are not precomposed characters preferred to be used, if
> they exist?

They are, in HTML, XML, etc.
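
A minimal sketch of what "preferred" means in practice, assuming the
normalisation form in question is NFC and using Python's unicodedata module
(the helper names are mine): a combining sequence is replaced by a precomposed
character where one exists, and left alone where none does.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Compose to precomposed characters wherever Unicode defines them."""
    return unicodedata.normalize("NFC", text)

def is_nfc(text: str) -> bool:
    """True if text is already in the precomposed-preferring form."""
    return text == to_nfc(text)

e_acute = to_nfc("e\u0301")              # composes to U+00E9
e_circ_dot = to_nfc("e\u0302\u0323")     # composes to U+1EC7, whatever the mark order
q_tilde = to_nfc("q\u0303")              # no precomposed form exists, so unchanged
```

A receiver that wants precomposed text can check is_nfc() on input and
normalise only when the check fails.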

> For a lot of text handling precomposed characters are much easier to
> handle, especially when the combining character comes after instead of
> before the base character.

I thought you said approximately the opposite in relation to T.61 above :-)

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPS4vADkCAxeYt5gVAQEyrgf/YQQebmjtCEx9pazUFuxATH5ABvrazbIh
JLhoqh9/1MimkP1asBN08L3VUAz/pDvVFj/TKbGqqPEgrKTkzfbaGIPcACLzsPzV
oUJMx7aBerQZXzHQLRFqVlQ9Q37IRTEm/c9+KXbOwNVEBGCshUSymvyrSQ2mT0bM
s6I8bVwtrkdL4kffAGxaqlZdCG24VJSTfUOjKq1kK0MQecOCLk/sDmz+koxuD8Cq
KQZIEupDG+MsOPjWbRRf1kzgAQEtM0fRa/PToZmRQDzz9zmz0ZJK9nKAJHGNZdO+
St9qQcwua4AcAZ22O6R8zi3CRYT5cCzHwVA/08txZkT8fFLltGi3wQ==
=AEOv
-----END PGP SIGNATURE-----


