Re: *Why* are precomposed characters required for "backward compatibility"?

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Sat Jul 13 2002 - 02:54:06 EDT


-----BEGIN PGP SIGNED MESSAGE-----

Dan Oscarsson wrote:
> >From: David Hopwood <david.hopwood@zetnet.co.uk>
>
> >For all of these characters, use as a spacing diacritic is actually much
> >less common than any of the other uses listed above. Even when they are used
> >to represent accents, it is usually as a fallback representation of a combining
> >accent, not as a true spacing accent.
> >
> >So, there would have been no practical problem with disunifying spacing
> >circumflex, grave, and tilde from the above US-ASCII characters, so that the
> >preferred representation of all spacing diacritics would have been the
> >combining diacritic applied to U+0020.
>
> Apart from the problems Kenneth Whistler mentioned.

I'm not sure which post you're referring to. Possibly message-id
<200207120235.TAA03206@birdie.sybase.com>? That post argued that characters
in the US-ASCII range should not have decompositions (which I entirely agree
with), but it did not give any arguments against disunifying spacing circumflex,
grave and tilde from U+005E (caret), U+0060 (backtick), and U+007E (middle tilde).
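
A minimal sketch of the decomposition point, assuming Python's standard
`unicodedata` module (not something mentioned in this thread) as a convenient
view of the Unicode Character Database. It reads the raw decomposition field
for the three US-ASCII characters and, for contrast, for four spacing accents
from the ISO 8859-1 range. The dictionary names are illustrative only.

```python
import unicodedata

# US-ASCII characters that are sometimes pressed into service as accents.
ascii_chars = {"caret": "\u005E", "backtick": "\u0060", "tilde": "\u007E"}

# Spacing accents from the ISO 8859-1 range, for comparison.
latin1_accents = {
    "diaeresis": "\u00A8",
    "macron": "\u00AF",
    "acute": "\u00B4",
    "cedilla": "\u00B8",
}

# unicodedata.decomposition() returns the raw mapping as a string of hex
# code points (with a <tag> prefix for compatibility mappings), or "" if
# the character has no decomposition.
ascii_decomps = {n: unicodedata.decomposition(c) for n, c in ascii_chars.items()}
latin1_decomps = {n: unicodedata.decomposition(c) for n, c in latin1_accents.items()}

for name, mapping in {**ascii_decomps, **latin1_decomps}.items():
    print(f"{name:10} {mapping or '(none)'}")
```

The ASCII characters report no decomposition, while the ISO 8859-1 spacing
accents each carry a compatibility mapping to U+0020 followed by the
corresponding combining mark.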

> You would get the same problems with the ISO 8859-1 spacing accents but
> there are less people using them than with those in ASCII.

None of the spacing accents are commonly used, at least in Latin scripts.
Perhaps we have different ideas of what a spacing accent is? To me, the
following sentence contains spacing accents:

  "French uses the accents acute ( ́), grave ( ̀), circumflex ( ̂),
   diaeresis ( ̈), and cedilla ( ̧)."

I.e. almost the only use of spacing accents is to talk about accents.
That's a perfectly reasonable thing to do, and it should be supported, but
something that is used so infrequently is not going to cause many complaints
regardless of how the encoding is handled.

A caveat to this, as I said above, is that spacing accent characters are
sometimes used as a *fallback* when a charset has no representation for a
particular composite abstract character. But that is never necessary in
Unicode; it should only occur in Unicode text as a result of conversion
from some other charset.

> One problem is that some characters can be used as an accent and as
> a normal base character,

No, there are no Unicode characters that can be used both as a combining
accent and as a base character. The ISO-Latin standards were ambiguous about
this, but Unicode is not.
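
One way to see this distinction, sketched with Python's standard
`unicodedata` module (an assumption of this example, not part of the
discussion): every character has exactly one general category and one
canonical combining class, so a combining accent and its spacing counterpart
are separate code points with different properties. The helper name
`describe` is hypothetical.

```python
import unicodedata

def describe(ch: str) -> tuple[str, int]:
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

combining_acute = describe("\u0301")  # COMBINING ACUTE ACCENT
spacing_acute = describe("\u00B4")    # ACUTE ACCENT (spacing)
ascii_caret = describe("\u005E")      # CIRCUMFLEX ACCENT (the US-ASCII caret)

print("U+0301:", combining_acute)
print("U+00B4:", spacing_acute)
print("U+005E:", ascii_caret)
```

The combining mark is a nonspacing mark (category Mn) with a non-zero
combining class; the spacing characters are symbols with combining class 0.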

> and some characters that Unicode defines a decomposition of, is not a
> composed character in some countries.

Huh? I don't understand what you're trying to say here at all. What
do countries have to do with this?

> So in some contexts is is wrong to decompose some characters that
> could be ok to decompose in others.

I assume you're talking only about compatibility decompositions (by
definition, canonically equivalent strings are always semantically
equivalent in all contexts). If so then I agree; IMHO it really only makes
sense to use the NFKC/D decompositions to convert Unicode text containing
compatibility characters to marked-up text, not as a normalisation form as
such.
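
The difference between the two kinds of equivalence can be sketched as
follows, again assuming Python's standard `unicodedata` module. The sample
characters (e with acute, the "fi" ligature, superscript two) are chosen for
illustration and are not taken from the thread.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

# Canonical equivalence: precomposed and decomposed spellings of the same
# abstract character normalise to the same string.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
canonically_equal = nfc(precomposed) == nfc(decomposed)

# Compatibility equivalence: NFC leaves compatibility characters alone,
# while NFKC replaces them and discards the formatting distinction.
ligature = "\uFB01"      # LATIN SMALL LIGATURE FI
superscript = "x\u00B2"  # x followed by SUPERSCRIPT TWO

print(canonically_equal)
print(nfc(ligature) == ligature, nfkc(ligature))
print(nfc(superscript) == superscript, nfkc(superscript))
```

After NFKC the superscript is an ordinary digit, which is why the result is
only faithful if the lost distinction is carried some other way, such as
markup.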

> That is one reason I prefer NFC as it do not decompose characters.

NFC does decompose some characters (those in the "composition exclusions"
list).
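
A small illustration, assuming Python's standard `unicodedata` module and
using U+0958 DEVANAGARI LETTER QA as the example (my choice of character; it
appears in the composition exclusions list). NFC first decomposes the string
and then recomposes it, but excluded characters are never recomposed.

```python
import unicodedata

# U+0958 has a canonical decomposition to U+0915 U+093C and is listed in
# the composition exclusions, so NFC leaves it in decomposed form.
qa = "\u0958"
nfc_qa = unicodedata.normalize("NFC", qa)

# For contrast, a sequence that NFC does recompose.
e_acute = "e\u0301"
nfc_e_acute = unicodedata.normalize("NFC", e_acute)

print([f"U+{ord(c):04X}" for c in nfc_qa])
print([f"U+{ord(c):04X}" for c in nfc_e_acute])
```

So a single precomposed code point can go in and two code points can come
out, even under NFC.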

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPS/OQzkCAxeYt5gVAQGHqAgAhcVbLT+Qebk8l8zVt94oHb9q2c+0Ddpf
QyftnzaWxERSDkac1N3IFSJTYs+MFmMjxwaGzavN1+U1mzKSJiDTNyOOc5RUb0of
4ctxFzjAYB+cizW0w6Kl8G3GT/iAk0EkpdDCmOozt85i99M/n4NdeEQyE/PlYuzg
XtC59f+uTCjXlxf19ko4Oel512b+lFQG4yBAgzK74KGhLJ9E6rZ4S5HQfZJVDawP
QLRmZ2s+PhGoy5aPekkPzzSdFy5tdaMFA5rMz6gzy7o0g8SiAtQgWS83FJ40NTzi
4Ec4sHd7tpgtkmb+0mFtDqqhkr7akcEjYB/VhuQZdBs/31kJV+3n/w==
=A96M
-----END PGP SIGNATURE-----


