Re: Abstract character?

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Tue Jul 23 2002 - 07:41:28 EDT


-----BEGIN PGP SIGNED MESSAGE-----

Mark Davis wrote:
> A small correction to Ken's message:
>
> > The Unicode scalar value
> > definitionally excludes D800..DFFF, which are only code unit
> > values used in UTF-16, and which are not code points associated
> > with any well-formed UTF code unit sequences.
>
> The UTC has decided to make scalar value mean unambiguously the
> code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points.

I think it would be a mistake for the standard to refer to "surrogate
code points". The term "code point" is used for other CCS's where there
may also be gaps in the code space; in that case, the gaps are not
considered valid code points. When 0xD800..0xDFFF are used in UTF-16,
they are used as code units, not code points. As Unicode code points,
0xD800..0xDFFF are (or at least should be) invalid in the same sense
that 0x110000 is.

I.e. IMHO "Unicode scalar value" and "Unicode code point" should be
synonyms, with the set of valid values 0..0xD7FF, 0xE000..0x10FFFF.
"code point" should be defined as an integer corresponding to an
encoded character in any CCS, not just Unicode.
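
To make the proposed set concrete, here is a minimal sketch in Python. The
function name is my own invention for illustration and does not come from
any standard; it simply tests membership in 0..0xD7FF, 0xE000..0x10FFFF.

```python
def is_scalar_value(n: int) -> bool:
    """Return True if n is in the proposed valid set.

    The set is 0..0xD7FF together with 0xE000..0x10FFFF, i.e. the
    code space with the surrogate range 0xD800..0xDFFF removed.
    (Hypothetical helper, named here only for illustration.)
    """
    return 0 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF
```

Under this definition 0xD800 and 0x110000 are rejected for the same reason:
neither is in the set.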

> While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.

The integers 0xD800..0xDFFF are legal *as code units* in UTF-16. IMHO
allowing them as code points (i.e. allowing any process to conformantly
generate unpaired surrogates) is a really bad idea. The set of code
point sequences that are validly representable in each UTF should be
identical (which ensures that mappings between UTFs are bijective and
always succeed iff the input is valid in the source UTF).
I.e. U+D800..DFFF, like U+110000, should be undesignated and
unrepresentable.
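
As a rough sketch of what I mean (an illustration under my own assumptions,
not text from any standard), a UTF-16 decoder and encoder that treat
0xD800..0xDFFF purely as code units might look like the following in Python.
The function names are hypothetical. Unpaired surrogates are rejected on
decode, and surrogate values are rejected as code points on encode, so the
two directions are inverses on valid input.

```python
def decode_utf16_units(units):
    """Map 16-bit code units to code points, rejecting unpaired surrogates."""
    points = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            low = units[i + 1]
            points.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError(f"unpaired low surrogate at index {i}")
        elif 0 <= u <= 0xFFFF:
            points.append(u)
            i += 1
        else:
            raise ValueError(f"not a 16-bit code unit at index {i}")
    return points


def encode_utf16_units(points):
    """Map code points to 16-bit code units, rejecting 0xD800..0xDFFF."""
    units = []
    for p in points:
        if not (0 <= p <= 0xD7FF or 0xE000 <= p <= 0x10FFFF):
            raise ValueError(f"not a valid code point: {p:#x}")
        if p < 0x10000:
            units.append(p)
        else:
            offset = p - 0x10000
            units.append(0xD800 + (offset >> 10))    # high surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low surrogate
    return units
```

With both checks in place, every input that decodes successfully re-encodes
to the same code units, and a lone 0xD800 fails in either direction.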

(UTF-16 is not the only case: the definition of UTF-32 in UAX #19 does
not specifically exclude 0xD800..0xDFFF either, although the ISO 10646
definition does. Here I think Unicode should be changed to be consistent
with ISO 10646.)

> Ken is pushing for this change; I believe it would be a very bad idea.

What precisely do you think would be a bad idea?

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPT0/MjkCAxeYt5gVAQEOvQf8DEmtbZpQ59nSSbVa8HN/BXCoMG/UOqYy
lSknQ+dUaIS3S0QgpVSIs5tFOjShw2YZ117cXioxzADMbU2MlbY3NITJYkatbgqf
UWIH9ENnqe0YDLdg1FWjyFFWuYLz1kf7c4M16OblhrHMJCjc9+Gba8dikIjJolWi
WNtzfX9ftuzcvFwssReGjyemXMhN6ugeUv3T1hGXjMRT834rSG9eLEr98BWpE1xR
m8wQPBWizSCDF3xFrRg6SwfSt1g+SrhGjLd/ccG96ENdC1XBHYyF4WgggdIO6Ilb
0WSaLbBV4uEPxyPihsy4pV3w8GLRXDhwpK34InLRHJFkMcgNWMTE2w==
=Kn1u
-----END PGP SIGNATURE-----


