Re: Astral planes (was: RE: Plane One use, was Re: HTML Validation)

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Wed Dec 19 2001 - 01:01:03 EST


-----BEGIN PGP SIGNED MESSAGE-----

Rick Cameron wrote:
> From: Asmus Freytag [mailto:asmusf@ix.netcom.com]
> >Of course, the Unicode Standard 3.0 doesn't even mention a 32-bit
> >encoding - but that's not stopping uniphiles from storing Unicode data
> >in their wchar_t's!
>
> The only way such use is conformant is if it follows UTF-32. The latter is
> clearly specified in http://www.unicode.org/unicode/reports/tr19/ as:
>
> "The following lists the important features of this encoding form:
>
> UTF-32 is restricted in values to the range 0..10FFFF, which precisely
> matches the range of characters defined in the Unicode Standard (and other
> standards such as XML), and those representable by UTF-8 and UTF-16. "

Well, that's not quite true: the surrogate code points D800..DFFF lie
within 0..10FFFF but are not representable in UTF-16.
I was under the impression that as of Unicode 3.2 they would not be
legally representable in UTF-8 or UTF-32 either (i.e. that all mappings
between UTFs would be bijections between the sets of legal strings, which
is a good thing).
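
To illustrate why D800..DFFF have no UTF-16 form, here is a rough Python
sketch of encoding a single code point. The function name is made up for
this example and the code is not taken from TR19 or the Standard; it only
assumes the usual surrogate-pair arithmetic.

```python
def utf16_code_units(cp):
    """Return the list of UTF-16 code units for one code point (sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the range 0..10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # These values are reserved for the two halves of a surrogate
        # pair, so a lone surrogate code point cannot be encoded.
        raise ValueError("surrogate code point has no UTF-16 form")
    if cp < 0x10000:
        return [cp]                      # one 16-bit code unit
    v = cp - 0x10000                     # 20 bits, split into 10 + 10
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
```

With this, utf16_code_units(0x10000) gives the pair D800 DC00, while
utf16_code_units(0xD800) raises ValueError.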

Does the official definition of "character" include non-characters?

Also, I don't think the comment about XML is correct, taking into account
the word "precisely"; XML allows the following subset of code points
(from http://www.w3.org/TR/REC-xml#charsets):

  Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
           [#xE000-#xFFFD] | [#x10000-#x10FFFF]
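
As a hedged illustration, the production above can be transcribed into
a Python predicate. The name is_xml_char is hypothetical; only the ranges
come from the XML 1.0 Recommendation.

```python
def is_xml_char(cp):
    """True if code point cp matches the XML 1.0 Char production (sketch)."""
    return (cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
            or 0x20 <= cp <= 0xD7FF         # up to just below the surrogates
            or 0xE000 <= cp <= 0xFFFD       # skips FFFE and FFFF
            or 0x10000 <= cp <= 0x10FFFF)   # supplementary planes
```

For example, is_xml_char(0x0) and is_xml_char(0xFFFE) are both False,
although both values lie in 0..10FFFF.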

So, I think the statement quoted from TR19 should be:

  UTF-32 is restricted in code point values to the ranges 0..D7FF and
  E000..10FFFF, which precisely matches the set of code points designated
  by the Unicode Standard (excluding surrogate code points), and those
  representable by UTF-8 and UTF-16.

  This set also matches the set of characters used in other standards
  such as XML and HTML 4.01, with the exception of some control codes and
  non-character codes.
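
The exceptions mentioned in that wording can be enumerated for XML with
a small Python sketch. The range tables and function names below are
assumptions made for this example: the first table encodes the proposed
wording, the second the XML 1.0 Char production. HTML 4.01 is not covered.

```python
# Code points allowed by the proposed wording (everything but surrogates).
SCALAR_RANGES = [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)]

# Code points allowed by the XML 1.0 Char production.
XML_RANGES = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

def in_ranges(cp, ranges):
    """True if cp falls inside any (low, high) inclusive range."""
    return any(lo <= cp <= hi for lo, hi in ranges)

def excluded_by_xml():
    """Code points in SCALAR_RANGES that XML 1.0 does not allow."""
    return [cp for cp in range(0x110000)
            if in_ranges(cp, SCALAR_RANGES) and not in_ranges(cp, XML_RANGES)]
```

Calling excluded_by_xml() yields 31 code points: 29 C0 control codes
(everything below 20 except 9, A and D) plus FFFE and FFFF.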

Note "designated" instead of "defined" - is that the right term?

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPCAs3TkCAxeYt5gVAQHt9AgA0SAyzfJqWD/bEiOT6YXKHoRhj8f88eGu
2jWFubNiYXAj3RR3NZruIR61WUk0DVtIBXCCmhxBh0ZLIAzZguR2mlO7k6T0OpJk
h8qEBEMOeaCNLwrFGq7WKZRanznB9nuoG+OikO7FAQ0/VjeGk+9joJJLZDN8BxRO
8DXuvwgjOUynumlAp71fvQzgj20bTXT1y3ckh37ZKAH+3KWBB8Yrdw2n75n+05uq
AkL94iKSO+CWzNUUCKwXPEehI3/mV7y2mbjzCVquOQ+KF1/QIqMcLp5JdaP9OyEI
b1MBL2ezmAOVustJyh/ofWeM8Ykke0jvELrsjHRKvp2cpZ1PSSITpg==
=sa0a
-----END PGP SIGNATURE-----
