Re: XML and ISO 10646 planes beyond the BMP

From: Misha Wolf (
Date: Thu Aug 14 1997 - 07:44:31 EDT

Ken Whistler mailed (the unicode list) a derogatory remark about
numerologists. Well, you haven't seen anything yet. Liam Quin informs
us below that numbers in a DESCSET are limited to *8 digits*. I've heard
of the 7 days of the week and the 12 apostles but not of the 8 digit
SGML limit. Presumably, the SGML designers consulted a different burning

Doing some further numerology, I checked how many digits we need to
accommodate the 17 planes (!) of Unicode:

   (256 * 256 * 17) - 1 = 1114111

Above I've subtracted 1, as characters are numbered from 0 and it makes a
pretty pattern.

For the DESCSET itself, we need to subtract 160 from (256 * 256 * 17):

   (256 * 256 * 17) - 160 = 1113952

So that weighs in comfortably under 8 digits. What a relief!
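
For anyone who wants to check the numerology, here is a minimal Python
sketch of the arithmetic above. The names (PLANE_SIZE, SKIPPED,
digit_count) are mine and purely illustrative; SKIPPED is simply the 160
subtracted above.

```python
# Sketch: verify the digit counts for 17 planes of 256 * 256 characters.
PLANE_SIZE = 256 * 256   # characters per plane
PLANES = 17              # planes considered in the message above
SKIPPED = 160            # assumption: the 160 positions subtracted for the DESCSET

total = PLANE_SIZE * PLANES
highest_number = total - 1        # characters are numbered from 0
descset_count = total - SKIPPED   # count used in the DESCSET


def digit_count(n: int) -> int:
    """Return the number of decimal digits in a non-negative integer."""
    return len(str(n))


print(highest_number, digit_count(highest_number))
print(descset_count, digit_count(descset_count))
```

Both values come out at 7 digits, one short of the limit.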



I agree with Misha. Certainly XML should not be restricted to
what the theosophists may call the Base Material Plane.

> S1. Both in the "SGML Declaration for XML" and in the box labelled
> "Scope Document", amend the BASESET to reference IR-177 (ie UCS-4)
> rather than IR-176 (ie UCS-2). This will bring the XML spec into
> line with the HTML 4.0 spec.
Yes, this is sensible. Also, not being able to represent HTML in XML
because of using a different character set would be plain silly.

> S2. In both these places, amend the DESCSET accordingly: In the "SGML
> Declaration for XML", change the "65376" to "2147483488". In the
> box labelled "Scope Document", amend the "65536" to "2147483648".
> The SGML Declaration for HTML 4.0 has "2147483486" rather than
> "2147483488" (see the least significant digit), but I'll seek to
> have that changed.
I understood that there was an 8 character limit on the length of these
numbers, so that 99999999 would be the largest possible - is that not
the case? Where's Dave Pederson when you need him?
See p. 346 in the SGML Handbook, for example. Every Number is in
fact a Name, hence NAMECASE and NAMELEN apply -- this is explicitly stated
in 9.3.1 Quantities. Since the Reference Concrete Syntax applies in
the SGML declaration (p. 451, or Clause 13 of ISO 8879:1986), NAMELEN is
8, so numbers are limited to 8 digits.
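
As a rough illustration of that limit, the following Python sketch tests
which of the numbers mentioned in this thread fit within NAMELEN 8. It
only counts decimal digits; it is not an SGML parser, and the helper name
fits_namelen is hypothetical.

```python
# Sketch: which DESCSET numbers from this thread fit in NAMELEN = 8?
NAMELEN = 8  # Reference Concrete Syntax value, as cited above


def fits_namelen(number: int, namelen: int = NAMELEN) -> bool:
    """True if the decimal form of `number` has at most `namelen` digits."""
    return len(str(number)) <= namelen


# Values quoted in the thread (current and proposed DESCSET numbers).
candidates = [65376, 65536, 99999999, 2147483488, 2147483648]

results = {n: fits_namelen(n) for n in candidates}
for n, ok in results.items():
    print(n, len(str(n)), "fits" if ok else "too long")
```

The two proposed UCS-4 values have 10 digits each, so they fail the check,
while the existing values and 99999999 pass.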

Frankly I'd like to see a default NAMELEN of 0 as meaning unlimited,
or limited only by available system resources, but that's not the
SGML of today.

So XML can't quite reach the far heights of the upper planes just yet.
And the HTML 4 declaration sounds as if it's bogus.

What is the best compromise?


Liam Quin --  the barefoot typographer -- Toronto
lq-text: freely available Unix text retrieval

email address: liamquin, at host: interlog dot com

------------------------------------------------------------------------
Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of
Reuters Ltd.

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:36 EDT