Re: XML and ISO 10646 planes beyond the BMP

From: Misha Wolf (misha.wolf@reuters.com)
Date: Mon Aug 11 1997 - 07:39:02 EDT


I've just sent the appended mail to the XML working group.

Misha

~~~

Background info
----------------

ISO 10646 supports a large number of planes (32,768 in all), each
holding 64K code positions. Unicode 2.0 supports
the first seventeen planes: plane 0, aka the Basic Multilingual Plane
(BMP), and the next sixteen planes (planes 1-16). Encoding schemes
vary in the planes they support:

     UCS-4 All planes

     UTF-8 All planes

     UTF-16 Planes 0-16

     UCS-2 Plane 0
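
The list above can be checked by dividing each scheme's highest code
position by the plane size. This is only a sketch: the limits in the
table are my own reading (UCS-4 and the six-octet form of UTF-8 both
reach 31 bits, UTF-16 stops at 0x10FFFF, UCS-2 at 0xFFFF), and the names
HIGHEST_POSITION and planes_reachable are made up for illustration.

```python
# Each ISO 10646 plane holds 64K (65536) code positions.
PLANE_SIZE = 2 ** 16

# Highest code position per scheme (assumed limits, see the note above).
HIGHEST_POSITION = {
    "UCS-4": 2 ** 31 - 1,
    "UTF-8": 2 ** 31 - 1,   # six-octet form
    "UTF-16": 0x10FFFF,
    "UCS-2": 0xFFFF,
}

def planes_reachable(scheme):
    """Return how many planes, counting from plane 0, a scheme can address."""
    return HIGHEST_POSITION[scheme] // PLANE_SIZE + 1

for scheme in HIGHEST_POSITION:
    print(scheme, planes_reachable(scheme))
```

UTF-16 comes out at 17 planes (0-16) and UCS-2 at a single plane, which
matches the list.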

UTF-16 encodes all BMP characters identically to UCS-2. Thus the only
difference between UTF-16 and UCS-2 is that UCS-2 cannot encode planes
1-16.
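
To make the relationship concrete, here is a minimal sketch of how a
code position maps to 16-bit UTF-16 values. A BMP position yields one
value, the same one UCS-2 would use; a position in planes 1-16 yields a
pair of surrogate values. The function names are hypothetical, and the
result is compared against Python's built-in "utf-16-be" codec only as a
sanity check.

```python
import struct

def plane_of(code_point):
    """Return the plane number of a code position (64K positions per plane)."""
    return code_point >> 16

def utf16_units(code_point):
    """Encode one code position as a list of 16-bit UTF-16 values."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("UTF-16 reaches planes 0-16 only")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        # BMP: a single 16-bit value, identical to UCS-2.
        return [code_point]
    # Planes 1-16: split the 20-bit offset across a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

def codec_units(code_point):
    """Same encoding via Python's codec, unpacked into 16-bit values."""
    data = chr(code_point).encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

for cp in (0x0041, 0xFFFD, 0x10000, 0x10FFFF):
    print(hex(cp), plane_of(cp), [hex(u) for u in utf16_units(cp)])
```

UCS-2 has no equivalent of the last branch, which is why it stops at the
end of plane 0.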

Problem
--------

The XML spec is self-contradictory in regard to ISO 10646 planes beyond
the BMP. On the one hand:

L1. Section 4.3.3 includes the sentence: "It is recognised that ...
     the use of additional planes ... may be required.".

L2. Section 4.3.3 includes the sentence: "The values ... UTF-16 ...
     should be used for the various encodings ... of ... ISO/IEC 10646
     ...".

On the other hand:

R1. The SGML Declaration for XML is based on UCS-2, ie admits only the
     BMP.

R2. UTF-16 is, I think, mentioned nowhere else in the XML spec.

Proposed solution
------------------

S1. Both in the "SGML Declaration for XML" and in the box labelled
     "Scope Document", amend the BASESET to reference IR-177 (ie UCS-4)
     rather than IR-176 (ie UCS-2). This will bring the XML spec into
     line with the HTML 4.0 spec.

S2. In both these places, amend the DESCSET accordingly: In the "SGML
     Declaration for XML", change the "65376" to "2147483488". In the
     box labelled "Scope Document", amend the "65536" to "2147483648".
     The SGML Declaration for HTML 4.0 has "2147483486" rather than
     "2147483488" (see the least significant digit), but I'll seek to
     have that changed.
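
     The figures in S2 can be checked with a little arithmetic. This
     sketch assumes that the DESCSET range in question starts at
     position 160; that is inferred from 65536 - 65376 = 160 rather
     than quoted from either declaration, and the names used are
     purely illustrative.

```python
# Total code positions in each base set.
UCS2_POSITIONS = 2 ** 16   # 65536
UCS4_POSITIONS = 2 ** 31   # 2147483648

def count_from(start, total_positions):
    """Number of positions from `start` to the end of the base set."""
    return total_positions - start

# Assumed start of the range, derived from the current UCS-2 figure.
START = UCS2_POSITIONS - 65376

print(START)                                            # start position
print(count_from(START, UCS4_POSITIONS))                # proposed count
print(count_from(START, UCS4_POSITIONS) - 2147483486)   # gap to HTML 4.0 figure
```

     Under that assumption the proposed count is 2147483488, two more
     than the figure in the HTML 4.0 declaration.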

S3. Replace *all* other instances of "UCS-2" with "UTF-16".

S4. Add a note explaining that UCS-2 is a subset of UTF-16.

------------------------------------------------------------------------
Misha Wolf Email: misha.wolf@reuters.com 85 Fleet Street
Standards Manager Voice: +44 171 542 6722 London EC4P 4AJ
Reuters Limited Fax : +44 171 542 8314 UK
------------------------------------------------------------------------
Eleventh International Unicode Conference, Sep 2-5 1997, www.unicode.org

------------------------------------------------------------------------
Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of
Reuters Ltd.
