Re: XML and ISO 10646 planes beyond the BMP

From: Misha Wolf (misha.wolf@reuters.com)
Date: Wed Aug 13 1997 - 12:56:01 EDT


In this mail, I'm trying to deal with inter-related issues relevant to
three mailing lists: xml, html and unicode. First an extract from the
HTML 4.0 draft spec:

   10.1.2 The SGML Declaration

   <!SGML "ISO 8879:1986"
   --
        SGML Declaration for HyperText Markup Language version 4.0

        With support for Unicode UCS-4 and increased limits
        for tag and literal lengths etc.
   --

   CHARSET
            BASESET "ISO Registration Number 177//CHARSET
                      ISO/IEC 10646-1:1993 UCS-4 with
                      implementation level 3//ESC 2/5 2/15 4/6"
            DESCSET 0 9 UNUSED
                     9 2 9
                     11 2 UNUSED
                     13 1 13
                     14 18 UNUSED
                     32 95 32
                     127 1 UNUSED
                     128 32 UNUSED
                     160 2147483486 160
   --
       In ISO 10646, the positions with hexadecimal
       values 0000D800 - 0000DFFF, used in the UTF-16
       encoding of UCS-4, are reserved, as well as the last
       two code values in each plane of UCS-4, i.e. all
       values of the hexadecimal form xxxxFFFE or xxxxFFFF.
       These code values or the corresponding numeric
       character references must not be included when
       generating a new HTML document, and they should be
       ignored if encountered when processing a HTML
       document.
   --
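
The reservation rule in the comment above can be sketched as a simple
predicate. This is only an illustration in Python; the function name
is_reserved is my own and does not come from the HTML 4.0 draft.

```python
def is_reserved(code):
    """True for positions the quoted comment says must not be generated.

    Hypothetical helper: covers 0000D800 - 0000DFFF and every value of
    the form xxxxFFFE or xxxxFFFF.
    """
    in_surrogate_block = 0xD800 <= code <= 0xDFFF
    last_two_in_plane = (code & 0xFFFF) in (0xFFFE, 0xFFFF)
    return in_surrogate_block or last_two_in_plane
```

A generator would skip any position for which this returns True, and a
consumer would ignore it.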

The meanings of the three columns [let us call them A, B and C] of the
DESCSET are (if you are an SGML expert, please feel free to correct me):

   B characters, starting at offset A in the document character set, are
   defined by B characters, starting at offset C in the base character
   set. Where C is UNUSED, those B positions are not assigned any
   character.
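
Read that way, the DESCSET is just a lookup table. A minimal sketch in
Python, assuming my reading of the columns is right; the names
HTML40_DESCSET and base_position are hypothetical, and None stands in
for UNUSED:

```python
# Rows are (A, B, C) as in the HTML 4.0 draft extract; None means UNUSED.
HTML40_DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 2147483486, 160),
]

def base_position(descset, doc_pos):
    """Return the base-set position for doc_pos, or None if unused."""
    for start, count, base in descset:
        if start <= doc_pos < start + count:
            if base is None:
                return None
            return base + (doc_pos - start)
    return None  # position not described by any row
```

For example, base_position(HTML40_DESCSET, 65) gives 65, while
position 127 comes back as None because that row is UNUSED.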

In the case of HTML 4.0, both the document character set and the base
character set are ISO 10646. The XML spec is confused in that it refers
to UCS-2 as the BASESET, yet speaks of ISO 10646 planes beyond the BMP.
Further confusion is caused by the difference between a Coded Character
Set and a Character Encoding Scheme:

   3.2.1: Coded Character Set

   A Coded Character Set (CCS) is a mapping from a set of abstract
   characters to a set of integers. Examples of coded character sets
   are ISO 10646 [ISO-10646], US-ASCII [ASCII], and ISO-8859 series
   [ISO-8859].

   3.2.2: Character Encoding Scheme

   A Character Encoding Scheme (CES) is a mapping from a Coded Character
   Set or several coded character sets to a set of octets. Examples of
   Character Encoding Schemes are ISO 2022 [ISO-2022] and UTF-8 [UTF-8].
   A given CES is typically associated with a single CCS; for example,
   UTF-8 applies only to ISO 10646.

The above quote is taken from RFC 2130, "The Report of the IAB Character
Set Workshop held 29 February - 1 March, 1996".
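
To make the distinction concrete, here is a small Python sketch. The
helper name octets is hypothetical. Note the assumption: Python offers
UTF-16 and UTF-32 codecs rather than UCS-2 and UCS-4, so I use those as
stand-ins; they give the same octets as UCS-2 and UCS-4 for a BMP
character, and UTF-16 shows the surrogate pair needed beyond the BMP.

```python
def octets(code_point, scheme):
    """Encode one CCS integer into octets under the named CES."""
    return chr(code_point).encode(scheme)

# One integer in the Coded Character Set (U+00E9) ...
E_ACUTE = 0xE9
# ... and one beyond the BMP (first position of plane 1).
PLANE_1_FIRST = 0x10000

# The same integer yields different octets under different schemes.
e_utf8 = octets(E_ACUTE, "utf-8")            # 2 octets
e_two_octet = octets(E_ACUTE, "utf-16-be")   # 2 octets, same as UCS-2 here
e_four_octet = octets(E_ACUTE, "utf-32-be")  # 4 octets, same as UCS-4 here

# Beyond the BMP, the two-octet form needs a surrogate pair.
p1_utf16 = octets(PLANE_1_FIRST, "utf-16-be")
p1_four_octet = octets(PLANE_1_FIRST, "utf-32-be")
```

The integer never changes; only the octets do, which is why the
BASESET ought to name the former and not the latter.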

The BASESET should logically be a Coded Character Set, not a Character
Encoding Scheme. The HTML 2.0 spec contains an example of this:

   CHARSET
         BASESET "ISO 646:1983//CHARSET
                  International Reference Version
                  (IRV)//ESC 2/5 4/0"

         DESCSET 0 9 UNUSED
                  9 2 9
                  11 2 UNUSED
                  13 1 13
                  14 18 UNUSED
                  32 95 32
                  127 1 UNUSED

         BASESET "ISO Registration Number 100//CHARSET
                  ECMA-94 Right Part of
                  Latin Alphabet Nr. 1//ESC 2/13 4/1"

         DESCSET 128 32 UNUSED
                  160 96 32

The second BASESET above is clearly a Coded Character Set, not a
Character Encoding Scheme. The characters in this Coded Character Set
are numbered from 32 (decimal). When this Coded Character Set is made
into a Character Encoding Scheme, character 32 is typically encoded as
160 (decimal).
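
The arithmetic behind "DESCSET 160 96 32" can be sketched as follows.
This is only an illustration in Python; the function and constant names
are hypothetical.

```python
# The row "DESCSET 160 96 32" from the HTML 2.0 declaration above.
DOC_START, COUNT, BASE_START = 160, 96, 32

def ccs_number(octet):
    """Map an 8-bit document position to its number in the base set."""
    if not DOC_START <= octet < DOC_START + COUNT:
        raise ValueError("position not described by this DESCSET row")
    return BASE_START + (octet - DOC_START)

def encoded_octet(number):
    """Inverse: the octet typically used to encode a base-set number."""
    if not BASE_START <= number < BASE_START + COUNT:
        raise ValueError("number outside the 96-character right part")
    return DOC_START + (number - BASE_START)
```

So encoded_octet(32) is 160 and ccs_number(160) is 32, matching the
numbers quoted above.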

At the moment, both HTML 4.0 and XML are using Character Encoding
Schemes in their BASESET declarations: HTML 4.0 is using UCS-4 and XML
is using UCS-2. I am working to get this changed by getting a new
registration into the International Register, which:

   1. corresponds to ISO 10646/Unicode as a Coded Character Set, not
       to any particular Character Encoding Scheme, and

   2. corresponds to ISO 10646/Unicode after Amendments 1-7 and
       includes all future Amendments which add characters but do not
       change, move or remove them.

Finally, an extract from ISO 2375, which governs the International
Register. It sheds light on the possibility of getting an open-ended
registration accepted:

   8 Revision procedure

   8.1 In general no changes to registrations are permitted, ...

   8.2 The Registration Authority may exceptionally grant a waiver to
        international, governmental organisations issuing
        internationally recognised and world-wide implemented standards.
        However, the possibility that a registration may be modified in
        future without allocation of a new escape sequence shall be
        mentioned in the first application papers and in the register.

------------------------------------------------------------------------
Misha Wolf          Email: misha.wolf@reuters.com    85 Fleet Street
Standards Manager   Voice: +44 171 542 6722          London EC4P 4AJ
Reuters Limited     Fax  : +44 171 542 8314          UK
------------------------------------------------------------------------
Eleventh International Unicode Conference, Sep 2-5 1997, www.unicode.org

------------------------------------------------------------------------
Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of
Reuters Ltd.


