RE: Surrogate support in *ML?

From: Karlsson Kent - keka (keka@im.se)
Date: Thu Sep 07 2000 - 11:19:42 EDT


> From: Brendan Murray/DUB/Lotus [mailto:Brendan_Murray@Lotus.com]
...
> Karlsson Kent - keka <keka@im.se> wrote:
> > At the level of XML the number of bits is irrelevant.
> > The "high and low surrogate" code points are excluded
> > from being used as NCRs. A character (not UTF-16 code
> > units) can be referenced by NCRs. See (XML) production 66
> > (CharRef) and its well-formedness constraint (and
> > production 2 (Char), though they failed to exclude a number
> > of other non-character code points in that production).
>
> I know that XML explicitly excludes surrogates. My question really refers
> to what one can do to encode the non-BMP data in the new Han unification
> data that will become part of 10646 and Unicode in the not too distant
> future: is this huge block of characters regarded as irrelevant, or has
> anyone proposed an encoding that can be used?

What was apparently not clear enough from my answer is that
in an NCR you refer to the character by its code point. Thus,
assuming the following example characters are approved and stay at
the currently suggested code points, &#x10330; will refer
to GOTHIC LETTER AHSA in plane 1, &#x2A718; will refer to
CJK UNIFIED IDEOGRAPH-2A718 (which is in extension B on
plane 2), and so on.
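
To illustrate the point, here is a minimal sketch using Python's standard
xml.etree.ElementTree module (my choice for illustration, nothing in the XML
spec depends on it). The helper name resolve_ncrs is hypothetical. It parses
a tiny document and reports the code points the parser hands back; a
conforming parser should yield one character per NCR for the two examples
above, and should reject an attempt to spell a supplementary character as
two surrogate NCRs.

```python
import xml.etree.ElementTree as ET


def resolve_ncrs(fragment):
    """Parse a tiny XML document and return the code points of its text.

    Returns None if the parser rejects the document as not well-formed.
    (resolve_ncrs is a hypothetical helper, used only for this sketch.)
    """
    try:
        root = ET.fromstring("<doc>" + fragment + "</doc>")
    except ET.ParseError:
        return None
    return [ord(ch) for ch in (root.text or "")]


# One NCR per character, each giving the code point directly.
supplementary = resolve_ncrs("&#x10330;&#x2A718;")

# Two NCRs that name surrogate code points: not allowed by CharRef.
surrogate_pair = resolve_ncrs("&#xD800;&#xDF30;")

print([hex(cp) for cp in supplementary])
print(surrogate_pair)
```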

This should be clear from (XML) production 66 (CharRef)
and its well-formedness constraint, which refers to
(XML) production 2 (Char), which in turn does include planes
01-10 (hex) (even though that production mistakenly admits
32 not-a-character code points, xxFFFE and xxFFFF, on the
supplementary planes).
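
As a sketch of what production 2 allows, the ranges of Char in XML 1.0 can
be written out as a predicate. The function names below are hypothetical and
only for illustration. The last lines count the plane-final code points
(xxFFFE and xxFFFF) on planes 01-10 that the production lets through.

```python
def is_xml_char(cp):
    """True if cp matches XML 1.0 production 2 (Char)."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF        # BMP below the surrogate range
        or 0xE000 <= cp <= 0xFFFD      # BMP above it, minus FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF   # planes 01-10 (hex), without exceptions
    )


def is_plane_final_noncharacter(cp):
    """True for code points whose low 16 bits are FFFE or FFFF."""
    return (cp & 0xFFFE) == 0xFFFE


# Not-a-character code points on the supplementary planes that Char admits.
admitted = [
    cp
    for cp in range(0x10000, 0x110000)
    if is_plane_final_noncharacter(cp) and is_xml_char(cp)
]

print(len(admitted))
print(is_xml_char(0xD800), is_xml_char(0x10330))
```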

In addition, XML processors must 'support' both UTF-8 and
UTF-16 (not just UCS-2). However, independently of document
encoding, character references (CharRef) always refer to UCS
code points (a.k.a. scalar values), not (UTF-16, UTF-8, or other)
code units.
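
A small sketch of the difference, again in Python and with hypothetical
helper names: the same NCR is parsed from a UTF-8 and from a UTF-16 encoded
document, and the resulting character is then shown as code units in each
encoding form. The NCR stays &#x10330; in both documents; only the code
units used to store the character differ.

```python
import struct
import xml.etree.ElementTree as ET

# The NCR is identical whatever encoding the document declares.
DOC = '<?xml version="1.0" encoding="{enc}"?><doc>&#x10330;</doc>'


def parse_in(encoding):
    """Serialize DOC in the given encoding and return the parsed text."""
    raw = DOC.format(enc=encoding).encode(encoding)
    return ET.fromstring(raw).text


def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def utf8_code_units(text):
    """Return the 8-bit code units of text in UTF-8."""
    return list(text.encode("utf-8"))


from_utf8 = parse_in("UTF-8")
from_utf16 = parse_in("UTF-16")

print(from_utf8 == from_utf16, hex(ord(from_utf8)))
print([hex(u) for u in utf16_code_units(from_utf8)])
print([hex(u) for u in utf8_code_units(from_utf8)])
```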

What is confusing is that "surrogates" sometimes refers to
certain code units (for UTF-16) whose values are reserved as code
points, and sometimes is used to refer to 'characters
on planes 01-10'. I think the latter is a misuse.

        /kent k


