Re: Surrogate support in *ML?

From: Mark Davis (markdavis@ispchannel.com)
Date: Fri Sep 08 2000 - 10:10:52 EDT


Good point. In the past, I have used "surrogate characters" to refer to the
characters encoded above FFFF, and surrogate code units to refer to the UTF-16
units D800-DFFF. However, I think that leads to confusion. Nobody has come up
with a good term for all characters above FFFF. "Plane 1-16 characters" is
clunky and requires explanation, as does "non-BMP characters". Another
possibility is "surrogate-pair characters". My personal favorite is "astral
characters" (don't remember who came up with that).
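
To make the two senses concrete, here is a minimal Python sketch of how a
character above FFFF maps to a pair of UTF-16 code units in the D800-DFFF
range. The function names are hypothetical, chosen only for illustration; the
arithmetic is the standard UTF-16 mapping.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits    -> D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> DC00-DFFF
    return high, low


def is_surrogate_code_unit(unit):
    """True for the UTF-16 code units D800-DFFF, false for everything else."""
    return 0xD800 <= unit <= 0xDFFF


high, low = to_surrogate_pair(0x10330)
print(f"{high:04X} {low:04X}")  # D800 DF30
```

The character itself (10330) is a single code point; only its UTF-16
representation involves the two surrogate code units.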

Mark

Karlsson Kent - keka wrote:

> > From: Brendan Murray/DUB/Lotus [mailto:Brendan_Murray@Lotus.com]
> ...
> > Karlsson Kent - keka <keka@im.se> wrote:
> > > At the level of XML the number of bits is irrelevant.
> > > The "high and low surrogate" code points are excluded
> > > from being used as NCRs. A character (not UTF-16 code
> > > units) can be referenced by NCRs. See (XML) production 66
> > > (CharRef) and its well-formedness constraint (and
> > > production 2 (Char), though they failed to exclude a number
> > > of other non-character code points in that production).
> >
> > I know that XML explicitly excludes surrogates. My question really refers
> > to what one can do to encode the non-BMP data in the new Han unification
> > data that will become part of 10646 and Unicode in the not too distant
> > future: is this huge block of characters regarded as irrelevant, or has
> > anyone proposed an encoding that can be used?
>
> What was apparently not clear enough from my answer is that
> you refer to the code point of the character. Thus,
> assuming the following example characters pass and stay at
> the currently suggested code points, &#x10330; will refer
> to GOTHIC LETTER AHSA in plane 1, &#x2A718; will refer to
> CJK UNIFIED IDEOGRAPH-2A718 (which is in extension B on
> plane 2), and so on.
>
> This should be clear from (XML) production 66 (CharRef)
> and its well-formedness constraint, which refers to
> (XML) production 2 (Char), which in turn does include planes
> 01-10 (hex) (even though that production mistakenly includes
> 32 not-a-character code points on the supplementary planes).
>
> In addition, XML processors must 'support' both UTF-8 and
> UTF-16 (not just UCS-2). However, independently of document
> encoding, character references (CharRef) always refer to UCS
> code points (a.k.a. scalar values), not (UTF-16, UTF-8, or other)
> code units.
>
> What is confusing is that sometimes "surrogates" refers to
> certain code units (for UTF-16) that are reserved as code points,
> and sometimes "surrogates" is used to refer to 'characters
> on planes 01-10'. I think the latter is a misuse.
>
> /kent k
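
The point about character references can be checked with a small Python
sketch. It assumes the standard library's xml.etree.ElementTree (backed by
expat) as the XML processor; the helper names are hypothetical. A numeric
character reference resolves to one code point regardless of how many code
units an encoding later needs, and a reference to a surrogate code point is
rejected as not well-formed.

```python
import xml.etree.ElementTree as ET


def resolve_char_ref(ref):
    """Parse a one-element document whose content is the given reference."""
    return ET.fromstring("<t>" + ref + "</t>").text


def is_well_formed(ref):
    """Report whether the parser accepts the reference as element content."""
    try:
        resolve_char_ref(ref)
        return True
    except ET.ParseError:
        return False


gothic = resolve_char_ref("&#x10330;")       # GOTHIC LETTER AHSA
print(len(gothic), hex(ord(gothic)))         # one character, code point 0x10330
print(len(gothic.encode("utf-16-be")) // 2)  # UTF-16 code units needed
print(len(gothic.encode("utf-8")))           # UTF-8 code units needed

# Surrogate code points are excluded from Char, so neither a lone
# surrogate nor a "pair" of references is accepted.
print(is_well_formed("&#xD800;"))
print(is_well_formed("&#xD800;&#xDF30;"))
```

Writing the pair D800 DF30 as two references is therefore not a way to
reference the character; the single reference to 10330 is.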



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:13 EDT