Re: surrogate terminology

From: Mark Davis (markdavis@ispchannel.com)
Date: Wed Sep 13 2000 - 02:59:44 EDT


Not all code points are assigned (or even assignable) to characters. U+xxxxxx
is used to refer to code points, which range from 0 to 10FFFF. Of these code
points, some are assigned to characters (including regular characters, control
characters, format characters, and private use characters [whose interpretation
is a matter of private agreement]), some are assigned to noncharacters (e.g.
U+FFFF), some are assigned to surrogate area code points (U+D800..U+DFFF), and
some are as yet unassigned (e.g. U+20B0). You will see examples of this usage
of U+ throughout the Unicode Standard.
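
To make the categories above concrete, here is a minimal Python sketch that
sorts any code point into one of them. The function name classify and the
category labels are made up for illustration. The noncharacter test follows
current versions of the standard (U+FDD0..U+FDEF plus the last two code points
of every plane), and the answer for "unassigned" depends on whichever Unicode
version your Python's unicodedata module ships with.

```python
import unicodedata


def classify(cp: int) -> str:
    """Return a rough category label for a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Surrogate area code points: never assignable to characters.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF and U+xxFFFE / U+xxFFFF in every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        # Not assigned in the Unicode version known to this Python build.
        return "unassigned"
    if category == "Co":
        return "private use"
    return "assigned"


for cp in (0x0041, 0xE000, 0xFFFF, 0xD800, 0x20B0):
    print(f"U+{cp:04X}: {classify(cp)}")
```

Note that U+20B0 was unassigned when this message was written but has since
been assigned, so a current Python reports it as "assigned". Every value the
loop prints is still a code point, whatever its category.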

People may use "U+2035" to refer to a character. In that case, it is understood
as referring to the abstract character that Unicode associates with that code
point. If I say "the character U+20B0", then I am, strictly speaking, in error,
since there is no character associated with that code point. It is a bit like
saying "the present king of France". I may be speaking loosely of the character
which is proposed for that code point in
(http://www.unicode.org/unicode/alloc/Pipeline.html), the GERMAN PENNY SYMBOL.

You are absolutely right that no one should be speaking of surrogate area code
points as "characters". They are not assigned to characters, and will never
be. The surrogate area code points are special -- they cannot be assigned to
characters, and their only use is to be reserved so that the corresponding code
units can be used in UTF-16 in pairs as a representation of the supplementary
characters (using that term for characters assigned to code points above U+FFFF).
They are, however, still code points.
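
As a sketch of how the code units corresponding to the surrogate area are used
in pairs, the following Python computes the UTF-16 pair for a supplementary
code point and compares it with what Python's own encoder produces. The helper
names (to_surrogate_pair, utf16_units, encodes_as_utf8) are hypothetical; the
arithmetic is the standard UTF-16 mapping.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Map a supplementary code point to its (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of a string as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def encodes_as_utf8(cp: int) -> bool:
    """Report whether Python will encode this single code point as UTF-8."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


high, low = to_surrogate_pair(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")
print(utf16_units("\U0001D11E") == [high, low])
print(encodes_as_utf8(0xD800))
```

The last call illustrates the point in practice: Python lets you hold U+D800
in a string as a code point, but refuses to encode it on its own, because no
character is assigned there.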

People do use the term "character" ambiguously to refer to any of a number of
very different entities: abstract characters but also graphemes, glyphs, code
points, code units, bytes, etc. To avoid confusion, the broad and misleading
uses of the term "character" should be avoided; or at least one should clarify
which sense one is using when it is not absolutely obvious from the context.
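
A small Python illustration of how far apart these senses can be, using one
string counted three ways. The function name count_senses is made up for the
example, and the sample string (a letter, a combining accent, and a
supplementary character) is my own choice.

```python
def count_senses(text: str) -> dict[str, int]:
    """Count one string as code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code points": len(text),
        "UTF-16 code units": len(text.encode("utf-16-le")) // 2,
        "UTF-8 bytes": len(text.encode("utf-8")),
    }


# U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT,
# U+1D11E MUSICAL SYMBOL G CLEF
sample = "e\u0301\U0001D11E"
print(count_senses(sample))
# {'code points': 3, 'UTF-16 code units': 4, 'UTF-8 bytes': 7}
```

A reader would likely see two graphemes here, so "how many characters?" has at
least four defensible answers for this one string.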

Mark

Peter_Constable@sil.org wrote:

> On 09/12/2000 02:59:38 PM Kenneth Whistler wrote:
>
> [snip]
>
> I think Ken's comments on planes are good.
>
> >3. The term "surrogate character" should be eschewed altogether, because
> > of the confusion it causes. "Surrogate code point" can continue to
> > be used as it currently is, and the term "surrogate pair" is also
> > useful. But the other terminology related to characters...
>
> The other terminology Ken discussed had to do with the plane in which a
> character is found. What I think is still open is how d800 - dfff get
> referred to. Ken indicated that "surrogate code point" can continue in use
> as is; I don't recall exactly how TUS 3.0 uses it. (Would have made for a
> rather challenging trivia question :-) My biggest concern here is that
> people should not be referring to U+d800 - U+dfff as characters. (I'd be
> willing to accept code point, provided there is a clear statement as to
> what is meant by a code point.) For that matter, I'd be inclined to say
> that the U+ notation should not be used here - U+ should be reserved for
> use to refer to encoded characters in terms of their Unicode scalar values.
> So, 0xd800 is OK, but U+d800 would be wrong.
>
> - Peter
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:13 EDT