surrogate terminology (was Re: Surrogate support in *ML?)

From: Peter_Constable@sil.org
Date: Tue Sep 12 2000 - 11:57:43 EDT


>> What is confusing is that sometimes "surrogates" refer to
>> certain code units (for UTF-16) that are reserved as code points,
>> and sometimes "surrogates" is used to refer to 'characters
>> on planes 01-10'. I think the latter is a misuse.

>Good point. In the past, I have used "surrogate characters" to refer to the
>characters encoded above FFFF, and surrogate code units to refer to the UTF-16
>units D800-DFFF. However, I think that leads to confusion. Nobody has come up
>with a good term for all characters above FFFF. "Plane 1-16 characters" is
>clunky and requires explanation, as does "non-BMP characters". Another
>possibility is "surrogate-pair characters". My personal favorite is "astral
>characters" (don't remember who came up with that).

We do need to clean up terminology, and we need to do so in a way that
incorporates understanding of UTR-17. I think we need:

- BMP characters: characters in the BMP; note that d800-dfff are not
characters; fffe and ffff are also not characters
- "astral"/supplementary/extended-plane/?? characters: everything in planes
1 - 16 (excluding anything ending in fffe or ffff)
- codepoint: I'm inclined to use this as an alternate term for Unicode
Scalar Value; note that by this def'n d800 - dfff, fffe, etc. are *not*
codepoints
- code values: integers within the space of some encoding form; d800 - dfff
*are* code values, but not codepoints
- surrogate: I'm inclined to say that this should refer *only* to a UTF-16
code value in the range d800 - dfff; equal to "surrogate code value"
- surrogate pair: a valid pair of UTF-16 surrogate code values used to
encode an "astral" character; note that a surrogate pair is *different*
from the character it encodes: surrogates come from the sphere of code
values, not the sphere of characters/codepoints
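
To make the distinctions in the list above concrete, here is a rough sketch
in Python. It follows the definitions as proposed in this message only, not
any official terminology, and every function name in it is made up for
illustration. The pair arithmetic is the usual UTF-16 calculation.

```python
# Sketch of the proposed terminology. All names are hypothetical.

def is_surrogate(code_value: int) -> bool:
    """A UTF-16 code value in the range d800 - dfff."""
    return 0xD800 <= code_value <= 0xDFFF


def ends_in_fffe_or_ffff(value: int) -> bool:
    """True for xxfffe / xxffff on any plane (not characters in this scheme)."""
    return (value & 0xFFFF) in (0xFFFE, 0xFFFF)


def is_bmp_character(value: int) -> bool:
    """In the BMP, excluding d800 - dfff, fffe and ffff."""
    return (0x0000 <= value <= 0xFFFF
            and not is_surrogate(value)
            and not ends_in_fffe_or_ffff(value))


def is_astral_character(value: int) -> bool:
    """In planes 1 - 16, excluding anything ending in fffe or ffff."""
    return 0x10000 <= value <= 0x10FFFF and not ends_in_fffe_or_ffff(value)


def to_surrogate_pair(scalar: int) -> tuple[int, int]:
    """Map a plane 1 - 16 scalar value to its two UTF-16 code values."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only values above ffff are encoded as a pair")
    offset = scalar - 0x10000            # 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Map a valid (high, low) pair of code values back to the scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, to_surrogate_pair(0x1D11E) gives the code values d834 and dd1e.
Each of those is a surrogate (a code value), while 1d11e itself is an
"astral" character and not a surrogate: the pair and the character it
encodes live in different spheres.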

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


