Re: surrogate terminology

From: Peter_Constable@sil.org
Date: Wed Sep 13 2000 - 14:10:53 EDT


On 09/13/2000 01:47:57 AM Mark Davis wrote:

>Not all code points are assigned (or even assignable) to characters. U+xxxxxx
>is used to refer to code points, which range from 0 to 10FFFF. Of these code
>points, some are assigned to characters (including regular characters, control
>characters, format characters, and private use characters [whose interpretation
>is a matter of private agreement]), some are assigned to noncharacters (e.g.
>U+FFFF), some are assigned to surrogate area code points (U+D800..U+DFFF), and
>some are as yet unassigned (e.g. U+20B0). You will see examples of this usage
>of U+ all throughout the Unicode Standard.

[snip]

>You are absolutely right that no one should be speaking of surrogate area code
>points as "characters"....

[snip]

>People do use the term "character" ambiguously to refer to any of a number of
>very different entities: abstract characters but also graphemes, glyphs, code
>points, code units, bytes, etc. To avoid confusion, the broad and misleading
>uses of the term "character" should be avoided; or at least one should clarify
>which sense one is using when not absolutely obvious from the context.

The main concern I have in mind is that people get confused by thinking of
Unicode as a uniformly 16-bit encoding standard, but then having to
understand "surrogate characters", which have generally been described as
characters represented using a special pair of code points. But then it's
easy to also get confused as to whether those special code points
individually correspond to characters or not. Talking about a
supplementary-plane character in terms of U+d800 U+dc00 doesn't make this
as clear as it could be. I'd suggest that U+0000 - U+10FFFF should refer to
Unicode scalar values in the space of a CCS, in which case U+d800 and
U+dc00 are unused. But if we're talking about the space of data values used
within UTF-16 where these need to be distinguished from USVs, then use
0xd800 notation. It's a subtle point, but I think it would be helpful
precisely in helping people understand the relationship between USVs, the
UTF-16 encoding form, and surrogates in particular.
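
To make the distinction concrete, here is a minimal sketch in Python of how
a Unicode scalar value maps to UTF-16 code units. The function names
(to_utf16_code_units, from_surrogate_pair) are hypothetical, chosen only for
illustration; the arithmetic is the standard UTF-16 surrogate calculation.
The sketch follows the notation suggested above: scalar values are what one
would write as U+..., and the 16-bit data values are written as 0x....

```python
from typing import List


def to_utf16_code_units(usv: int) -> List[int]:
    """Map a Unicode scalar value (USV) to its UTF-16 code units."""
    # 0xD800..0xDFFF are excluded: they are not scalar values, only
    # 16-bit data values that appear inside UTF-16.
    if not 0 <= usv <= 0x10FFFF or 0xD800 <= usv <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {usv:#x}")
    if usv < 0x10000:
        # BMP scalar values are represented by a single code unit.
        return [usv]
    # Supplementary-plane scalar values are split into a 20-bit offset,
    # ten bits per surrogate code unit.
    offset = usv - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


def from_surrogate_pair(high: int, low: int) -> int:
    """Recover the scalar value from a high/low surrogate code unit pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The scalar value U+10000 is carried in UTF-16 as the data values
# 0xd800 0xdc00; neither data value is itself a scalar value.
units = to_utf16_code_units(0x10000)
print([f"0x{u:04x}" for u in units])
```

In this sketch the pair 0xd800 0xdc00 is only ever a UTF-16 representation
of U+10000, and passing 0xd800 on its own to to_utf16_code_units raises an
error, which is the sense in which those values are unused as scalar values.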

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


