RE: surrogate terminology (was Re: Surrogate support in *ML?)

From: Ayers, Mike (Mike_Ayers@bmc.com)
Date: Tue Sep 12 2000 - 12:55:07 EDT


> From: Peter_Constable@sil.org [mailto:Peter_Constable@sil.org]

>
> >> What is confusing is that sometimes "surrogates" refer to
> >> certain code units (for UTF-16) that are reserved as code points,
> >> and sometimes "surrogates" is used to refer to 'characters
> >> on planes 01-10'. I think the latter is a misuse.
>
> >Good point. In the past, I have used "surrogate characters" to refer
> >to the characters encoded above FFFF, and surrogate code units to
> >refer to the UTF-16 units D800-DFFF. However, I think that leads to
> >confusion. Nobody has come up with a good term for all characters
> >above FFFF. "Plane 1-16 characters" is clunky and requires
> >explanation, as does "non-BMP characters". Another possibility is
> >"surrogate-pair characters". My personal favorite is "astral
> >characters" (don't remember who came up with that).

        I think this is the only memorable term I've heard so far, which
alone should recommend it.

> We do need to clean up terminology, and we need to do so in a way that
> incorporates understanding of UTR-17. I think we need:
>
> - BMP characters: characters in the BMP; note that d800-dfff are not
> characters; fffe and ffff are also not characters

        Can we take this one further and say "basic characters"? I've got
enough TLAs floating about already... ;-)

> - "astral"/supplementary/extended-plane/?? characters: everything in
> planes 1 - 16 (excluding anything ending in fffe and ffff)

        "high characters"?

> - codepoint: I'm inclined to use this as an alternate term for Unicode
> Scalar Value; note that by this def'n d800 - dfff, fffe, etc. are
> *not* codepoints
> - code values: integers within the space of some encoding form;
> d800 - dfff *are* code values, but not codepoints

        This is, to me, counterintuitive. I would be inclined to say "at
point d800 there is no valid value" rather than vice versa. I would
consider all enumerable integers in the code space to be code points -
whether or not there is anything actually at that point (i.e. fffe and ffff
are valid, but unused, code points).
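
        A minimal Python sketch of that reading, in which every integer in
the code space counts as a point whether or not a character sits there. The
function names and the wording of the labels are invented for illustration;
only the ranges (0-0x10ffff, d800-dfff, and values ending in fffe or ffff)
come from this thread.

```python
# Illustrative only: the function names are made up for this sketch.
MAX_UNICODE_POINT = 0x10FFFF


def is_unicode_point(n: int) -> bool:
    """Any integer in the code space 0-0x10ffff, assigned or not."""
    return 0 <= n <= MAX_UNICODE_POINT


def is_surrogate_value(n: int) -> bool:
    """d800-dfff: reserved for UTF-16, never a character."""
    return 0xD800 <= n <= 0xDFFF


def ends_in_fffe_or_ffff(n: int) -> bool:
    """fffe and ffff on any plane: valid points that are never characters."""
    return is_unicode_point(n) and (n & 0xFFFF) >= 0xFFFE


def describe(n: int) -> str:
    """Label an integer according to the reading sketched above."""
    if not is_unicode_point(n):
        return "outside the code space"
    if is_surrogate_value(n):
        return "point reserved for surrogates"
    if ends_in_fffe_or_ffff(n):
        return "point that is never a character"
    return "point that may hold a character"
```

Under this sketch, describe(0xD800) and describe(0xFFFE) both report a
point, just not one that can hold a character.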

        On second glance I see that you want to use the word "code" to mean
two things. I suspect such wording will cause the same confusion in others
that it caused in me. Combining your suggestion with mine (in the paragraph
above), I suggest:

code point: integers within the space of some encoding form
code value: the meaning assigned to a code point or code points
Unicode point: a value in the range 0-0x10ffff which may or may not have a
meaning assigned
Unicode value: Unicode Scalar Value, a fancy way to say "character"

        This way, we say "Unicode" to refer to the CCS (coded character set)
and "code" to refer to the CEF (character encoding form). If we want to go a
level higher and talk about the CES (character encoding scheme), I suggest
"byte points" and "byte values". I believe that we won't wish to discuss the
TES (transfer encoding syntax) in other than abstract fashion.
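
        To make the three levels concrete, here is a rough Python sketch
that views one character at each of them, using UTF-16 as the encoding form.
The function name levels() is hypothetical; ord(), str.encode() and
struct.unpack() are standard Python.

```python
import struct


def levels(ch: str):
    """Return one character as seen at the CCS, CEF and CES levels."""
    # CCS level: the "Unicode point", an integer in 0-0x10ffff.
    unicode_point = ord(ch)

    # CES level: serialized bytes, big-endian and little-endian.
    bytes_be = ch.encode("utf-16-be")
    bytes_le = ch.encode("utf-16-le")

    # CEF level: the 16-bit UTF-16 units, recovered from the
    # big-endian bytes two at a time.
    unit_count = len(bytes_be) // 2
    code_units = list(struct.unpack(">%dH" % unit_count, bytes_be))

    return unicode_point, code_units, bytes_be, bytes_le
```

For "A" the three views nearly coincide (0x41, one unit 0x0041, bytes 00 41
or 41 00). For a character above ffff, such as U+10400, there is one Unicode
point, two 16-bit units, and four bytes whose order depends on the scheme.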

        I admit I am going a little deeper than I am truly familiar with, so
please accept my apologies if I got this all mixed up.

> - surrogate: I'm inclined to say that this should refer *only* to a
> UTF-16 code value in the range d800 - dfff; equal to "surrogate code
> value"

        The problem here is a clash with the traditional definition of
"surrogate", which would be much closer to your "surrogate pair" below.
Can't we call these "surrogate prefixes"?

> - surrogate pair: a valid pair of UTF-16 surrogate code values used to
> encode an "astral" character; note that a surrogate pair is
> *different* from the character they encode: surrogates come from the
> sphere of code values, not the sphere of characters/codepoints
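
        The difference between a surrogate pair and the character it
encodes can be sketched in a few lines of Python. The function names are
hypothetical; the arithmetic is the standard UTF-16 mapping between a value
in 0x10000-0x10ffff and two 16-bit values in d800-dbff and dc00-dfff.

```python
def to_surrogate_pair(point: int) -> tuple[int, int]:
    """Map a value above ffff to its (high, low) UTF-16 surrogate values."""
    if not 0x10000 <= point <= 0x10FFFF:
        raise ValueError("only values in 0x10000-0x10ffff use a pair")
    offset = point - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits, in d800-dbff
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits, in dc00-dfff
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Inverse mapping; rejects anything that is not a valid pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, to_surrogate_pair(0x10400) gives (0xD801, 0xDC00): two code
values from the d800-dfff range standing in for one character above ffff.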

/|/|ike


