> From: Peter_Constable@sil.org [mailto:Peter_Constable@sil.org]
> >> What is confusing is that sometimes "surrogates" refer to
> >> certain code units (for UTF-16) that are reserved as code points,
> >> and sometimes "surrogates" is used to refer to 'characters
> >> on planes 01-10'. I think the latter is a misuse.
> >Good point. In the past, I have used "surrogate characters" to refer to
> >characters encoded above FFFF, and surrogate code units to refer to the
> >units D800-DFFF. However, I think that leads to confusion. Nobody has come
> >up with a good term for all characters above FFFF. "Plane 1-16 characters"
> >is clunky and requires explanation, as does "non-BMP characters". Another
> >possibility is "surrogate-pair characters". My personal favorite is
> >"astral characters" (don't remember who came up with that).
I think this is the only memorable term I've heard so far, which
alone should recommend it.
> We do need to clean up terminology, and we need to do so in a way that
> incorporates understanding of UTR-17. I think we need:
> - BMP characters: characters in the BMP; note that d800-dfff are not
> characters; fffe and ffff are also not characters
Can we take this one further and say "basic characters"? I've got
enough TLAs floating about already... ;-)
> - "astral"/supplementary/extended-plane/?? characters: everything in planes
> 1 - 16 (excluding anything ending in fffe and ffff)
> - codepoint: I'm inclined to use this as an alternate term for Unicode
> Scalar Value; note that by this def'n d800 - dfff, fffe, etc. are *not*
> - code values: integers within the space of some encoding form; d800 - dfff
> *are* code values, but not codepoints
This is, to me, counterintuitive. I would be inclined to say "at
point d800 there is no valid value" rather than vice versa. I would
consider all enumerable integers in the code space to be code points -
whether or not there is anything actually at that point (i.e. fffe and ffff
are valid, but unused, codepoints).
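The distinction between "a point in the code space" and "a point with something at it" can be sketched roughly as below. This is only an illustration in Python 3; the function name `classify` and the category labels are made up here, and the check covers only the ranges named in this thread (d800-dfff and anything ending in fffe or ffff).

```python
def classify(n):
    """Classify an integer against the Unicode code space (hypothetical helper).

    Only the ranges mentioned in this thread are checked.
    """
    if not 0 <= n <= 0x10FFFF:
        return "outside code space"
    if 0xD800 <= n <= 0xDFFF:
        return "surrogate"          # reserved for UTF-16, never a character
    if (n & 0xFFFF) in (0xFFFE, 0xFFFF):
        return "noncharacter"       # xxfffe / xxffff on every plane
    if n <= 0xFFFF:
        return "BMP"                # plane 0
    return "supplementary"          # planes 1-16, the "astral" range


for value in (0x0041, 0xD800, 0xFFFE, 0x10400, 0x110000):
    print(f"{value:#08x}: {classify(value)}")
```

Under my reading, every input except the last is a code point; under the quoted definition, only the "BMP" and "supplementary" results would be.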
On second glance I see that you want to use the word "code" to mean
two things. I suspect such wording will cause the same confusion in others
that it caused in me. Combining your suggestion with mine (in the paragraph
above), I suggest:
code point: integers within the space of some encoding form
code value: the meaning assigned to a code point or code points
Unicode point: a value in the range 0-0x10ffff which may or may not have a
Unicode value assigned
Unicode value: Unicode Scalar Value, a fancy way to say "character"
This way, we say "Unicode" to refer to the CCS and "code" to refer
to the CEF. If we want to go a level higher and talk about the CES, I
suggest "byte points" and "byte values". I believe that we won't wish to
discuss TES in other than abstract fashion.
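To make the three levels concrete, here is a rough sketch that views one character at each of them. It assumes Python 3 and its built-in "utf-16-be" codec; the function name `levels` and the dictionary keys are invented for illustration and simply follow the terms proposed above.

```python
def levels(ch):
    """View one character at the CCS, CEF (UTF-16) and CES (UTF-16BE) levels."""
    data = ch.encode("utf-16-be")
    # Regroup the serialized bytes into 16-bit UTF-16 units.
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return {
        "unicode_point": ord(ch),   # CCS: an integer in 0-0x10ffff
        "code_points": units,       # CEF: 16-bit integers for UTF-16
        "byte_points": list(data),  # CES: the bytes, big-endian here
    }


result = levels("\U00010400")
print(hex(result["unicode_point"]))            # 0x10400
print([hex(u) for u in result["code_points"]])  # ['0xd801', '0xdc00']
print([hex(b) for b in result["byte_points"]])  # ['0xd8', '0x1', '0xdc', '0x0']
```

One Unicode point becomes two code points in UTF-16 and four byte points in UTF-16BE, which is why a single unqualified word for all three invites confusion.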
I admit I am going a little deeper than I am truly familiar with, so
please accept my apologies if I got this all mixed up.
> - surrogate: I'm inclined to say that this should refer *only* to a UTF-16
> code value in the range d800 - dfff; equal to "surrogate code value"
The problem here is a clash with the traditional definition of
"surrogate", which would be much closer to your "surrogate pair" below.
Can't we call these "surrogate prefixes"?
> - surrogate pair: a valid pair of UTF-16 surrogate code values used to
> encode an "astral" character; note that a surrogate pair is distinct
> from the character they encode: surrogates come from the sphere of code
> values, not the sphere of characters/codepoints
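The relationship between an "astral" character and its surrogate pair is plain arithmetic, sketched below in Python 3. The function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical, chosen only for this illustration; the arithmetic itself is the standard UTF-16 mapping.

```python
def to_surrogate_pair(scalar):
    """Split a scalar value above FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only planes 1-16 are encoded as surrogate pairs")
    offset = scalar - 0x10000                 # 20 bits remain
    high = 0xD800 + (offset >> 10)            # top 10 bits -> d800-dbff
    low = 0xDC00 + (offset & 0x3FF)           # low 10 bits -> dc00-dfff
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a valid (high, low) surrogate pair into a scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))                    # 0xd834 0xdd1e
print(hex(from_surrogate_pair(high, low)))    # 0x1d11e
```

The pair (d834, dd1e) lives among the code values, while 1d11e is the character it stands for. Neither surrogate means anything on its own.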