Re: Surrogate space in Unicode

From: DougEwell2@cs.com
Date: Fri Feb 16 2001 - 01:29:25 EST


In a message dated 2001-02-15 15:26:55 Pacific Standard Time, john@nisus.com
writes:

> > At 2001-02-06 07:48:29 -0800 Mark Davis wrote:
> >> At 2001-02-06 01:51 "nikita k" <nikitakin@yahoo.com> wrote:
> >> What is surrogate space in unicode?
>
> (Mark defines various terms relating to 'supplementary' and 'surrogate')
>
> So, I guess it's safe to say that a surrogate code point is
> a surrogate code point... which is a surrogate for a supplementary
> code point, which is a code point between something and something
> else.
>
> Someone needs to take a break from the bureaucrateze and learn
> again how to communicate clearly. Is that not a part of the
> goal, here?

I thought Mark's definitions were both accurate and clear, unlike John's
rejoinder, which was neither.

It has proven difficult to come up with convenient terms for the Unicode
characters encoded at U+10000 and beyond. The term 'surrogate' has been
misused in an attempt to do this. It is important to use consistent terms
that demonstrate an understanding of what is going on.

I am not a member of the Consortium, and certainly would not consider myself
a bureaucrat, so I will take a stab at this in the plainest English I can find
that does not sacrifice accuracy.

1. A Unicode 'code point' is a number between 0 and 1,114,111 inclusive,
usually expressed in hexadecimal (U+0000 through U+10FFFF). Not every code
point necessarily represents a valid character, although most do. For
example, there is no character encoded at U+FFFF.

2. A 'basic' code point, which may represent a 'basic character', can range
from U+0000 through U+FFFF. The remaining code points (U+10000 through
U+10FFFF) are 'supplementary' code points, each of which may represent a
'supplementary character'.

3. 'Surrogate' code points range from U+D800 through U+DFFF (the range ends
at U+DFFF, not U+DC00). They do not directly represent characters (so there
is no such thing as a
'surrogate character'), but two of them may be used together according to the
rules of UTF-16 to represent a supplementary character. The two surrogate
code points used for this purpose would be called a 'surrogate pair'. Don't
separate them.
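
For anyone who finds arithmetic easier to follow than definitions, here is a
rough sketch of the three points above in Python. The function names are my
own invention for illustration and are not part of any standard API. Note
that the surrogate code points sit numerically inside the basic range; the
sketch reports them separately only to make point 3 visible. The pair
arithmetic is the usual UTF-16 rule: subtract 0x10000, then split the
remaining 20 bits into two 10-bit halves.

```python
# Illustrative sketch only; all names here are hypothetical.
MAX_CODE_POINT = 0x10FFFF  # 1,114,111 decimal


def classify(cp: int) -> str:
    """Label a code point as 'surrogate', 'basic', or 'supplementary'."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Numerically within the basic range, but never a character itself.
        return "surrogate"
    if cp <= 0xFFFF:
        return "basic"
    return "supplementary"


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into two surrogate code points."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not a supplementary code point: {cp:#x}")
    offset = cp - 0x10000             # 20-bit value, 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)    # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, to_surrogate_pair(0x10000) gives (0xD800, 0xDC00), and
to_surrogate_pair(0x10FFFF) gives (0xDBFF, 0xDFFF), which is why the whole
U+D800 through U+DFFF range is needed. A lone half has no meaning, since
from_surrogate_pair needs both.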

Is that better?

-Doug Ewell
 Fullerton, California


