Re: Why are the low surrogates numerically larger than the high surrogates? from Markus Scherer on 2013-01-23 (Unicode Mail List Archive)

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Wed, 23 Jan 2013 10:04:13 -0800

On Wed, Jan 23, 2013 at 9:45 AM, Costello, Roger L. <costello_at_mitre.org> wrote:

> Hi Folks,
>
> The book Unicode Demystified says this (page 190, first paragraph):
>
> The surrogate range is divided in half.
> The range from U+D800 to U+DBFF contains
> the "high surrogates," and the range from
> U+DC00 to U+DFF contains the "low surrogates."
>
> Why are the low surrogates numerically larger than the high surrogates?
>
> That is, why isn't U+D800 to U+DBFF called the low surrogates and U+DC00
> to U+DFF called the high surrogates?
>

The high surrogates contain the high-order bits of the code point, and the
low surrogates contain the low-order bits.
(The last one is U+DFFF not U+DFF of course.)
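
To make the bit layout concrete, here is a minimal Python sketch of the UTF-16 arithmetic. The function names to_surrogates and from_surrogates are made up for illustration and are not taken from any library.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000             # 20-bit value
    high = 0xD800 + (v >> 10)    # high-order 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (v & 0x3FF)   # low-order 10 bits  -> U+DC00..U+DFFF
    return high, low


def from_surrogates(high, low):
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print(f"U+{high:04X} U+{low:04X}")
```

So "high" and "low" describe which bits of the code point each unit carries, not where the unit's own value sits in the D800..DFFF block.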

> In the Unicode Technical Report #36, Unicode Security Considerations [1] it
> says:
>
> PEP 383 takes this approach. It enables lossless
> conversion to Unicode by converting all "unmappable"
> sequences to a sequence of one or more isolated
> high surrogate code points. That is, each unmappable
> byte's value is a code point whose value is 0xDC00
> plus byte value.
>
> Notice "high surrogate" in that quote. I'm confused. I thought the low
> surrogate range started at 0xDC00, but this document is saying that 0xDC00
> + byte value = high surrogate. Is that a typo in the document?
>

Yes, that looks wrong. I don't know which range PEP 383 actually uses.
Please submit a bug report via http://www.unicode.org/reporting.html
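
One way to check the range, assuming Python 3, where PEP 383 is exposed as the "surrogateescape" error handler. The sample bytes are arbitrary and chosen only for illustration.

```python
# An invalid UTF-8 byte (0xFF) after some ASCII; the input is arbitrary.
raw = b"abc\xff"

# Decode with the PEP 383 error handler instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Code point that the unmappable byte was turned into.
escaped = ord(text[-1])

# True if the escaped code point is in the low surrogate range U+DC00..U+DFFF.
is_low_surrogate = 0xDC00 <= escaped <= 0xDFFF

# Encoding with the same handler should restore the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    print(f"U+{escaped:04X}", is_low_surrogate, round_trip == raw)
```

If the escaped code point equals 0xDC00 plus the byte value, it is a low surrogate, and the text in the report should say "low" rather than "high".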

markus
Received on Wed Jan 23 2013 - 12:07:32 CST
