FW: Why are the low surrogates numerically larger than the high surrogates?

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Wed, 23 Jan 2013 19:13:19 +0000

-----Original Message-----
From: ken.whistler_at_sap.com
Sent: Wednesday, January 23, 2013 10:48 AM
To: 'Costello, Roger L.'
Subject: RE: Why are the low surrogates numerically larger than the high surrogates?

 
> Why are the low surrogates numerically larger than the high surrogates?
>
> That is, why isn't U+D800 to U+DBFF called the low surrogates and U+DC00 to
> U+DFFF called the high surrogates?

The terminology comes from an analogy with the terms "high byte" and "low byte" as they are used for integers in computer science.

For example, in the 16-bit value 0x03FF, 0x03 would be considered the high byte, i.e., the more significant byte of a two-byte, 16-bit integer, while 0xFF would be considered the low byte, i.e., the less significant byte of the integer. It doesn't matter that 0x03 < 0xFF; in this integer, 0x03 is still the *high* byte, because it is the more significant of the two bytes.
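
To make the byte analogy concrete, here is a small Python sketch. The variable names are only illustrative; the point is that "high" refers to position in the integer, not to the magnitude of the byte.

```python
# Split the 16-bit value 0x03FF into its two bytes.
value = 0x03FF

high_byte = (value >> 8) & 0xFF   # more significant byte: 0x03
low_byte = value & 0xFF           # less significant byte: 0xFF

# The high byte is numerically smaller than the low byte here,
# yet it still carries the more significant bits.
assert high_byte < low_byte
assert (high_byte << 8) | low_byte == value
```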

For Unicode code points expressed in UTF-16 with surrogate pairs, in a sequence <0xD800, 0xDFFF>, 0xD800 is the "high surrogate", because it stands for the more significant part of the code point value, while 0xDFFF is the "low surrogate", because it stands for the less significant part of the code point value. Again, it doesn't matter that 0xD800 < 0xDFFF for this determination.

Unlike the high byte and low byte of a 16-bit value, the "surrogates" in UTF-16 really stand for the more significant and less significant 10-bit parts of a 20-bit number (which is then shifted one plane by adding 0x10000). There wasn't any widely accepted term like "byte" to stand for these 10-bit parts, and for whatever reason, the term "surrogate" was coined and used for them instead, specifically in the context of UTF-16.
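
The arithmetic described above can be sketched in Python as follows. The function names combine_surrogates and split_code_point are made up for this illustration and do not come from any library. The high surrogate supplies the upper 10 bits, the low surrogate supplies the lower 10 bits, and 0x10000 is added to the result.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a code point (illustrative helper)."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: 0x%04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: 0x%04X" % low)
    # High surrogate: more significant 10 bits; low surrogate: less significant 10 bits.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Split a supplementary code point into (high, low) surrogates (illustrative helper)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: 0x%X" % cp)
    offset = cp - 0x10000            # the 20-bit number
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The pair <0xD800, 0xDFFF> from the example above maps to U+103FF.
assert combine_surrogates(0xD800, 0xDFFF) == 0x103FF
assert split_code_point(0x103FF) == (0xD800, 0xDFFF)
```

In that pair the high surrogate contributes all-zero upper bits and the low surrogate contributes all-one lower bits, which is why the result is 0x10000 + 0x3FF.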

It may seem confusing, because in Unicode, high surrogate code units, considered on their own, always have smaller values than low surrogate code units, a design point that was put in place to prevent overlap in UTF-16. But if you just mentally convert "high" to "more significant" and "low" to "less significant", then the confusion should go away.

>
> In the Unicode Technical Report #36, Unicode Security Considerations [1] it
> says:
>
> PEP 383 takes this approach. It enables lossless
> conversion to Unicode by converting all "unmappable"
> sequences to a sequence of one or more isolated
> high surrogate code points. That is, each unmappable
> byte's value is a code point whose value is 0xDC00
> plus byte value.
>
> Notice "high surrogate" in that quote. I'm confused. I thought the low
> surrogate range started at 0xDC00, but this document is saying that 0xDC00 +
> byte value = high surrogate. Is that a typo in the document?

Yes.
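
This can be checked against Python 3, where PEP 383 is exposed as the "surrogateescape" error handler. The sketch below assumes a single undecodable byte 0xFF in otherwise valid UTF-8; the variable names are only illustrative. The escaped value falls in the low surrogate range, not the high one.

```python
# Decode bytes that are not valid UTF-8 using the PEP 383 error handler.
raw = b"abc\xff"
text = raw.decode("utf-8", errors="surrogateescape")

escaped = ord(text[-1])

# The unmappable byte becomes the code point 0xDC00 + byte value ...
assert escaped == 0xDC00 + 0xFF
# ... which lies in the low surrogate range U+DC00..U+DFFF.
assert 0xDC00 <= escaped <= 0xDFFF

# Encoding with the same handler restores the original bytes.
assert text.encode("utf-8", errors="surrogateescape") == raw
```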

--Ken
Received on Wed Jan 23 2013 - 13:15:54 CST
