Re: U+xxxx, U-xxxxxx, and the basics

From: Kenneth Whistler
Date: Fri Mar 03 2000 - 21:06:09 EST


> Here is a way of representing the abstract character itself, using its
> scalar value:
> * in Unicode notation: U-00212B

ISO/IEC 10646-1:2000, Clause 6.5 Identifiers for characters (( derived from
Amendment 9 )) specifies the following syntax for the "short identifier":

     { U | u }[ {+}xxxx | {-}xxxxxxxx ]

That implies the following options:

     212B U212B u212B +212B U+212B u+212B
     0000212B U0000212B u0000212B -0000212B U-0000212B u-0000212B

The editors of the Unicode Standard chose not to make use of all of those
options -- particularly the forms prefixed merely with "+" or "-", which
look confusingly like signed integers. The options used in
the Unicode Standard, as documented in the Notations section, are:

     212B U+212B
     0000212B U-0000212B

Note that the "U-" notation of (what will come to be called) the UTF-32
form always uses 8 hex digits. It is conceivable that five and/or six
digit hex forms will be introduced in the near future, since nobody really
wants to keep writing all the extra leading zeroes. But as it stands
currently, 5- or 6-digit shortened forms are not officially used in the
documentation for the standard.
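
For illustration only, here is a minimal Python sketch that produces the
two prefixed forms: "U+" with 4 hex digits, and "U-" with 8 hex digits.
The function name short_identifier, its long_form parameter, and the
0x10FFFF upper limit are assumptions of this example, not something the
standard defines.

```python
def short_identifier(code_point: int, long_form: bool = False) -> str:
    """Format a code point as 'U+xxxx' or 'U-xxxxxxxx' (hypothetical helper)."""
    # Assumption: restrict to the range reachable through UTF-16.
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"code point out of range: {code_point:#x}")
    if code_point <= 0xFFFF and not long_form:
        return f"U+{code_point:04X}"   # 4-digit form
    return f"U-{code_point:08X}"       # 8-digit form, always zero-padded
```

The sketch falls back to the 8-digit form for anything above 0xFFFF,
since 5- or 6-digit shortened forms are not officially used.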

> In UTF-16, each 16-bit code value in the 0x0..0xD7FF range and the
> 0xE000..0xFFFF range directly corresponds to the same scalar value, while a
> "surrogate" pair of 16-bit code values algorithmically represents a single
> scalar value in the range 0x010000..0x10FFFF. The first half of the pair is
> always in the 0xD800..0xDBFF range, and the second half of the pair is in
> the 0xDC00..0xDFFF range. Unicode 3.0 and ISO/IEC 10646-1:2000 have adopted the
> UTF-16 mechanism as the only official usage of the 0xD800..0xDFFF scalar
> range.
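
The surrogate arithmetic can be sketched in a few lines of Python. This is
a minimal illustration that assumes high (first) surrogates in
0xD800..0xDBFF and low (second) surrogates in 0xDC00..0xDFFF; the function
names are made up for the example.

```python
from typing import Tuple


def to_surrogate_pair(scalar: int) -> Tuple[int, int]:
    """Split a scalar value in 0x10000..0x10FFFF into two 16-bit code values."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError(f"not representable as a surrogate pair: {scalar:#x}")
    offset = scalar - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For the scalar value 0x10335, to_surrogate_pair gives 0xD800 and 0xDF35.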

> Here are various ways of representing the proposed abstract character named
> "GOTHIC LETTER Q" (which will probably be assigned to the Unicode scalar
> value 0x10335):
> * in Unicode notation, by its Unicode scalar value: U-010335
> * as a UCS-4 code value sequence, in hex notation: 0x00010335
> * as a UCS-2 code value sequence: illegal; out of range
> * as a UTF-16 code value sequence, in hex notation: 0xD800 0x0336
                                                         0xD800 0xDF35
> * in Unicode notation, by its Unicode value pair: U+D800 U+0336
                                                       U+D800 U+DF35
> * in EBNF notation: \u212B \u0336
                         \uD800 \uDF35
> * as a UTF-8 code value sequence, in hex notation: 0xF0 0x90 0x8C 0xB5

Other than these fixes, this text looked quite accurate to me.
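
For anyone who wants to check such values mechanically, here is a small
Python sketch that uses the language's built-in codecs to list the UCS-4,
UTF-16 and UTF-8 code value sequences for a scalar value. The function
name representations and the output layout are assumptions chosen for this
example.

```python
def representations(code_point: int) -> dict:
    """Return hex strings for the UCS-4, UTF-16 and UTF-8 forms of a scalar."""
    ch = chr(code_point)
    # Big-endian UTF-16 without a byte order mark, split into 16-bit units.
    utf16 = ch.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    return {
        "ucs4": "0x%08X" % code_point,
        "utf16": " ".join("0x%04X" % unit for unit in units),
        "utf8": " ".join("0x%02X" % byte for byte in ch.encode("utf-8")),
    }
```

Calling representations(0x10335) yields 0x00010335 for UCS-4,
0xD800 0xDF35 for UTF-16, and 0xF0 0x90 0x8C 0xB5 for UTF-8.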

