RE: Origin of the U+nnnn notation

From: Hohberger, Clive (
Date: Tue Nov 08 2005 - 09:00:45 CST

    Adding to Philippe's excellent description, I think of the set {U+nnnn}
    as a set of ordinal numbers, as they represent positions in a table. The
    construct U-nnnn therefore is meaningless as an ordinal number.

    -----Original Message-----
    From: [] On
    Behalf Of Philippe Verdy
    Sent: Tuesday, November 08, 2005 8:05 AM
    To: Dominikus Scherkl; 'Jukka K. Korpela';
    Subject: Re: Origin of the U+nnnn notation

    From: "Dominikus Scherkl" <>
    >> I have been unable to hunt down the historical origin of the
    >> notation U+nnnn (where nnnn are hexadecimal digits) that we
    >> use to refer to characters (and code points).
    >> Presumably "U" stands for "UCS" or for "Unicode", but where
    >> does the plus sign come from?
    > Maybe it was thought of as an offset from the unit (character null),
    > as in ETA+5 minutes (the expected time of arrival passed five
    > minutes ago - a euphemism for being 5 minutes late).

    U-nnnn already exists (or I should say, it has existed). It referred
    to 16-bit code units, not really to characters, and was a fixed-width
    notation (with 4 hexadecimal digits). The "U" meant "Unicode" (1.0
    and before).

    U+[n...n]nnnn was created to avoid confusion with the past 16-bit
    Unicode 1.0 standard (which was not fully compatible with ISO/IEC
    10646 code points). It is a variable-width notation that refers to
    ISO/IEC 10646 code points. The "U" means "UCS" or "Universal
    Character Set". At that time, the UCS code point range was up to
    31 bits wide.

    The U-nnnn notation is abandoned now, except for references to
    Unicode 1.0. If one uses it, it will refer to one or more 16-bit code
    units needed to encode each codepoint (possibly with surrogate
    pairs). It does not designate abstract characters or codepoints
    unambiguously.
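
    To illustrate the "one or more 16-bit code units" point, here is a
    minimal Python sketch of the standard UTF-16 surrogate arithmetic.
    The function name utf16_code_units is only an illustrative choice,
    not something taken from either standard.

```python
def utf16_code_units(cp):
    """Return the list of 16-bit code units encoding code point cp in UTF-16."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded on their own")
    if cp < 0x10000:
        # Code points in the first plane fit in a single code unit.
        return [cp]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]
```

    So U+0041 needs one code unit (0041), while U+1D11E needs the
    surrogate pair D834 DD1E, which is why a 4-digit code unit value
    alone does not always identify a codepoint.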

    Later, the variable-width U+[n...n]nnnn notation was restricted to
    allow only codepoints in the first 17 planes of the joint ISO/IEC
    10646-1 and Unicode standards (so the only standard codepoints are
    between U+0000 and U+10FFFF, some of them being permanently assigned
    to non-characters).
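
    As a rough sketch of the restricted notation, the following Python
    formats and parses U+ references with at least 4 hexadecimal digits
    and rejects anything above U+10FFFF. The names format_code_point and
    parse_code_point are hypothetical, and accepting only uppercase
    digits is an assumption of this sketch.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # last code point of the 17th plane

# "U+" followed by 4 to 6 uppercase hexadecimal digits (assumed convention).
_U_PLUS = re.compile(r"U\+([0-9A-F]{4,6})")

def format_code_point(cp):
    """Format an integer code point as U+nnnn, padded to at least 4 digits."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("code point outside U+0000..U+10FFFF")
    return "U+%04X" % cp

def parse_code_point(text):
    """Parse a U+nnnn reference and return the code point as an integer."""
    match = _U_PLUS.fullmatch(text)
    if match is None:
        raise ValueError("not a U+nnnn reference: %r" % text)
    cp = int(match.group(1), 16)
    if cp > MAX_CODE_POINT:
        raise ValueError("code point outside U+0000..U+10FFFF")
    return cp
```

    With this sketch, format_code_point(0x41) gives "U+0041", and
    parse_code_point("U+110000") raises an error because that value lies
    beyond the 17 planes.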

    References to larger code points with U+[n...n]nnnn are discouraged,
    as they no longer designate valid code points in either standard.
    Their definition and use is then application-specific.

    There are '''no''' negative codepoints in either standard (U-0001
    does not designate the 32-bit code unit that you could store in a
    signed datatype; in the past standard it designated the same
    codepoint as U+0001 does now). Using "+" makes the statement about
    signs clear: standard code points all have non-negative values.

    So if you want a representation for negative code units, you need
    another notation (for example N-0001 to represent the code unit with
    negative value -1): this notation is application-specific.
