RE: UTF-16 clarification needed

From: Phillips, Addison
Date: Fri Jul 04 2008 - 10:31:43 CDT


    See Section 3.8 in the standard:

    In my experience, it is a lot clearer to folks if you do not refer to surrogate code points as anything other than reserved. UTF-16 uses code units to encode Unicode code points.

    Formally, the code points in Unicode run from 0 through 0x10FFFF, so the surrogate code points are code points. However, the code points from U+D800 through U+DFFF are reserved and do not encode characters. Section 3.9 says:

    "Each encoding form maps the Unicode code points U+0000..U+D7FF and
    U+E000..U+10FFFF to unique code unit sequences."
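
To make the distinction concrete, here is a minimal Python sketch of the ranges described above. The function names are hypothetical and chosen only for illustration; they are not taken from the standard or from any library.

```python
# Illustrative only: these helper names are assumptions, not part of any API.
MAX_CODE_POINT = 0x10FFFF


def is_code_point(value):
    """True for any integer in the Unicode codespace, 0 through 0x10FFFF."""
    return 0 <= value <= MAX_CODE_POINT


def is_surrogate_code_point(value):
    """True for the reserved range U+D800..U+DFFF."""
    return 0xD800 <= value <= 0xDFFF


def is_scalar_value(value):
    """True for code points that an encoding form maps to code units:
    U+0000..U+D7FF and U+E000..U+10FFFF."""
    return is_code_point(value) and not is_surrogate_code_point(value)


if __name__ == "__main__":
    for cp in (0x0041, 0xD800, 0xDFFF, 0xE000, 0x20045, 0x110000):
        print(f"{cp:#x}: code point={is_code_point(cp)}, "
              f"scalar value={is_scalar_value(cp)}")
```

Under these definitions, 0xD800 is a code point but not a scalar value, which matches the wording quoted from Section 3.9.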

    So, the surrogate pair (of code units) encodes a code point (U+20045 in your example).
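
As a rough illustration of that last point, the following Python sketch computes the two UTF-16 code units for a supplementary code point using the usual surrogate arithmetic, then cross-checks the result against Python's built-in `utf-16-be` codec. The function name `to_surrogate_pair` is an assumption made for this example.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 code units for a supplementary code point.

    Hypothetical helper for illustration; only valid for U+10000..U+10FFFF.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x20045)
    print(f"U+20045 -> {high:04X} {low:04X}")

    # Cross-check against the standard library codec (big-endian, no BOM).
    encoded = chr(0x20045).encode("utf-16-be")
    assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The pair consists of two 16-bit code units whose values fall in the surrogate range; together they encode the single code point U+20045.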


    Addison Phillips
    Globalization Architect -- Lab126

    Internationalization is not a feature.
    It is an architecture.

    > -----Original Message-----
    > From: []
    > On Behalf Of Jeroen Ruigrok van der Werven
    > Sent: Friday, July 04, 2008 12:09 AM
    > To: Doug Ewell
    > Cc: Unicode Mailing List
    > Subject: Re: UTF-16 clarification needed
    > -On [20080704 08:47], Doug Ewell wrote:
    > >They are both UTF-16 code units and code points. They are not Unicode
    > >scalar values.
    > OK, and when you have them together in a surrogate pair, do you call it a
    > pair of code units or can you also call them a pair of code points?
    > --
    > Jeroen Ruigrok van der Werven <asmodai(-at-)> /
    > asmodai
    > イェルーン ラウフロック ヴァン デル ウェルヴェン
    > | | GPG: 2EAC625B
    > A wise man that walks in the dark with a blindfold on, is not much of a
    > wise man...
