RE: UTF-16 clarification needed

From: Phillips, Addison (addison@amazon.com)
Date: Fri Jul 04 2008 - 10:31:43 CDT

Next message: Philippe Verdy: "RE: how to add all latin (and greek) subscripts"

Previous message: Michael Everson: "Re: Capital Sharp S in the News"
In reply to: Jeroen Ruigrok van der Werven: "Re: UTF-16 clarification needed"
Next in thread: Doug Ewell: "Re: UTF-16 clarification needed"

See Section 3.8 in the standard:

http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G2212

In my experience, it is a lot clearer to folks if you do not refer to surrogate code points as anything other than reserved. UTF-16 uses code units to encode Unicode code points.

Formally, the code points in Unicode run from 0 through 0x10FFFF, so the surrogate code points are code points. However, the code points from U+D800 through U+DFFF are reserved and do not encode characters. Section 3.9 says:

"Each encoding form maps the Unicode code points U+0000..U+D7FF and
U+E000..U+10FFFF to unique code unit sequences."

So, the surrogate pair (of code units) encodes a code point (U+20045 in your example).
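
To make the arithmetic concrete, here is a rough Python sketch of how a code point becomes UTF-16 code units. The function names are mine, made up for illustration (they are not from the standard or any library), and this only covers the mapping itself, not byte order or BOM handling.

```python
from typing import List


def is_surrogate_code_point(cp: int) -> bool:
    """True for the reserved range U+D800..U+DFFF (hypothetical helper)."""
    return 0xD800 <= cp <= 0xDFFF


def to_utf16_code_units(cp: int) -> List[int]:
    """Map one code point to its UTF-16 code units (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if is_surrogate_code_point(cp):
        # Reserved code points: no encoding form maps them.
        raise ValueError("surrogate code points do not encode characters")
    if cp < 0x10000:
        # BMP code point: a single 16-bit code unit.
        return [cp]
    # Supplementary code point: split the 20-bit offset into two halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # lead (high) surrogate code unit
    low = 0xDC00 + (offset & 0x3FF)   # trail (low) surrogate code unit
    return [high, low]


units = to_utf16_code_units(0x20045)
print(" ".join(f"{u:04X}" for u in units))  # D840 DC45
```

So U+20045 is one code point, and D840 DC45 is the pair of code units that encodes it. Python's own codec agrees: chr(0x20045).encode("utf-16-be") gives the bytes D8 40 DC 45.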

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
> On Behalf Of Jeroen Ruigrok van der Werven
> Sent: Friday, July 04, 2008 12:09 AM
> To: Doug Ewell
> Cc: Unicode Mailing List
> Subject: Re: UTF-16 clarification needed
>
> -On [20080704 08:47], Doug Ewell (dewell@roadrunner.com) wrote:
> >They are both UTF-16 code units and code points. They are not Unicode
> >scalar values.
>
> OK, and when you have them together in a surrogate pair, do you call it a
> pair of code units or can you also call them a pair of code points?
>
> --
> Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
> イェルーンラウフロックヴァンデルウェルヴェン
> http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
> A wise man that walks in the dark with a blindfold on, is not much of a
> wise man...


This archive was generated by hypermail 2.1.5 : Fri Jul 04 2008 - 10:34:39 CDT