Re: UTF-16 clarification needed

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 07 2008 - 15:02:04 CDT

Maybe in reply to: Jeroen Ruigrok van der Werven: "UTF-16 clarification needed"

Doug Ewell responded:

> But they do exist as ***code points***. TUS is clear there too, in
> definitions D9 and D10.

Correct.

>
> I'd like to wait for Ken or Mark or somebody to issue a bull on this.

*hehe*

> I
> think I gave the correct answer to the question Jeroen asked, and you
> are giving the correct answer for the question you think Jeroen really
> meant to ask.

I think Addison answered the question well, actually. There isn't
a whole lot to add to that, but I'll maunder on, anyway...

Jeroen's follow-up question was:

> OK, and when you have them together in a surrogate pair, do you call it a
> pair of code units or can you also call them a pair of code points?

The way to think about this clearly is to specify the *context* in
which you "have a surrogate pair".

If you are talking about a UTF-16 string, then what that string consists
of (if well-formed) is a sequence of UTF-16 code *units*. In that
context:

<0041 D840 DC45 0041>

is a sequence of 4 UTF-16 code units, two of which constitute a
well-formed surrogate pair.

That UTF-16 string can be *interpreted* as a sequence of Unicode
code points, namely:

<0041, 20045, 0041>

and from the standard, we know that the code point value 0041
(or U+0041) represents LATIN CAPITAL LETTER A and the code point
value 20045 (or U+20045) represents CJK UNIFIED IDEOGRAPH-20045.
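
As a rough illustration of that interpretation step, here is a minimal
Python sketch. The function name decode_utf16 is hypothetical and not
taken from the standard or any library; it only shows the arithmetic
that maps the code units above onto code points, and it rejects
unpaired surrogates as ill-formed.

```python
def decode_utf16(units):
    """Interpret a sequence of UTF-16 code units as Unicode code points.

    Hypothetical helper for illustration only. Raises ValueError when a
    surrogate code unit is not part of a well-formed surrogate pair.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                high = unit - 0xD800
                low = units[i + 1] - 0xDC00
                code_points.append(0x10000 + (high << 10) + low)
                i += 2
                continue
            raise ValueError(f"unpaired high surrogate {unit:04X} at index {i}")
        if 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError(f"unpaired low surrogate {unit:04X} at index {i}")
        # Any other code unit stands for the code point of the same value.
        code_points.append(unit)
        i += 1
    return code_points


units = [0x0041, 0xD840, 0xDC45, 0x0041]   # 4 code units
code_points = decode_utf16(units)          # 3 code points
print([f"{cp:04X}" for cp in code_points])
```

Four code units go in and three code points come out; the pair
<D840 DC45> is consumed as a single unit of interpretation.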

In the context of your UTF-16 string, by the definition of UTF-16
(D91), the isolated code unit value of D840 by itself can*not*
be interpreted as a Unicode code point -- it is only part of the
surrogate pair, i.e. part of a sequence of bits that together
represent U+20045.

However, outside of the context of your UTF-16 string, and
considered in the context of the architecture of the overall
standard, U+D840 certainly *is* a code point. It has a designated
function, but that function requires that it never be assigned
an abstract character.

When you are talking about UTF-16 strings, however, you are
best off simply ignoring the status of U+D840 in the overall
architecture. In UTF-16, D840 by itself is no more meaningful than
a BF byte value would be by itself in UTF-8.
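
The comparison can be checked with Python's built-in codecs. This is
only a sketch: the helper name is_well_formed and the sample byte
strings are made up for illustration, and the big-endian form
"utf-16-be" is chosen arbitrarily so that no byte order mark is needed.

```python
import unicodedata


def is_well_formed(data, encoding):
    """Return True if the bytes decode cleanly in the given encoding.

    Hypothetical helper; it relies on Python's strict decoders raising
    UnicodeDecodeError for ill-formed input.
    """
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# <0041 D840 DC45 0041>: contains a well-formed surrogate pair.
pair_utf16 = bytes([0x00, 0x41, 0xD8, 0x40, 0xDC, 0x45, 0x00, 0x41])
# <0041 D840 0041>: D840 by itself, not followed by a low surrogate.
lone_surrogate_utf16 = bytes([0x00, 0x41, 0xD8, 0x40, 0x00, 0x41])
# A lone BF byte: a UTF-8 trailing byte with no lead byte.
lone_trail_utf8 = bytes([0xBF])

print(is_well_formed(pair_utf16, "utf-16-be"))
print(is_well_formed(lone_surrogate_utf16, "utf-16-be"))
print(is_well_formed(lone_trail_utf8, "utf-8"))

# Outside any encoding form, U+D840 is still a code point, with
# General_Category Cs (Surrogate).
surrogate_category = unicodedata.category(chr(0xD840))
print(surrogate_category)
```

The decoders reject both the isolated D840 and the isolated BF, while
the character database still reports U+D840 as a code point of the
surrogate category.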

--Ken

