Re: Surrogate pairs and UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jun 21 2006 - 15:15:10 CDT

Next message: Richard Wordingham: "Re: Surrogate pairs and UTF-8"

Previous message: Rick Cameron: "RE: Surrogate pairs and UTF-8"
Maybe in reply to: Pavils Jurjans: "Surrogate pairs and UTF-8"
Next in thread: Richard Wordingham: "Re: Surrogate pairs and UTF-8"

Pavils Jurjans asked:

> - I have read the theoretical definition of what a surrogate pair is.
> However, I have never seen any in "life". Can you give an example of some
> surrogate pairs, and how do their respective character look like?
> - The guides on unicode.org site talk only about surrogate pair and
> UTF-16 conversion. How about the UTF-8?

"Surrogate pairs" don't exist in UTF-8.

A surrogate pair is the sequence of two 16-bit code units required
to represent a Unicode code point in the range U+10000..U+10FFFF
in UTF-16.
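
As a minimal sketch of the UTF-16 arithmetic behind that definition
(the function name to_surrogate_pair is only an illustrative choice,
not something taken from the standard):

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # low 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x10302)
print(f"{high:04X} {low:04X}")  # D800 DF02
```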

That same range of code points is represented by 4-byte
sequences in UTF-8, as defined by Table 3-5 and Table 3-6,
which you were referring to in The Unicode Standard, Version 4.0.
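
For comparison, here is a rough sketch of how the 4-byte UTF-8 form
distributes the bits of such a code point. The helper name
utf8_four_bytes is hypothetical, and the sketch handles only the
supplementary range discussed here:

```python
def utf8_four_bytes(code_point):
    """Encode a code point in U+10000..U+10FFFF as 4 UTF-8 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch covers only U+10000..U+10FFFF")
    return bytes([
        0xF0 | (code_point >> 18),            # 11110xxx: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (code_point & 0x3F),           # 10xxxxxx: last 6 bits
    ])


print(utf8_four_bytes(0x10302).hex().upper())  # F0908C82
```

No surrogate code units appear anywhere in this computation; the
bytes come directly from the code point.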

Look at Table 3-3, Examples of Unicode Encoding Forms.

U+10302 is represented in UTF-32 by the 32-bit code unit: 0x00010302

U+10302 is represented in UTF-8 by the 4-byte sequence: <F0 90 8C 82>

U+10302 is represented in UTF-16 by the sequence of two 16-bit
code units: <D800 DF02>

That last encoding, in UTF-16 only, is referred to as a "surrogate pair".
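
These three forms can be checked with Python's built-in codecs. This
is only a quick sketch; the big-endian codec names are chosen so that
the bytes print in the same order as the code units above and no byte
order mark is added:

```python
ch = chr(0x10302)  # the character U+10302

utf32 = ch.encode("utf-32-be")  # one 32-bit code unit
utf8 = ch.encode("utf-8")       # four 8-bit code units
utf16 = ch.encode("utf-16-be")  # two 16-bit code units (surrogate pair)

print(utf32.hex().upper())  # 00010302
print(utf8.hex().upper())   # F0908C82
print(utf16.hex().upper())  # D800DF02
```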

--Ken

