From: Rick Cameron (Rick.Cameron@businessobjects.com)
Date: Wed Jun 21 2006 - 14:23:47 CDT
However, if you are converting between UTF-8 and UTF-16 you do need to
take surrogates into account.
Perhaps the best approach is to go via UTF-32: for example, when
converting from UTF-16 to UTF-8, iterate through the array of UTF-16
code units, converting each code point to a UTF-32 code unit, then
convert the UTF-32 code unit to UTF-8. When iterating, you would check
whether the current UTF-16 code unit is the start of a surrogate pair or
not, and consume one or two code units as appropriate.
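
As a rough illustration of the approach described above, here is a
minimal Python sketch. The function names utf16_to_utf8 and encode_utf8
are made up for this example, and the input is assumed to be a list of
integer UTF-16 code units. Raising an error on an unpaired surrogate is
one possible policy; substituting U+FFFD would be another.

```python
def encode_utf8(cp):
    """Encode one code point (a UTF-32 value, as an int) as UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_to_utf8(units):
    """Convert a sequence of UTF-16 code units (ints) to UTF-8 bytes."""
    out = bytearray()
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                cp = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
                i += 2  # consumed two code units
            else:
                raise ValueError("unpaired high surrogate at index %d" % i)
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate at index %d" % i)
        else:
            cp = u  # BMP code unit: the code point is the unit itself
            i += 1  # consumed one code unit
        out += encode_utf8(cp)
    return bytes(out)
```

For example, utf16_to_utf8([0x41, 0xD834, 0xDD1E]) yields the byte "A"
followed by the four-byte UTF-8 sequence F0 9D 84 9E for U+1D11E.
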
From: email@example.com [mailto:firstname.lastname@example.org] On
Behalf Of Mike Ayers
Sent: Wednesday, 21 June 2006 11:31
To: Pavils Jurjans
Subject: Re: Surrogate pairs and UTF-8
Pavils Jurjans wrote:
> - The guides on unicode.org <http://unicode.org/> site talk only about
> surrogate pair and UTF-16 conversion. How about the UTF-8?
Surrogates do not exist in UTF-8. They are the mechanism by which
UCS-2 (which encodes 16 bits) was simultaneously restricted and extended
to become UTF-16 (which encodes 21 bits). Surrogates are not
characters. They are UTF-16 code units only.
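
A short Python check of this point, offered as an illustration rather
than as part of the original message. The helper is_valid_utf8 and the
variable names are invented for the example. It relies on Python 3's
built-in strict "utf-8" codec, which rejects surrogate values.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


clef = "\U0001D11E"  # one code point outside the BMP

# In UTF-16 the code point is carried by a surrogate pair (D834 DD1E).
utf16_bytes = clef.encode("utf-16-be")

# In UTF-8 the same code point is a single four-byte sequence.
utf8_bytes = clef.encode("utf-8")

# Encoding each surrogate separately as three bytes gives six bytes
# that are not well-formed UTF-8.
surrogates_as_bytes = b"\xed\xa0\xb4\xed\xb4\x9e"

print(utf16_bytes.hex(), utf8_bytes.hex(), is_valid_utf8(surrogates_as_bytes))
```
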