RE: Surrogate pairs and UTF-8

From: Rick Cameron ([email protected])
Date: Wed Jun 21 2006 - 14:23:47 CDT

Next message: Kenneth Whistler: "Re: Surrogate pairs and UTF-8"

Previous message: Mike Ayers: "Re: Surrogate pairs and UTF-8"
Maybe in reply to: Pavils Jurjans: "Surrogate pairs and UTF-8"
Next in thread: Kenneth Whistler: "Re: Surrogate pairs and UTF-8"

However, if you are converting between UTF-8 and UTF-16 you do need to
take surrogates into account.

Perhaps the best approach is to go via UTF-32: for example, when
converting from UTF-16 to UTF-8, iterate through the array of UTF-16
code units, decoding each code point into a UTF-32 code unit, then
encoding that UTF-32 code unit as UTF-8. When iterating, you would check
whether the current UTF-16 code unit is the start of a surrogate pair or
not, and consume one or two code units as appropriate.
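
A minimal sketch of that loop in Python, to make the "one or two code
units" step concrete. The function names utf16_units_to_utf8 and
encode_utf8 are made up for illustration, the input is assumed to be a
list of integers holding UTF-16 code units, and rejecting unpaired
surrogates with an error is just one possible policy:

```python
def encode_utf8(cp):
    """Encode one code point (an int, i.e. a UTF-32 code unit) as UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units_to_utf8(units):
    """Convert a sequence of UTF-16 code units (ints) to UTF-8 bytes."""
    out = bytearray()
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Start of a surrogate pair: a low surrogate must follow.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2  # consumed two code units
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            cp = u
            i += 1  # consumed one code unit
        out += encode_utf8(cp)
    return bytes(out)
```

For example, utf16_units_to_utf8([0x0041, 0xD834, 0xDD1E]) consumes one
code unit for U+0041 and two for U+1D11E, and the surrogates themselves
never reach the UTF-8 output.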

- rick

-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Mike Ayers
Sent: Wednesday, 21 June 2006 11:31
To: Pavils Jurjans
Cc: [email protected]
Subject: Re: Surrogate pairs and UTF-8

Pavils Jurjans wrote:

> - The guides on unicode.org <http://unicode.org/> site talk only about
> surrogate pair and UTF-16 conversion. How about the UTF-8?

Surrogates do not exist in UTF-8. They are the mechanism by which
UCS-2 (which encodes 16 bits) was simultaneously restricted and extended
to become UTF-16 (which encodes 21 bits). Surrogates are not
characters. They are used as UTF-16 code units only.
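
As a rough illustration of that mechanism, here is a short Python sketch
of how a code point above U+FFFF is split into a surrogate pair. The
function name to_surrogate_pair is hypothetical, and code points are
assumed to be plain integers:

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    v = cp - 0x10000            # 20-bit value
    high = 0xD800 | (v >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 | (v & 0x3FF)  # bottom 10 bits go into the low surrogate
    return high, low
```

The ranges D800..DBFF and DC00..DFFF are the part of the old 16-bit
space that was given up (the restriction) so that pairs of code units
can reach the code points above U+FFFF (the extension).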

HTH,

/|/|ike


This archive was generated by hypermail 2.1.5 : Wed Jun 21 2006 - 14:45:43 CDT