RE: Surrogate pairs and UTF-8

From: Rick Cameron (Rick.Cameron@businessobjects.com)
Date: Wed Jun 21 2006 - 14:23:47 CDT

    However, if you are converting between UTF-8 and UTF-16 you do need to
    take surrogates into account.

    Perhaps the best approach is to go via UTF-32: for example, when
    converting from UTF-16 to UTF-8, iterate through the array of UTF-16
    code units, decoding each code point into a single UTF-32 code unit,
    then encode that UTF-32 code unit as UTF-8. When iterating, check
    whether the current UTF-16 code unit is a high (leading) surrogate:
    if it is, consume it together with the low (trailing) surrogate that
    follows; otherwise consume just the one code unit.
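
    A minimal sketch of that approach in Python, for illustration only.
    The function names (utf16_to_code_points, code_point_to_utf8,
    utf16_to_utf8) are hypothetical, and raising ValueError on an unpaired
    surrogate is an assumption; a real converter might substitute U+FFFD
    instead.

```python
def utf16_to_code_points(units):
    """Yield code points (UTF-32 values) from a sequence of UTF-16 code units."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                lo = units[i + 1]
                yield 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
                i += 2  # consumed two code units
                continue
            # Assumption: treat an unpaired surrogate as an error.
            raise ValueError(f"unpaired high surrogate at index {i}")
        if 0xDC00 <= u <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        yield u  # BMP code point: one code unit
        i += 1


def code_point_to_utf8(cp):
    """Encode one code point (already free of surrogates) as UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_to_utf8(units):
    """Convert a sequence of UTF-16 code units to UTF-8 via code points."""
    return b"".join(code_point_to_utf8(cp)
                    for cp in utf16_to_code_points(units))


# 'A' followed by U+1D11E, which UTF-16 stores as the pair D834 DD1E.
result = utf16_to_utf8([0x0041, 0xD834, 0xDD1E])
```

    Here the surrogate pair is collapsed into one code point before any
    UTF-8 bytes are produced, so the UTF-8 encoder never sees a surrogate.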

    - rick

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    Behalf Of Mike Ayers
    Sent: Wednesday, 21 June 2006 11:31
    To: Pavils Jurjans
    Cc: unicode@unicode.org
    Subject: Re: Surrogate pairs and UTF-8

    Pavils Jurjans wrote:

    > - The guides on unicode.org <http://unicode.org/> site talk only about
    > surrogate pair and UTF-16 conversion. How about the UTF-8?

            Surrogates do not exist in UTF-8. They are the mechanism by which
    UCS-2 (which encodes 16 bits) was simultaneously restricted and extended
    to become UTF-16 (which encodes 21 bits). Surrogates are not
    characters. They occur only as UTF-16 code units.
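
    As a rough illustration using Python's built-in codecs (the choice of
    U+1D11E as the sample character is arbitrary), the same character
    needs a surrogate pair in UTF-16 but is a plain four-byte sequence in
    UTF-8, and the UTF-8 codec refuses to encode a lone surrogate:

```python
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point above U+FFFF

utf16_hex = ch.encode("utf-16-be").hex()  # two 16-bit code units: D834 DD1E
utf8_hex = ch.encode("utf-8").hex()       # one 4-byte sequence, no surrogates

# A surrogate on its own is not a character, so strict UTF-8 rejects it.
try:
    "\ud834".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True

print(utf16_hex, utf8_hex, lone_surrogate_rejected)
```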

            HTH,

    /|/|ike


