Re: Surrogate pairs and UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jun 21 2006 - 15:15:10 CDT


    Pavils Jurjans asked:

    > - I have read the theoretical definition of what a surrogate pair is.
    > However, I have never seen any in "life". Can you give an example of some
    > surrogate pairs, and how do their respective character look like?
    > - The guides on unicode.org site talk only about surrogate pair and
    > UTF-16 conversion. How about the UTF-8?

    "Surrogate pairs" don't exist in UTF-8.

    A surrogate pair is the sequence of two 16-bit code units required
    to represent a Unicode code point in the range U+10000..U+10FFFF
    in UTF-16.
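
    As a rough sketch of that mapping (Python; the helper name
    to_surrogate_pair is made up for illustration and is not from the
    standard), the 20 bits left after subtracting 0x10000 are split
    into two 10-bit halves:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> DC00..DFFF
    return high, low


high, low = to_surrogate_pair(0x10302)
print(f"<{high:04X} {low:04X}>")         # prints <D800 DF02>
```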

    That same range of code points is represented by 4-byte
    sequences in UTF-8, as defined by Table 3-5 and Table 3-6,
    which you were referring to in The Unicode Standard, Version 4.0.
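
    For this range the 4-byte form packs the 21 bits of the code
    point as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. A minimal sketch in
    Python, assuming a hypothetical helper utf8_four_bytes that only
    handles U+10000..U+10FFFF (shorter sequences are left out):

```python
def utf8_four_bytes(code_point):
    """Encode a code point in U+10000..U+10FFFF as its 4-byte UTF-8 form."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch only covers U+10000..U+10FFFF")
    return bytes([
        0xF0 | (code_point >> 18),            # lead byte: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),   # next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),    # next 6 bits
        0x80 | (code_point & 0x3F),           # last 6 bits
    ])


encoded = utf8_four_bytes(0x10302)
print(" ".join(f"{b:02X}" for b in encoded))  # prints F0 90 8C 82
```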

    Look at Table 3-3, Examples of Unicode Encoding Forms.

    U+10302 is represented in UTF-32 by the 32-bit code unit: 0x00010302

    U+10302 is represented in UTF-8 by the 4-byte sequence: <F0 90 8C 82>

    U+10302 is represented in UTF-16 by the sequence of two 16-bit
              code units: <D800 DF02>
              
    That last encoding, in UTF-16 only, is referred to as a "surrogate pair".
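
    If you want to see these for yourself, one way (a sketch using
    Python 3's built-in codecs; the variable names are mine) is to
    encode U+10302 in each form and print the code units. It also
    shows that Python's UTF-8 codec refuses a lone surrogate code
    point such as U+D800:

```python
def code_units(text, codec, unit_size):
    """Encode text and return its code units as upper-case hex strings."""
    data = text.encode(codec)
    return [data[i:i + unit_size].hex().upper()
            for i in range(0, len(data), unit_size)]


ch = chr(0x10302)
utf32_units = code_units(ch, "utf-32-be", 4)   # ['00010302']
utf8_units = code_units(ch, "utf-8", 1)        # ['F0', '90', '8C', '82']
utf16_units = code_units(ch, "utf-16-be", 2)   # ['D800', 'DF02']
print(utf32_units, utf8_units, utf16_units)

# A surrogate code point on its own is not encodable in UTF-8.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_encodable = True
except UnicodeEncodeError:
    lone_surrogate_encodable = False
print(lone_surrogate_encodable)                # prints False
```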

    --Ken
              


