Re: Surrogate pairs and UTF-8

From: Mike (mike-list@pobox.com)
Date: Wed Jun 21 2006 - 18:33:54 CDT

    If you come across a surrogate in a UTF-8 stream, you should treat
    it as an error (for example, by throwing an exception).
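
    A minimal sketch of such a check, assuming the stream is available
    as a byte buffer. A surrogate (U+D800..U+DFFF) pushed through the
    UTF-8 bit layout always shows up as a three-byte sequence whose
    first byte is 0xED and whose second byte is in 0xA0..0xBF. The
    function name is illustrative and not taken from any library.

```c
#include <stddef.h>

/* Returns 1 if the bytes at s[0..2] are a UTF-8-style encoding of a
   surrogate code point (U+D800..U+DFFF), 0 otherwise.
   Hypothetical helper: `len` is the number of readable bytes at s. */
int utf8_is_encoded_surrogate(const unsigned char *s, size_t len)
{
    if (len < 3)
        return 0;
    return s[0] == 0xED               /* lead byte for U+D000..U+DFFF */
        && s[1] >= 0xA0 && s[1] <= 0xBF   /* narrows it to U+D800..U+DFFF */
        && (s[2] & 0xC0) == 0x80;     /* third byte is a continuation byte */
}
```

    A decoder would call this before accepting a three-byte sequence
    and report an error when it returns 1.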

    However, if you are converting from UTF-16 to UTF-8, then you will
    need two surrogates (a high surrogate followed by a low surrogate)
    to determine which character is encoded. Table 3-4 in the link you
    cited shows how to convert from surrogate pairs to code points.
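
    The arithmetic for combining a pair can be sketched as follows.
    This is an illustration only: the function name and the choice of
    0xFFFFFFFF as an error value are assumptions, not part of the
    standard.

```c
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into one code point.
   Returns the code point (U+10000..U+10FFFF), or 0xFFFFFFFF if `hi` is
   not a high surrogate or `lo` is not a low surrogate.
   Name and error value are illustrative choices. */
uint32_t utf16_pair_to_codepoint(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0xFFFFFFFFu;
    /* The high surrogate carries the upper 10 bits, the low surrogate
       the lower 10 bits, of (code point - 0x10000). */
    return 0x10000u
         + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

    As a real-world example, U+1D11E (MUSICAL SYMBOL G CLEF) is stored
    in UTF-16 as the pair D834 DD1E, and passing those two values to
    the function above gives 0x1D11E back.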

    Once you know which code point is encoded, use Table 3-5 to compute
    the byte values in the UTF-8 sequence.
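
    A sketch of that bit distribution in C, under the assumption that
    the caller supplies a four-byte output buffer. The function name
    and the "return 0 on error" convention are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1-4), or 0 if cp is a surrogate
   or lies above U+10FFFF. Name and return convention are illustrative. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* lone surrogate: not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

    For instance, U+1D11E encodes to the four bytes F0 9D 84 9E, and
    U+20AC (EURO SIGN) to the three bytes E2 82 AC.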

    Mike

    P.S. Here is an array you should find useful in determining how to
    decode UTF-8 sequences:

    const unsigned char Utf8Length[256] = {
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8
    };

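    Use the first byte of the UTF-8 sequence as an index into this
    array. The value n is the length of the UTF-8 sequence in bytes. If
    n is not in the range 1-4, you have run into an error (0 means a
    stray continuation byte; 5-8 mark bytes that cannot start a valid
    sequence). If n is in the range 2-4, look up each of the next n-1
    bytes in the same array and make sure each one returns 0, i.e. that
    it is a continuation byte (0x80-0xBF).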
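
    A sketch of that procedure. To keep the example self-contained it
    uses a small function, utf8_lead_length (a hypothetical name), that
    returns the same values as the Utf8Length table above instead of
    repeating all 256 entries. Both function names are illustrative.

```c
#include <stddef.h>

/* Same values as the Utf8Length table: 1 for ASCII, 0 for continuation
   bytes, 2-4 for lead bytes, 5-8 for bytes that cannot start a sequence. */
unsigned utf8_lead_length(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return b == 0xFE ? 7 : 8;
}

/* Returns the length (1-4) of the sequence starting at s, or 0 on error:
   a bad first byte, truncated input, or a following byte that is not a
   continuation byte. `avail` is the number of readable bytes at s. */
size_t utf8_sequence_length(const unsigned char *s, size_t avail)
{
    size_t n, i;

    if (avail == 0)
        return 0;
    n = utf8_lead_length(s[0]);
    if (n < 1 || n > 4 || n > avail)
        return 0;
    for (i = 1; i < n; i++)
        if (utf8_lead_length(s[i]) != 0)   /* must be 0x80..0xBF */
            return 0;
    return n;
}
```

    Note that this is only a structural check. On its own it does not
    reject overlong forms (such as C0 80), encoded surrogates, or
    values above U+10FFFF, so a complete decoder needs those tests as
    well.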

    Pavils Jurjans wrote:
    > Hello all,
    >
    > I am a developer who needs to write UTF-8 encoder and decoder in
    > JavaScript. I've found the encoding form in the link
    > http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703, and
    > that is pretty much what I need to do the job. However, I am
    > completely lacking in-depth information about the surrogate pairs
    > and how to handle them in UTF-8. So, here are the questions, what I
    > am looking for:
    > - I have read the theoretical definition of what a surrogate pair
    >   is. However, I have never seen any in "life". Can you give an
    >   example of some surrogate pairs, and how do their respective
    >   character look like?
    > - The guides on unicode.org site talk only about surrogate pair and
    >   UTF-16 conversion. How about the UTF-8?
    >
    > Thank you for any clues.
    >
    > With kind regards,
    > Pavils Jurjans


