RE: Surrogate pairs and UTF-8

From: Addison Phillips (addison@yahoo-inc.com)
Date: Wed Jun 21 2006 - 22:33:14 CDT

    I'm disturbed by something here.

    Pavils wrote:

    > I am a developer who needs to write a UTF-8 encoder and decoder in
    > JavaScript.

    In JavaScript, Strings (and thus text) are sequences of UTF-16 code
    units. Thus U+10000 is represented by the surrogate pair 0xD800 0xDC00. The
    String class treats these as two "characters" in a String object (in methods
    such as charCodeAt() or indexOf()).
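
    To make that concrete, here is a minimal sketch (the variable names are
    illustrative only) showing that a String holding U+10000 reports two
    code units, and how the pair maps back to the code point:

```javascript
// U+10000 written as its two UTF-16 code units (a surrogate pair).
var s = "\uD800\uDC00";

var unitCount = s.length;    // counts UTF-16 code units, not code points
var high = s.charCodeAt(0);  // the high (lead) surrogate
var low = s.charCodeAt(1);   // the low (trail) surrogate

// Recombine the pair into the code point it represents.
var codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
```

    Here unitCount is 2 and codePoint comes out as 0x10000, even though the
    text is a single character from the user's point of view.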

    In JavaScript there is no such thing as an "encoding". Text in the DOM or in
    documents, headers, and other text sources that you are manipulating is
    converted to/from the internal String class by the JavaScript runtime, which
    pays attention to HTTP headers and to what the browser thinks the encoding
    of the JS source file or the document being read or written is. The
    exception to this is when generating URIs from strings, for which there are
    a variety of escape methods (escape, unescape, encodeURI,
    encodeURIComponent, etc.). What I'm getting at here is: there are no data
    types or methods for manipulating bytes or character encodings. There is no
    JavaScript equivalent to the C char* or Java byte. There is no way that I'm
    aware of to write a UTF-8 encoder or decoder (i.e. code that converts a
    String to a UTF-8 byte sequence in an object or vice versa). There are,
    however, plenty of ways to put Strings into a UTF-8 file (or read them from
    a UTF-8 file).
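
    The URI escape functions are the one place where UTF-8 shows through:
    encodeURIComponent percent-escapes each character using its UTF-8 form.
    As a rough sketch only, the hypothetical helper below (utf8ByteValues is
    not a built-in) reads those escapes back into an array of numbers. It
    still yields plain numbers, not a byte type.

```javascript
// Hypothetical helper: returns the UTF-8 byte values of a string as an
// array of numbers, derived from encodeURIComponent's percent-escapes.
function utf8ByteValues(str) {
  var escaped = encodeURIComponent(str); // throws URIError on a lone surrogate
  var bytes = [];
  for (var i = 0; i < escaped.length; i++) {
    if (escaped.charAt(i) === "%") {
      // "%XX" stands for one escaped byte.
      bytes.push(parseInt(escaped.substring(i + 1, i + 3), 16));
      i += 2;
    } else {
      // Characters left unescaped are ASCII, so one byte each.
      bytes.push(escaped.charCodeAt(i));
    }
  }
  return bytes;
}
```

    For example, utf8ByteValues("\uD800\uDC00") gives the four values
    0xF0, 0x90, 0x80, 0x80, the UTF-8 form of U+10000.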

    There is usually something (else) wrong when a developer is trying to do
    this in JavaScript.

    Pavils: what is it you are trying to do that you think requires you to
    encode or decode UTF-8?

    Addison

    Addison Phillips
    Internationalization Architect - Yahoo! Inc.

    Internationalization is an architecture.
    It is not a feature.

    > -----Original Message-----
    > From: unicode-bounce@unicode.org
    > [mailto:unicode-bounce@unicode.org] On Behalf Of Mike
    > Sent: Wednesday, June 21, 2006 16:34
    > To: unicode@unicode.org
    > Subject: Re: Surrogate pairs and UTF-8
    >
    > If you come across a surrogate in a UTF-8 stream, you should treat
    > it as an error (e.g. throw an exception or something).
    >
    > However, if you are converting from UTF-16 to UTF-8, then you will
    > need two surrogates (a high surrogate and a low surrogate) to
    > determine which character is encoded. Table 3-4 in the link you
    > cited shows how to convert from surrogate pairs to code points.
    >
    > Once you know which code point is encoded, use Table 3-5 to compute
    > the byte values in the UTF-8 sequence.
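
    The two steps Mike describes (surrogate pair to code point, then code
    point to UTF-8 bytes) might look roughly like this in JavaScript. This
    is a sketch under assumptions: utf16ToUtf8Bytes is a hypothetical name,
    the "bytes" are just numbers in an array, and unpaired surrogates are
    rejected with an exception.

```javascript
// Hypothetical helper: converts a JavaScript string (UTF-16 code units)
// to an array of UTF-8 byte values.
function utf16ToUtf8Bytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var cp = str.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      var low = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (low < 0xDC00 || low > 0xDFFF) {
        throw new Error("Unpaired high surrogate at index " + i);
      }
      // Combine the pair into one code point in U+10000..U+10FFFF.
      cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
      i++; // the low surrogate has been consumed
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      throw new Error("Unpaired low surrogate at index " + i);
    }
    // Distribute the code point's bits over 1 to 4 bytes.
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));
    } else {
      bytes.push(0xF0 | (cp >> 18),
                 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}
```

    Note that the surrogate pair never appears in the output: U+10000
    becomes one four-byte sequence, not two three-byte ones.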
    >
    > Mike
    >
    > P.S. Here is an array you should find useful in determining how to
    > decode UTF-8 sequences:
    >
    > const uchar Utf8Length[256] = {
    > 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    > 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    > 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    > 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    > 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    > 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    > 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    > 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    > 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    > 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    > 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    > 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8
    > };
    >
    > Use the first byte of the UTF-8 sequence as an index into this
    > array. The value is the length n of the UTF-8 sequence. If it is
    > not in the range 1-4, then you have run into an error. Once you
    > have the length, if it is in the range 2-4, check the next n-1
    > bytes to make sure each of them maps to 0 in the array (that is,
    > each one is a continuation byte).
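
    A decoder built on that idea could be sketched as follows. The names
    utf8Length, MIN_FOR_LENGTH and utf8BytesToString are hypothetical, the
    input is assumed to be an array of integers in the range 0-255, and the
    lookup is written as a function instead of a 256-entry table. The
    checks for overlong forms and for values above U+10FFFF go beyond what
    the quoted message describes and are an addition of this sketch.

```javascript
// Hypothetical helper standing in for the quoted Utf8Length table:
// 1 for ASCII, 0 for continuation bytes, 2-4 for lead bytes. Returning -1
// for 0xF8-0xFF is an assumption of this sketch (the table uses 5-8).
function utf8Length(b) {
  if (b < 0x80) return 1;
  if (b < 0xC0) return 0;
  if (b < 0xE0) return 2;
  if (b < 0xF0) return 3;
  if (b < 0xF8) return 4;
  return -1;
}

// Smallest code point that legitimately needs a sequence of length n.
var MIN_FOR_LENGTH = [0, 0, 0x80, 0x800, 0x10000];

// Hypothetical decoder: array of byte values in, JavaScript string out.
function utf8BytesToString(bytes) {
  var out = "";
  var i = 0;
  while (i < bytes.length) {
    var n = utf8Length(bytes[i]);
    if (n < 1 || i + n > bytes.length) {
      throw new Error("Invalid or truncated sequence at byte " + i);
    }
    // Keep only the payload bits of the lead byte.
    var cp = n === 1 ? bytes[i] : bytes[i] & (0xFF >> (n + 1));
    for (var k = 1; k < n; k++) {
      if (utf8Length(bytes[i + k]) !== 0) {
        throw new Error("Expected a continuation byte at byte " + (i + k));
      }
      cp = (cp << 6) | (bytes[i + k] & 0x3F);
    }
    if (cp < MIN_FOR_LENGTH[n] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF)) {
      // Overlong form, out of range, or an encoded surrogate.
      throw new Error("Ill-formed sequence at byte " + i);
    }
    if (cp < 0x10000) {
      out += String.fromCharCode(cp);
    } else {
      // Split into a surrogate pair for the UTF-16 based String.
      cp -= 0x10000;
      out += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    }
    i += n;
  }
  return out;
}
```

    With this sketch, the bytes 0xF0 0x90 0x80 0x80 decode to the pair
    0xD800 0xDC00, while 0xED 0xA0 0x80 (a surrogate encoded directly in
    UTF-8) is rejected as an error.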
    >
    > Pavils Jurjans wrote:
    > > Hello all,
    > >
    > > I am a developer who needs to write a UTF-8 encoder and decoder in
    > > JavaScript. I've found the encoding form in the link
    > > http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703, and that
    > > is pretty much what I need to do the job. However, I am completely
    > > lacking in-depth information about surrogate pairs and how to handle
    > > them in UTF-8. So, here are the questions I am looking to answer:
    > > - I have read the theoretical definition of what a surrogate pair is.
    > >   However, I have never seen any in "life". Can you give an example of
    > >   some surrogate pairs, and what their respective characters look like?
    > > - The guides on the unicode.org site talk only about surrogate pairs
    > >   and UTF-16 conversion. How about UTF-8?
    > >
    > > Thank you for any clues.
    > >
    > > With kind regards,
    > > Pavils Jurjans
    > >
    >
    >


