Re: Surrogate pairs and UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jun 22 2006 - 08:16:23 CDT

  • Next message: Addison Phillips: "RE: Surrogate pairs and UTF-8"

    It's not impossible. UTF-8 strings may be found in contexts where the string was encoded and received using ISO-8859-1. Once they are parsed from the document in which they were inserted, those 8-bit code units are converted from the transfer encoding to the internal JavaScript string encoding, which is compatible with Unicode and exposed through the JavaScript API as 16-bit UTF-16 code units.

    It's perfectly possible to write a UTF-8 encoder/decoder in JavaScript, given that the original document in which the data was submitted may use an unspecified encoding, generally interpreted as ISO-8859-1 (which has a full one-to-one, reversible mapping between all possible bytes 0x00 to 0xFF and code points U+0000 to U+00FF, or equivalently UTF-16 code units 0x0000 to 0x00FF).
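    A minimal sketch of the decoding direction described above: a string received over an ISO-8859-1 channel carries UTF-8 bytes one per code unit (0x00-0xFF), and we rebuild the proper UTF-16 string from them. The function name utf8DecodeLatin1 is illustrative, not from the post, and the sketch assumes well-formed UTF-8 input (no validation of continuation bytes).

```javascript
// Decode UTF-8 bytes carried one-per-code-unit in a Latin-1 string
// back into a normal JavaScript (UTF-16) string.
// Hypothetical name; assumes well-formed input.
function utf8DecodeLatin1(raw) {
  var out = "";
  var i = 0;
  while (i < raw.length) {
    var b = raw.charCodeAt(i++); // each code unit holds one byte, 0x00-0xFF
    var cp;
    if (b < 0x80) {              // 1-byte sequence: ASCII
      cp = b;
    } else if (b < 0xE0) {       // 2-byte sequence
      cp = ((b & 0x1F) << 6) | (raw.charCodeAt(i++) & 0x3F);
    } else if (b < 0xF0) {       // 3-byte sequence
      cp = ((b & 0x0F) << 12) | ((raw.charCodeAt(i++) & 0x3F) << 6)
         | (raw.charCodeAt(i++) & 0x3F);
    } else {                     // 4-byte sequence (supplementary planes)
      cp = ((b & 0x07) << 18) | ((raw.charCodeAt(i++) & 0x3F) << 12)
         | ((raw.charCodeAt(i++) & 0x3F) << 6) | (raw.charCodeAt(i++) & 0x3F);
    }
    if (cp < 0x10000) {
      out += String.fromCharCode(cp);
    } else {
      // Supplementary code point: split into a UTF-16 surrogate pair.
      cp -= 0x10000;
      out += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    }
  }
  return out;
}
```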

    There are cases where such a JavaScript conversion is needed, and it's perfectly possible (and in fact easy) to convert between the natural JavaScript encoding, as seen in string.length, string.charCodeAt(), or string.indexOf(...), and UTF-8. Such a conversion is useful when JavaScript is used to generate documents, or responses to servers that handle only the UTF-8 encoding in some specific protocol (for example, when you need to compute binary signatures, or the encoded length of some part of the protocol).
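    The encoding direction can be sketched the same way: walk the string's UTF-16 code units with charCodeAt(), recombine surrogate pairs into code points, and emit UTF-8 byte values. The function name utf8Encode is illustrative, not from the post; it returns an array of byte values rather than writing to any stream.

```javascript
// Encode a JavaScript (UTF-16) string as an array of UTF-8 byte
// values 0x00-0xFF. Hypothetical name; a sketch, not a full validator
// (unpaired surrogates are not rejected).
function utf8Encode(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var cp = str.charCodeAt(i);
    // Recombine a high+low surrogate pair into one supplementary code point.
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < str.length) {
      var low = str.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        i++;
      }
    }
    if (cp < 0x80) {             // 1 byte
      bytes.push(cp);
    } else if (cp < 0x800) {     // 2 bytes
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {   // 3 bytes
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));
    } else {                     // 4 bytes
      bytes.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}
```

    Note that the surrogate-pair step is the part Pavils's question is really about: the encoder must consume two UTF-16 code units to produce one 4-byte UTF-8 sequence.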

    I see nothing wrong in writing such a conversion function, which has its applications. And I'm not sure that all JavaScript or ECMAScript interpreters expose UTF-16 code units for their strings. The only thing required is that JavaScript must be able to represent, and expose to the JavaScript application and API, any document or input that was originally encoded with a charset that has some mapping to Unicode. There's no guarantee that the JavaScript string will be preserved on output when it is sent to a stream using a charset that does not completely cover the UCS. (Normally such a conversion from the internal JavaScript string encoding to another encoding is performed by the stream object, according to its settings properties, for example an HTTP or MIME message object where you can set the charset used for encoding/decoding its stream.)

    Aren't there any JavaScript implementations that expose UTF-32 code units/code points for their string objects (and accordingly report the string length in terms of number of code points)?
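    As a quick illustration of the code-unit vs. code-point distinction at issue (a sketch; the variable names are mine, not from the post):

```javascript
// U+10000 is stored as the surrogate pair 0xD800 0xDC00, so .length
// reports 2 UTF-16 code units even though there is only 1 code point.
var s = "\uD800\uDC00";
var codeUnits = s.length;   // counts UTF-16 code units: 2
var codePoints = 0;
for (var i = 0; i < s.length; i++) {
  var c = s.charCodeAt(i);
  // Count every code unit except low (trailing) surrogates, so each
  // surrogate pair contributes one code point.
  if (c < 0xDC00 || c > 0xDFFF) codePoints++;
}
```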

    ----- Original Message -----
    From: "Addison Phillips" <addison@yahoo-inc.com>
    To: <unicode@unicode.org>
    Sent: Thursday, June 22, 2006 5:33 AM
    Subject: RE: Surrogate pairs and UTF-8

    > I'm disturbed by something here.
    >
    > Pavils wrote:
    >
    >> am a developer who needs to write UTF-8 encoder and decoder in
    >> JavaScript.
    >
    > In JavaScript, Strings (and thus text) are made up of arrays of UTF-16 code
    > units. Thus U+10000 is represented by the surrogate pair 0xD800 0xDC00. The
    > String class treats these as two "characters" in a String object (in methods
    > such as charCodeAt() or index()).
    >
    > In JavaScript there is no such thing as an "encoding". Text in the DOM or in
    > documents, headers, and other text sources that you are manipulating is
    > converted to/from the internal String class by the JavaScript runtime, which
    > is paying attention to HTTP headers and what the browser thinks the encoding
    > of the JS source file or the document being read or written is. The
    > exception to this is when generating URIs from strings, for which there are
    > a variety of escape methods (escape, unescape, encodeURI,
    > encodeURIComponent, etc.). What I'm getting at here is: there is no data
    > type or methods for manipulating bytes or character encodings. There is no
    > JavaScript equivalent to the C char* or Java byte. There is no way that I'm
    > aware of to write a UTF-8 encoder or decoder (i.e. code that converts a
    > String to a UTF-8 byte sequence in an object or vice versa). There are
    > plenty of ways to put Strings into a UTF-8 file (or read from a UTF-8 file).
    >
    > There is usually something (else) wrong when a developer is trying to do
    > this in JavaScript.
    >
    > Pavils: what is it you are trying to do that you think requires you to
    > encode or decode UTF-8?



    This archive was generated by hypermail 2.1.5 : Thu Jun 22 2006 - 08:41:52 CDT