RE: Surrogate pairs and UTF-8

From: Addison Phillips (addison@yahoo-inc.com)
Date: Thu Jun 22 2006 - 09:46:27 CDT


    > There are cases where such JavaScript conversion is needed
    > and it's perfectly possible (and in fact easy) to convert
    > between the natural JavaScript encoding, as seen in
    > string.length, string.charCodeAt(), or string.indexOf(...),
    > and UTF-8.

    Okay, I concede that one can convert a JavaScript String's code points in an
    attempt to un-mojibake it... except please note what I said:

    > There is usually something (else) wrong when a developer is
    > trying to do this in JavaScript.

    That is, yes, one can attempt to fix one's data that way. It does rely on
    the content being interpreted as ISO 8859-1, though, and if there are bytes
    in the 0x80 through 0x9F range, a lot of user-agents are going to interpret
    them as windows-1252, even if the content is labelled as 8859-1.
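
To make the caveat concrete, here is a minimal sketch of the kind of repair being discussed. It assumes a modern runtime with `TextDecoder` (which postdates this message), and `unMojibake` is a hypothetical name chosen for illustration. It only works when every code unit is at most 0xFF, that is, when the UTF-8 bytes really were read as ISO 8859-1.

```javascript
// Hypothetical helper: treats each UTF-16 code unit of `s` as one byte
// (the ISO 8859-1 interpretation) and re-decodes those bytes as UTF-8.
function unMojibake(s) {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit > 0xFF) {
      // Not a Latin-1 byte value. This happens when the user-agent applied
      // windows-1252 instead, e.g. byte 0x84 became U+201E.
      throw new RangeError("code unit above 0xFF at index " + i);
    }
    bytes[i] = unit;
  }
  // fatal: true makes malformed UTF-8 throw instead of yielding U+FFFD.
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}
```

The UTF-8 form of U+1D11E is F0 9D 84 9E, so three of its four bytes fall in the 0x80 through 0x9F range. Read as true ISO 8859-1, the string round-trips to a surrogate pair; read as windows-1252, the same bytes arrive as characters above 0xFF and the sketch refuses them rather than guessing.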

    > Such conversion is useful when JavaScript will be
    > used to generate documents, or some responses to servers
    > handling only the UTF-8 encoding in some specific protocol
    > (for example when you need to compute binary signatures, or
    > the encoded length in some part of this protocol).

    Uh... why not assemble a String containing the text and then set the
    Content-Type of the document to the desired encoding (UTF-8 in this case)?
    Assembling documents in UTF-8 via manual conversion is not necessary. And it
    is prone to error.
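
For the narrower cases in the quoted text (a signature or an encoded length), a sketch of leaving the encoding to the platform rather than shifting bits by hand might look like the following. It assumes a runtime that provides `TextEncoder`, which did not exist when this was written; `utf8Bytes` and `utf8Length` are hypothetical names.

```javascript
// Encode with the platform's encoder instead of a hand-written one.
// TextEncoder always produces UTF-8.
function utf8Bytes(text) {
  return new TextEncoder().encode(text);
}

// Encoded length in bytes, e.g. for a length field in a protocol.
function utf8Length(text) {
  return utf8Bytes(text).length;
}
```

A surrogate pair such as "\uD834\uDD1E" has a `length` of 2 in JavaScript but encodes to 4 UTF-8 bytes. An unpaired surrogate is replaced by U+FFFD (3 bytes) by the encoder, which is one of the edge cases a manual converter tends to get wrong.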

    > There's no guarantee that the JavaScript string
    > will be preserved on output when it is sent to a stream
    > using a charset not completely covering the UCS.
    > (Normally such conversion from the JavaScript internal
    > encoding of strings to another encoding is performed by the
    > stream object, according to its settings properties, for
    > example an HTTP or MIME message object where you can set the
    > charset used for encoding/decoding their stream).

    Yes. That's exactly what I said. Hence: what are you writing an
    encoder/decoder for?

    Addison

    Addison Phillips
    Internationalization Architect - Yahoo! Inc.

    Internationalization is an architecture.
    It is not a feature.

    > -----Original Message-----
    > From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
    > Sent: jeudi 22 juin 2006 06:16
    > To: Addison Phillips; unicode@unicode.org
    > Subject: Re: Surrogate pairs and UTF-8


