RE: Surrogate pairs and UTF-8

From: Addison Phillips (addison@yahoo-inc.com)
Date: Thu Jun 22 2006 - 09:46:27 CDT

> There are cases where such JavaScript conversion is needed
> and it's perfectly possible (and in fact easy) to convert
> between the natural JavaScript encoding, as seen in
> string.length, string.charCodeAt(), or string.indexOf(...),
> and UTF-8.

Okay, I concede that one can convert a JavaScript String's code points in an
attempt to un-mojibake it... except please note what I said:

> There is usually something (else) wrong when a developer is
> trying to do this in JavaScript.

That is, yes, one can attempt to fix one's data that way. It does rely on
the content being interpreted as 8859-1, though, and if there are bytes in
the 0x80 through 0x9F range, a lot of user-agents are going to interpret
them as windows-1252, even if the content is labelled as 8859-1.
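
To make that dependency concrete, here is a minimal sketch of the kind of
repair being discussed. The function name unMojibake is hypothetical, and
the sketch assumes every code unit in the input is a byte value (0x00 to
0xFF), i.e. that the UTF-8 bytes really were read as 8859-1. It is an
illustration under those assumptions, not a hardened decoder.

```javascript
// Hypothetical helper: treats each UTF-16 code unit of `s` as one byte
// and decodes the resulting byte sequence as UTF-8.
function unMojibake(s) {
  // Smallest code point allowed for 0, 1, 2, 3 continuation bytes
  // (used to reject overlong encodings).
  const MIN = [0, 0x80, 0x800, 0x10000];
  let out = "";
  for (let i = 0; i < s.length; ) {
    const lead = s.charCodeAt(i++);
    // Assumption: the text was decoded as 8859-1, so every unit is a byte.
    if (lead > 0xFF) throw new RangeError("not a byte value at index " + (i - 1));
    let extra, cp;
    if (lead < 0x80) { extra = 0; cp = lead; }
    else if ((lead & 0xE0) === 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) === 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) === 0xF0) { extra = 3; cp = lead & 0x07; }
    else throw new RangeError("invalid lead byte at index " + (i - 1));
    for (let k = 0; k < extra; k++) {
      const b = s.charCodeAt(i++); // NaN past the end, which fails the test below
      if (!(b >= 0x80 && b <= 0xBF)) throw new RangeError("invalid continuation byte");
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < MIN[extra] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
      throw new RangeError("overlong or out-of-range sequence");
    }
    // Code points above U+FFFF come back as a surrogate pair.
    out += String.fromCodePoint(cp);
  }
  return out;
}
```

With this sketch, the euro sign's UTF-8 bytes E2 82 AC are recovered when
they arrive as "\u00E2\u0082\u00AC" (a true 8859-1 reading). If the
user-agent applied windows-1252 instead, the same bytes arrive as
"\u00E2\u201A\u00AC" and the function throws, because U+201A is no longer a
byte value.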

> Such conversion is useful when JavaScript will be
> used to generate documents, or some responses to servers
> handling only the UTF-8 encoding in some specific protocol
> (for example when you need to compute binary signatures, or
> the encoded length in some part of this protocol).

Uh... why not assemble a String containing the text and then set the
Content-Type of the document to the desired encoding (UTF-8 in this case)?
Assembling documents in UTF-8 via manual conversion is not necessary. And it
is prone to error.
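
For the encoded-length case mentioned in the quoted text, a sketch of
letting the platform do the encoding rather than hand-rolling it. This
assumes an environment that provides the standard TextEncoder interface
(current browsers and Node.js do; it postdates this message). The function
name utf8ByteLength is hypothetical.

```javascript
// TextEncoder always produces UTF-8; there is no charset to choose.
const encoder = new TextEncoder();

// Hypothetical helper: number of bytes `text` occupies once encoded as UTF-8.
function utf8ByteLength(text) {
  // A well-formed surrogate pair is encoded as one 4-byte sequence.
  // A lone surrogate is replaced by U+FFFD (3 bytes) before encoding.
  return encoder.encode(text).length;
}

// "a", e-acute, euro sign, and U+1D11E (stored as a surrogate pair).
const sample = "a\u00E9\u20AC\uD834\uDD1E";
const units = sample.length;          // 5 UTF-16 code units
const bytes = utf8ByteLength(sample); // 1 + 2 + 3 + 4 = 10 bytes
```

The difference between the two counts is the point: String length counts
UTF-16 code units, so neither it nor a per-code-unit conversion gives the
UTF-8 length unless surrogate pairs are handled as single characters.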

> There's no guarantee that the JavaScript string
> will be preserved on output when it is sent to a stream using
> a charset not completely covering the UCS.
> (Normally such conversion from the JavaScript internal
> encoding of strings to another encoding is performed by the
> stream object, according to its settings properties, for
> example an HTTP or MIME message object where you can set the
> charset used for encoding/decoding their stream).

Yes. That's exactly what I said. Hence: what are you writing an
encoder/decoder for?

Addison

Addison Phillips
Internationalization Architect - Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
> Sent: Thursday, June 22, 2006 06:16
> To: Addison Phillips; unicode@unicode.org
> Subject: Re: Surrogate pairs and UTF-8
