Re: need help understanding diacritical encoding

From: jon@spin.ie
Date: Thu Sep 25 2003 - 06:59:46 EDT

Next message: jon@spin.ie: "Re: Unicode Normalisaton Optimisation Experiments"

Previous message: Peter Kirk: "Re: Unicode Normalisaton Optimisation Experiments"
Maybe in reply to: Steve Pruitt: "need help understanding diacritical encoding"

> I have a form that posts diacritical characters. For example, when my browser
> has the encoding set to utf-8 and the form posts the character É
> the post data has these two bytes C3 and 89, which when echoed back on a new
> page is displayed as Ã?. Can someone explain when the character is converted
> to two bytes how I get C3 and 89?
>

UTF-8 is explained in section 3.9 of the Unicode standard and elsewhere. RFC 2279 is a heavily-referenced document, but note that its description includes the encoding of codepoints outside of the Unicode range.

É is U+00C9 and in binary that is:

0000000011001001

UTF-8 encoding produces a different number of bytes depending on how many significant bits are left once you remove the leading zeros (8 bits in this case, which needs two bytes).

It then distributes the bits of the codepoint across the bytes as follows:

00000000 0xxxxxxx -> 0xxxxxxx
00000yyy yyxxxxxx -> 110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx -> 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

In the case of U+00C9 the second of these is the shortest form possible, so it is used. The bits 00011 are placed in 110yyyyy to give you 11000011 (0xC3) and the bits 001001 are placed in 10xxxxxx to give you 10001001 (0x89).
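
To make the bit shuffling concrete, here is a minimal Python sketch of the table above. The function name utf8_encode is made up for illustration, and real code would simply call str.encode("utf-8"); this version only exists to show where each group of bits ends up.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single codepoint by hand, one branch per row of the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if cp < 0x80:
        # up to 7 significant bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # up to 11 significant bits: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # up to 16 significant bits: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 significant bits: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


encoded = utf8_encode(0x00C9)              # U+00C9, the character É
print(" ".join(f"{b:02X}" for b in encoded))  # prints: C3 89
```

Each branch picks the shortest form that can hold the codepoint, which is why U+00C9 (8 significant bits) lands in the two-byte row.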

The problem is that the reverse didn't happen when the bytes went back out again. Rather, the bytes were interpreted as being part of a string encoded in some other way (most likely ISO 8859-1, which certainly would produce Ã followed by a control character from those bytes). It may be that all you need to do is to report the encoding correctly, by sending an HTTP header giving the MIME type and charset (some server-side APIs make this easy, e.g. in ASP you would use Response.Charset = "utf-8"). It may be that you need to do further work (depending on just what it is you are doing with the form).
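
The misreading is easy to reproduce. The sketch below is in Python purely for illustration (I don't know what your server side is written in); it decodes the same two bytes both ways. The header value at the end is the usual form for an HTML page and is an assumption about what your page should send, not something taken from your setup.

```python
data = bytes([0xC3, 0x89])  # the two bytes from the form post

# Read as UTF-8, the two bytes form the single character U+00C9.
as_utf8 = data.decode("utf-8")

# Read as ISO 8859-1, every byte is its own character:
# 0xC3 is Ã and 0x89 is a C1 control character.
as_latin1 = data.decode("iso-8859-1")

print([f"U+{ord(c):04X}" for c in as_utf8])    # prints: ['U+00C9']
print([f"U+{ord(c):04X}" for c in as_latin1])  # prints: ['U+00C3', 'U+0089']

# Assumed header for an HTML response that echoes the value back as UTF-8.
content_type_header = "Content-Type: text/html; charset=utf-8"
```

The browser has no glyph for the control character U+0089, which would explain why the second character shows up as a question mark.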


This archive was generated by hypermail 2.1.5 : Thu Sep 25 2003 - 07:44:38 EDT