Re: UTF-8 ill-formed question from Asmus Freytag on 2012-12-11 (Unicode Mail List Archive)

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Tue, 11 Dec 2012 12:31:57 -0800

On 12/11/2012 11:50 AM, vanisaac_at_boil.afraid.org wrote:
> From: James Lin <James_Lin_at_symantec.com>
>> Hi
>> Does anyone know why ill-formed sequences occur in UTF-8? Other than the
>> fact that they don't follow the pattern of UTF-8 byte sequences, I am just
>> wondering how or why.
>> If I have a code point: U+4E8C or "二"
>> In UTF-8, it's "E4 BA 8C" while in UTF-16, it's "4E8C". Where does this "BA"
>> come from?
>>
>> thanks
>> -James
> Each of the UTF encodings represents the binary data in a different way. So we
> need to break the scalar value, U+4E8C, into its binary representation before
> we proceed.
>
> 4E8C -> 0100 1110 1000 1100
>
> Then, we need to look up the rules for UTF-8. They state that code points
> from U+0800 to U+FFFF are encoded with three bytes, in the form 1110xxxx
> 10xxxxxx 10xxxxxx. So, plugging in our data, we get:
>
>         4      E    8     C
>       0100   1110 10-00 1100
>       ||||   ||||//   \\||||
> + 1110xxxx 10xxxxxx 10xxxxxx
>
> = 11100100 10111010 10001100
> or  E   4    B   A    8   C
>
> -Van Anderson
>
Nice!
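
For anyone who wants to check the arithmetic by machine, the bit shuffling above can be sketched in a few lines of Python. This is only an illustration of the three-byte case walked through in the quoted message; the function name encode_three_byte is made up for this example, and the surrogate check is an extra guard that the message above does not discuss.

```python
def encode_three_byte(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as 1110xxxx 10xxxxxx 10xxxxxx.

    Hypothetical helper for illustration; it covers the three-byte case only.
    """
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("this sketch only handles U+0800..U+FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # Extra guard (not from the thread): surrogates are not scalar values.
        raise ValueError("surrogate code points have no UTF-8 encoding")
    b1 = 0b11100000 | (cp >> 12)           # lead byte: top 4 bits
    b2 = 0b10000000 | ((cp >> 6) & 0x3F)   # continuation: middle 6 bits
    b3 = 0b10000000 | (cp & 0x3F)          # continuation: low 6 bits
    return bytes([b1, b2, b3])


encoded = encode_three_byte(0x4E8C)
print(" ".join(f"{b:02X}" for b in encoded))  # E4 BA 8C
print(encoded == "二".encode("utf-8"))         # True
```

The "BA" is just 10 followed by the middle six bits 111010 of 4E8C, which is why it appears in no single hex digit of the UTF-16 form.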

A./

PS: I fixed a missing "\"
Received on Tue Dec 11 2012 - 14:33:59 CST
