Re: Invalid UTF-8 sequences

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 08 2004 - 16:01:36 CST

Next message: Azzedine Ait Khelifa: "IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again."

Previous message: D. Starner: "Re: Nicest UTF"
In reply to: Lars Kristan: "RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
Next in thread: Kenneth Whistler: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"

Lars Kristan <lars.kristan@hermes.si> writes:

> Quite close. Except for the fact that:
> * U+EE93 is represented in UTF-32 as 0x0000EE93
> * U+EE93 is represented in UTF-16 as 0xEE93
> * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93)

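Then it would be impossible to represent sequences like
U+EEEE U+EEBA U+EE93 in UTF-8: they would be encoded as the bytes
0xEE 0xBA 0x93, which a standard UTF-8 decoder reads as the single
code point U+EE93. The conversion UTF-32 -> UTF-8 -> UTF-32 would
not round-trip.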

Concatenation of UTF-8-encoded strings would not be equivalent to
UTF-8-encoding of the concatenation of code points.

This is broken.
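
The two objections above can be checked with a small sketch. It is only
an illustration under stated assumptions: the function names
encode_proposed and decode_proposed are hypothetical, the escape range
U+EE80..U+EEFF is assumed by analogy with the quoted U+EE93 <-> 0x93
example, and the decoder is assumed to accept well-formed UTF-8 as usual
and to escape only bytes that are not part of a well-formed sequence.

```python
# Sketch of the proposed scheme (names and escape range are assumptions).
ESC_LO, ESC_HI = 0xEE80, 0xEEFF  # assumed range of "escaped byte" code points

def encode_proposed(code_points):
    """Encode code points; escape code points become one raw byte each."""
    out = bytearray()
    for cp in code_points:
        if ESC_LO <= cp <= ESC_HI:
            out.append(cp & 0xFF)            # e.g. U+EE93 -> 0x93
        else:
            out += chr(cp).encode("utf-8")   # ordinary UTF-8
    return bytes(out)

def decode_proposed(data):
    """Decode bytes; bytes outside any well-formed sequence are escaped."""
    cps = []
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):               # try the shortest sequence first
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            cps.append(ord(ch))
            i += n
            break
        else:
            cps.append(0xEE00 + data[i])     # e.g. 0x93 -> U+EE93
            i += 1
    return cps

a = [0xEEEE]
b = [0xEEBA, 0xEE93]

# Each string survives a round trip on its own...
assert decode_proposed(encode_proposed(a)) == a
assert decode_proposed(encode_proposed(b)) == b

# ...but the encoded concatenation is the ordinary UTF-8 for U+EE93,
# so three code points come back as one.
joined = encode_proposed(a) + encode_proposed(b)
assert joined == encode_proposed(a + b) == b"\xee\xba\x93"
assert decode_proposed(joined) == [0xEE93]
```

Under these assumptions each piece round-trips in isolation, yet the
bytes of the concatenation no longer stand for the concatenation of the
code points. Making the encoder context-sensitive to avoid this would
instead break the equality between encoding the parts and encoding the
whole.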

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

