Re: Invalid UTF-8 sequences

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 08 2004 - 16:01:36 CST

    Lars Kristan <lars.kristan@hermes.si> writes:

    > Quite close. Except for the fact that:
    > * U+EE93 is represented in UTF-32 as 0x0000EE93
    > * U+EE93 is represented in UTF-16 as 0xEE93
    > * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93)

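    Then it would be impossible to represent sequences like
    U+EEEE U+EEBA U+EE93 in UTF-8: under that rule they would encode as
    the bytes 0xEE 0xBA 0x93, which already decode as the single code
    point U+EE93. Conversion UTF-32 -> UTF-8 -> UTF-32 would therefore
    not round-trip.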

    Concatenating UTF-8-encoded strings would no longer be equivalent to
    UTF-8-encoding the concatenation of their code points.

    This is broken.
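
To make the two objections concrete, here is a minimal Python sketch of the quoted rule. It rests on assumptions that are not in the thread: the escape range is taken to be U+EE80..U+EEFF (the quote only gives U+EE93), the decoder is assumed to map every undecodable byte 0xNN to U+EENN, and the names proposed_encode, proposed_decode and "ee_escape" are made up for illustration.

```python
import codecs

# Assumption: byte 0xNN <-> code point U+EENN for NN in 0x80..0xFF.
ESCAPE_BASE = 0xEE00


def _escape_invalid_bytes(err):
    """Decode error handler: map each undecodable byte 0xNN to U+EENN."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(ESCAPE_BASE + b) for b in bad), err.end


codecs.register_error("ee_escape", _escape_invalid_bytes)


def proposed_encode(code_points):
    """Encode as UTF-8, except that U+EE80..U+EEFF become one raw byte."""
    out = bytearray()
    for cp in code_points:
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - ESCAPE_BASE)      # hypothetical escape rule
        else:
            out += chr(cp).encode("utf-8")    # ordinary UTF-8
    return bytes(out)


def proposed_decode(data):
    """Decode UTF-8, turning invalid bytes into escape code points."""
    return [ord(ch) for ch in data.decode("utf-8", "ee_escape")]


# Round trip: three code points go in, one comes out.
original = [0xEEEE, 0xEEBA, 0xEE93]
encoded = proposed_encode(original)
round_tripped = proposed_decode(encoded)
print(encoded.hex())                              # eeba93
print([f"U+{cp:04X}" for cp in round_tripped])    # ['U+EE93']

# Concatenation: decoding the pieces differs from decoding the whole.
a, b = b"\xee", b"\xba\x93"
print(proposed_decode(a) + proposed_decode(b) == proposed_decode(a + b))  # False
```

The bytes 0xEE 0xBA 0x93 are well-formed UTF-8 for U+EE93, so a decoder has no way to tell that they were meant as three escaped bytes. The concatenation check is one possible reading of the second objection: each piece decodes to escape code points on its own, but the joined bytes decode to something else.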

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

