Re: UTF8 and COntrol Characters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 05 2003 - 06:32:57 EST


    From: "Abdij Bhat" <Abdij.Bhat@kshema.com>
    > If a UNICODE string is converted to UTF8, will the UTF8 encoded string
    > contain any control characters or escape sequences? If so, is it possible
    > to eliminate the same?

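    The multi-byte UTF-8 sequences that encode non-ASCII characters never
    contain C0 control bytes (0x00 to 0x1F), but in many cases they will
    contain bytes in the C1 control range (between 0x80 and 0x9F). C0 controls
    already present in the Unicode string are kept as the same single bytes.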
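
    As a rough illustration of that point, here is a minimal sketch in
    Python (my choice of language; control_byte_report is a hypothetical
    helper, not part of any library). It lists which bytes of an encoded
    string fall in the C0 range and which fall in the 0x80 to 0x9F range.

```python
# Hypothetical helper: report which bytes of the UTF-8 form of a string
# fall in the C0 range (0x00-0x1F) or in the C1 byte range (0x80-0x9F).
def control_byte_report(text: str) -> dict:
    data = text.encode("utf-8")
    return {
        "c0": [b for b in data if b <= 0x1F],
        "c1_range": [b for b in data if 0x80 <= b <= 0x9F],
    }

# U+00C0 encodes as C3 80 and U+20AC as E2 82 AC: the text holds no
# control characters, yet two bytes land in the 0x80-0x9F range.
report = control_byte_report("\u00c0\u20ac")

# A C0 control in the source (here TAB, 0x09) survives as the same byte.
tab_report = control_byte_report("a\tb")
```

    The bytes reported under "c1_range" are only trail bytes of multi-byte
    sequences. They are not C1 control characters, but a tool that reads
    the stream as ISO-8859-* would treat them as such.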

    UTF-8 keeps all 7-bit ASCII characters unchanged and never uses a 7-bit
    ASCII byte when encoding a non-ASCII character (every byte of a multi-byte
    UTF-8 sequence is >= 0x80). So the UTF-8 encoding itself will never create
    an escape sequence that was not already in the Unicode string.
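
    This property can be checked with a short Python sketch (the function
    name is_ascii_transparent and the sample string are my own, chosen only
    for illustration). It verifies, character by character, that ASCII is
    encoded as itself and that everything else uses only bytes >= 0x80.

```python
# Hypothetical check: ASCII characters map to the same single byte, and
# non-ASCII characters are encoded using only bytes >= 0x80.
def is_ascii_transparent(text: str) -> bool:
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            if encoded != bytes([ord(ch)]):
                return False
        elif any(b < 0x80 for b in encoded):
            return False
    return True

sample = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e"
transparent = is_ascii_transparent(sample)

# No ESC byte (0x1B) appears, because the source string has no ESC.
has_esc = 0x1B in sample.encode("utf-8")
```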

    But be warned that you should not create escape sequences containing bytes
    >= 0x80 after the leading escape (such bytes may conflict with a UTF-8
    decoder). If your escape sequences are made only of 7-bit ASCII bytes,
    then this is safe, and you can mix plain-text ASCII, C0 controls, escape
    sequences and UTF-8 sequences for non-ASCII characters.
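
    A small Python sketch of both cases follows. The escape sequences are
    assumptions picked for the example: ESC [ 1 m and ESC [ 0 m are
    ANSI-style sequences made only of 7-bit bytes, while ESC followed by
    0xA9 is an invented sequence with a byte >= 0x80.

```python
# 7-bit-only escape sequences (assumed ANSI-style, for illustration).
ESC_BOLD = b"\x1b[1m"
ESC_RESET = b"\x1b[0m"

# Mixing them with UTF-8 text gives a stream that still decodes cleanly.
stream = ESC_BOLD + "caf\u00e9".encode("utf-8") + ESC_RESET + b"\n"
decoded = stream.decode("utf-8")

# An invented escape sequence with a byte >= 0x80 after the ESC: 0xA9 is
# a UTF-8 trail byte with no lead byte before it, so decoding fails.
bad_stream = b"\x1b\xa9" + "caf\u00e9".encode("utf-8")
try:
    bad_stream.decode("utf-8")
    bad_decodes = True
except UnicodeDecodeError:
    bad_decodes = False
```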

    Note that the C1 controls of Unicode and ISO-8859-* (U+0080 to U+009F) are
    converted to a pair of bytes in UTF-8, the first byte being 0xC2 and the
    second varying between 0x80 and 0x9F. So a C1 control appears in UTF-8 as
    a 0xC2 "prefix" followed by the same byte that encodes it in ISO-8859-*.
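
    The mapping can be confirmed for all 32 C1 controls with a few lines of
    Python. This is only a sketch: the variable names are mine, and
    Python's "latin-1" codec stands in for ISO-8859-* here.

```python
# UTF-8 form of every C1 control, U+0080 through U+009F.
c1_utf8 = {cp: chr(cp).encode("utf-8") for cp in range(0x80, 0xA0)}

# Each one is the pair 0xC2 followed by the code point value itself.
all_prefixed = all(c1_utf8[cp] == bytes([0xC2, cp]) for cp in c1_utf8)

# In the 8-bit encoding the same control is the single byte, so the
# UTF-8 form is that byte with 0xC2 put in front of it.
same_trail_byte = all(
    c1_utf8[cp] == b"\xc2" + chr(cp).encode("latin-1") for cp in c1_utf8
)
```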


