Re: UTF8 and COntrol Characters

From: YTang0648@aol.com
Date: Wed Nov 05 2003 - 13:53:57 EST


    I think verdy_p's message below is not very clear. I think I know what he
    means, but the message itself needs some clarification.

    In a message dated 11/5/2003 3:39:09 AM Pacific Standard Time,
    verdy_p@wanadoo.fr writes:
    From: "Abdij Bhat" <Abdij.Bhat@kshema.com>
    > If a Unicode string is converted to UTF-8, will the UTF-8 encoded string
    > contain any control characters or escape sequences? If so, is it possible
    > to eliminate them?

    UTF-8 sequences will not contain any C0 control bytes,
    I think you should say
    "UTF-8 sequences will not use the C0 control code area (0x00-0x1F) to
    represent other characters" instead of "UTF-8 sequences will not contain any
    C0 control bytes," because it is legal to have C0 control codes inside UTF-8.
    For example, TAB, CR, and LF are all in the C0 area and perfectly legal in UTF-8.
    but it will in many
    cases contain C1 control bytes (between 0x80 and 0x9F).
    I think the right way to say it is "UTF-8 may use bytes 0x80 to 0x9F as part
    of a multi-byte UTF-8 sequence for a single Unicode character, and those
    byte values are defined as the C1 control area." Therefore, code sequences
    using bytes 0x80 to 0x9F should not be inserted into a UTF-8 stream, but could
    be inserted into a UTF-16 stream (by using the two-byte values 0x0080 - 0x009F).
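
    To make the two points above concrete, here is a minimal Python sketch. The
    sample string and the choice of U+20AC EURO SIGN are my own illustration, not
    something from the thread. It shows C0 controls passing through UTF-8 as
    single bytes, and a byte in the 0x80-0x9F range appearing inside an ordinary
    multi-byte sequence.

```python
# C0 controls (TAB, CR, LF) are encoded as themselves, one byte each.
text = "a\tb\r\n"
encoded = text.encode("utf-8")
c0_bytes = [b for b in encoded if b < 0x20]

# U+20AC EURO SIGN encodes as E2 82 AC; the middle byte 0x82 falls in the
# 0x80-0x9F range even though the character is not a control character.
euro = "\u20ac".encode("utf-8")
c1_range_bytes = [b for b in euro if 0x80 <= b <= 0x9F]

print(encoded)
print([hex(b) for b in c0_bytes])
print(euro.hex(), [hex(b) for b in c1_range_bytes])
```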

    UTF-8 keeps all 7-bit ASCII characters unchanged and does not create any
    sequence of bytes containing them for non 7-bit ASCII characters (all
    UTF-8 sequences for non-ASCII characters are made of bytes >= 0x80). UTF-8
    will then never create any escape sequence.
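
    The property described above can be checked with a short Python sketch. The
    helper name utf8_bytes_by_kind and the sample string are hypothetical, chosen
    only for illustration. It checks that ASCII characters encode to themselves
    and that every byte produced for a non-ASCII character is >= 0x80, so the
    encoder never introduces an ESC (0x1B) byte that was not already in the text.

```python
def utf8_bytes_by_kind(text):
    """Return (ascii_unchanged, non_ascii_all_high) for the given text."""
    ascii_unchanged = True
    non_ascii_all_high = True
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # ASCII characters must encode to the identical single byte.
            ascii_unchanged = ascii_unchanged and encoded == bytes([ord(ch)])
        else:
            # Every byte of a multi-byte sequence must have the high bit set.
            non_ascii_all_high = non_ascii_all_high and all(b >= 0x80 for b in encoded)
    return ascii_unchanged, non_ascii_all_high

# Sample mixing a literal ESC, Latin, Thai, and a supplementary-plane character.
sample = "ESC=\x1b caf\u00e9 \u0e44\u0e17\u0e22 \U0001f600"
result = utf8_bytes_by_kind(sample)

# The only ESC byte in the output is the one that was in the input.
esc_count = sample.encode("utf-8").count(b"\x1b")
print(result, esc_count)
```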

    But be warned that you should not create escape sequences containing
    bytes >= 0x80 after the leading escape (in this case, they may conflict
    with a UTF-8 decoder). If your escape sequences are made only of 7-bit
    ASCII bytes, then this is safe, and you can mix plain-text ASCII, C0
    controls, escape sequences and UTF-8 sequences for non-ASCII characters.
    Not only "You should not create escape sequences containing bytes >= 0x80
    after the leading escape" but also "You should not create escape sequences
    containing bytes >= 0x80 as the leading escape".
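
    A rough Python sketch of why both restrictions matter. The helper
    is_valid_utf8 and the specific byte strings are my own assumptions for
    illustration: ESC [ 1 m stands in for a 7-bit escape sequence, and a bare
    0x9B stands in for an 8-bit leading escape. A strict UTF-8 decoder accepts
    the 7-bit form mixed with UTF-8 text and rejects the other two.

```python
def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

payload = "caf\u00e9".encode("utf-8")

# Escape sequences made only of 7-bit bytes mix safely with UTF-8 text.
seven_bit = b"\x1b[1m" + payload + b"\x1b[0m"

# A byte >= 0x80 used as the leading escape is a stray continuation byte.
eight_bit_lead = b"\x9b1m" + payload

# A byte >= 0x80 after a 7-bit leading escape breaks the stream the same way.
high_after_esc = b"\x1b\x85" + payload

print(is_valid_utf8(seven_bit))
print(is_valid_utf8(eight_bit_lead))
print(is_valid_utf8(high_after_esc))
```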
    Note that C1 controls of Unicode and ISO-8859-* will be converted to a pair
    of bytes in UTF-8, with the first byte being 0xC2, and the second byte
    varying between 0x80 and 0x9F (so C1 controls will appear in UTF-8 with a
    0xC2 "prefix" before the same byte when encoding them with ISO-8859-*)
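
    The 0xC2 "prefix" rule above can be verified for the whole C1 range with a
    few lines of Python. This is only a sketch: the function name c1_encodings is
    hypothetical, and Python's "latin-1" codec is used here as a stand-in for
    ISO-8859-1.

```python
def c1_encodings():
    """Return (code point, ISO-8859-1 bytes, UTF-8 bytes) for U+0080..U+009F."""
    rows = []
    for cp in range(0x80, 0xA0):
        ch = chr(cp)
        rows.append((cp, ch.encode("latin-1"), ch.encode("utf-8")))
    return rows

rows = c1_encodings()

# Each C1 control is a single byte in ISO-8859-1 and 0xC2 plus that same
# byte in UTF-8.
prefix_rule_holds = all(utf8 == b"\xc2" + latin1 for _, latin1, utf8 in rows)

print(prefix_rule_holds)
print(rows[5][2].hex())  # U+0085 (NEL) as UTF-8
```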

    ==================================
    Frank Yung-Fong Tang
    System Architect, Itrntinl Dvlpmet, AOL Intrtv Srvies
    AIM:yungfongta mailto:ytang0648@aol.com Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan

    John 3:16 "For God so loved the world that he gave his one and only Son, that
    whoever believes in him shall not perish but have eternal life.



