Re: UTF8 and Control Characters

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Nov 05 2003 - 16:39:32 EST


    Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

    > I think verdy_p's message is not very clear below. I think I know what
    > he means, but the message itself needs some clarification.

    And I, in turn, will try to clarify the clarification. :-)

    >>> If a Unicode string is converted to UTF-8, will the UTF-8 encoded
    >>> string contain any control characters or escape sequences? If so, is
    >>> it possible to eliminate them?
    >>
    >> UTF-8 sequences will not contain any C0 control bytes,
    >
    > I think you should say
    > "The UTF-8 sequences will not use the C0 control code area (0x00-0x1F)
    > to represent characters." instead of "UTF-8 sequences will not contain
    > any C0 control bytes," because it is legal to have C0 control codes
    > inside UTF-8; for example, TAB, CR, LF are all in the C0 area and
    > perfectly legal in UTF-8.

    Neither is exactly right. UTF-8 does use the C0 control area, to
    represent C0 control characters. What I think everyone is trying to say
    is that UTF-8 does not use that area for *any other* characters, which
    of course was a basic design goal of UTF-8.
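    A rough way to check that claim, sketched in Python: walk every
    Unicode scalar value and confirm that a byte below 0x20 shows up in
    the UTF-8 form only when the character itself is a C0 control. The
    function name is made up for illustration.

```python
def c0_bytes_only_for_c0_chars() -> bool:
    """Return True if, for every Unicode scalar value, a byte below 0x20
    appears in its UTF-8 encoding exactly when the character is a C0 control."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values and cannot be encoded
        encoded = chr(cp).encode("utf-8")
        has_c0_byte = min(encoded) < 0x20
        if has_c0_byte != (cp < 0x20):
            return False
    return True


print(c0_bytes_only_for_c0_chars())   # True
print("\t".encode("utf-8"))           # b'\t' (TAB stays the single byte 0x09)
```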

    >> but it will in many cases contain C1 control bytes (between 0x80
    >> and 0x9F).
    >
    > I think the right way to say it is "UTF-8 may use bytes 0x80 to 0x9F
    > as part of a multiple-byte UTF-8 byte sequence for a single Unicode
    > character. And those bytes are defined as the C1 control area.
    > Therefore, code sequences with 0x80 and 0x9F should not be inserted
    > into a UTF-8 STREAM, but could be inserted into a UTF-16 STREAM (by
    > using two bytes 0x0080 - 0x009F)."

    No code sequence that is not valid UTF-8 should ever be inserted into a
    UTF-8 stream anyway. I don't see the point of this wording.
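    To make the byte values concrete, here is a small Python sketch. It
    shows a byte in the 0x80-0x9F range turning up as a trail byte of an
    ordinary letter, a real C1 control character (U+0085) in UTF-8 and
    UTF-16, and a lone 0x85 byte being rejected. The helper
    is_valid_utf8 is a hypothetical name, not a library function.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8 (strict mode)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+00C0 is an ordinary letter, yet its second UTF-8 byte is 0x80.
print("\u00c0".encode("utf-8").hex())       # c380

# U+0085 (NEL) is a genuine C1 control character; UTF-8 needs two bytes.
print("\u0085".encode("utf-8").hex())       # c285

# In UTF-16 (big-endian here) the same character is the code unit 0x0085.
print("\u0085".encode("utf-16-be").hex())   # 0085

# A bare 0x85 byte is not valid UTF-8; the two-byte form is.
print(is_valid_utf8(b"abc\x85def"))         # False
print(is_valid_utf8(b"abc\xc2\x85def"))     # True
```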

    >> UTF-8 keeps all 7-bit ASCII characters unchanged and does not create
    >> any sequence of bytes containing them for non-ASCII characters (all
    >> multi-byte UTF-8 sequences are made of bytes >= 0x80). UTF-8 will
    >> then never create any escape sequence.

    But it will retain any existing escape sequence consisting entirely of
    [7-bit] ASCII characters.
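    For instance, in this Python sketch an ANSI-style colour escape
    (chosen here only as a familiar example of an all-ASCII escape
    sequence) survives UTF-8 encoding byte for byte, while the one
    non-ASCII letter in between becomes bytes that are all >= 0x80:

```python
# ESC [ 3 1 m ... ESC [ 0 m wrapped around a word containing U+00EF.
text = "\x1b[31mna\u00efve\x1b[0m"
encoded = text.encode("utf-8")

print(encoded)   # b'\x1b[31mna\xc3\xafve\x1b[0m'

# The escape sequences are carried over unchanged.
print(encoded.startswith(b"\x1b[31m") and encoded.endswith(b"\x1b[0m"))   # True

# Encoding introduced no extra ESC bytes: still exactly two.
print(encoded.count(b"\x1b"))   # 2

# The non-ASCII letter contributes only bytes >= 0x80.
print(all(b >= 0x80 for b in "\u00ef".encode("utf-8")))   # True
```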

    >> But be warned that you should not create escape sequences containing
    >> bytes >= 0x80 after the leading escape (in this case, they may
    >> conflict with a UTF-8 decoder). If your escape sequences are made
    >> only of 7-bit ASCII bytes, then this is safe, and you can mix
    >> plain-text ASCII, C0 controls, escape sequences and UTF-8 sequences
    >> for non-ASCII characters.
    >
    > Not only "You should not create escape sequences containing bytes >=
    > 0x80 after the leading escape" but also "You should not create escape
    > sequences containing bytes >= 0x80 as the leading escape".

    We aren't in a position to tell Abdij not to use the C1 control area.
    It's his app. But for the record, he has stated that he isn't:

    > Yes, the control characters are entirely below 0x20 ASCII.

    so there is neither a problem nor a topic for discussion.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


