Re: please review the paper for me

From: Yung-Fong Tang (ftang@netscape.com)
Date: Tue Feb 25 2003 - 17:33:09 EST

    Doug Ewell wrote:

    >Yung-Fong Tang <ftang@netscape.com> wrote:
    >
    >
    >
    >>I am working on several projects which need to validate UTF-8 text.
    >>Some people outside my company have also asked me to update the UTF-8
    >>validation code to reflect the changes introduced in Unicode 3.1
    >>and 3.2.
    >>
    >>
    >
    >I am still puzzled by claims that there have been substantial "changes"
    >to UTF-8, especially the claim that the restriction against
    >non-shortest-form UTF-8 is something new. Even the original description
    >of "FSS-UTF" in Unicode 1.1 (1993) stated, "When there are multiple ways
    >to encode a value, for example U+0000, only the shortest encoding is
    >legal."
    >
    Unfortunately, FSS-UTF in Unicode 1.1 IS NOT UTF-8. Most people refer
    to UTF-8 by looking at RFC 2279
    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html
    and RFC 2044 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2044.html
    but in those two RFCs, where the decoding process is described, there
    is no mention of checking for the non-shortest form.
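
    To make the point concrete, here is a minimal sketch (Python, with
    hypothetical function names of my own) of a 2-octet decoder that only
    shuffles bits, next to one that adds a minimum-value check. It is an
    illustration of the issue, not code taken from any RFC or from the
    Unicode text.

```python
def decode_two_byte_naive(b0, b1):
    """Decode 110xxxxx 10yyyyyy by bit manipulation only (no range check)."""
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a 2-octet pattern")
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def decode_two_byte_strict(b0, b1):
    """Same decoding, but reject values that fit in a single octet."""
    cp = decode_two_byte_naive(b0, b1)
    if cp < 0x80:
        raise ValueError("non-shortest form")
    return cp


# C0 80 is a 2-octet spelling of U+0000; the naive decoder accepts it.
print(hex(decode_two_byte_naive(0xC0, 0x80)))   # 0x0
print(hex(decode_two_byte_strict(0xC2, 0x80)))  # 0x80
```

    A decoder written from the bit patterns alone behaves like the first
    function, which is why the shortest-form rule has to be stated
    explicitly.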

    >
    >Likewise, ever since the surrogate code point range was designated in
    >Unicode 2.0, it has been invalid (or at least nonsensical) to encode
    >values from U+D800 through U+DFFF directly in UTF-8.
    >
    Again, RFC 2279 is the one people look at when they refer to UTF-8. And
    the decoding process described there does not mention checking for the
    range which maps directly to D800-DFFF.
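
    As a rough sketch (Python; the function name and the
    reject_surrogates flag are hypothetical), a 3-octet decoder will
    produce D800-DFFF unless somebody adds an explicit test for it. The
    shortest-form check is deliberately left out here to keep the
    example short.

```python
def decode_three_byte(b0, b1, b2, reject_surrogates):
    """Decode 1110xxxx 10yyyyyy 10zzzzzz; optionally refuse D800-DFFF."""
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 3-octet pattern")
    cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    if reject_surrogates and 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point U+%04X" % cp)
    return cp


# ED A0 80 maps directly to D800 when no surrogate check is made.
print(hex(decode_three_byte(0xED, 0xA0, 0x80, reject_surrogates=False)))  # 0xd800
```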

    >And the
    >restriction against "5- and 6-byte encodings" is just an artifact of the
    >code points above U+10FFFF being permanently reserved. It's never been
    >allowable to encode invalid code points.
    >
    RFC 2279 clearly says at the beginning of "2 UTF-8 definition":
    "In UTF-8, characters are encoded using sequences of 1 to 6 octets." If
    people refer to RFC 2279 for the UTF-8 definition, it is CLEAR that a
    sequence of 5 or 6 octets IS a legal UTF-8 sequence.
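
    For illustration, the two readings can be sketched as two lead-octet
    classifiers (Python; both function names are hypothetical). The first
    follows the 1 to 6 octet bit patterns of RFC 2279; the second assumes
    the later restriction to code points up to U+10FFFF.

```python
def rfc2279_sequence_length(lead):
    """Length announced by a lead octet under the 1-to-6-octet patterns."""
    if (lead & 0x80) == 0x00:
        return 1
    if (lead & 0xE0) == 0xC0:
        return 2
    if (lead & 0xF0) == 0xE0:
        return 3
    if (lead & 0xF8) == 0xF0:
        return 4
    if (lead & 0xFC) == 0xF8:
        return 5
    if (lead & 0xFE) == 0xFC:
        return 6
    return 0  # continuation octet (10xxxxxx), or 0xFE / 0xFF


def restricted_sequence_length(lead):
    """Same, but refuse lead octets that can only start values > U+10FFFF."""
    return rfc2279_sequence_length(lead) if lead <= 0xF4 else 0


print(rfc2279_sequence_length(0xF8), restricted_sequence_length(0xF8))  # 5 0
```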

    >
    >The pre-3.2 distinction between "illegal" and "irregular" UTF-8
    >sequences was a strange bird. Basically it was forbidden to create
    >"irregular" sequences, but OK to interpret them if you found any. This
    >is like saying it's illegal to sell drugs but legal to buy them.
    >Unicode 3.1 and 3.2 simply closed this odd loophole.
    >
    Agree.

    >
    >I found Frank's state machine interesting, but I generally find it
    >easier to check for valid UTF-8 by decoding all valid sequences and
    >checking each character thus decoded to ensure it falls in the
    >appropriate range:
    >
    >1 byte: U+0000 through U+007F
    >2 bytes: U+0080 through U+07FF
    >3 bytes: U+0800 through U+D7FF, U+E000 through U+FFFD
    >
    Well... that is another question. Is UTF-8 which represents U+FFFE or
    U+FFFF a legal UTF-8 sequence? If not, that needs to be made very clear
    in Unicode 4.0 or in the new Internet Draft which specifies UTF-8. So
    far I have not found such a claim in the definition of UTF-8 in Unicode
    3.1, 3.2 or RFC 2279. I have not checked the new Internet Draft yet and
    I have no access to the Unicode 4.0 beta text.

    >4 bytes: U+10000 through U+10FFFD (excluding all U+xFFFE and U+xFFFF)
    >
    Why are you excluding U+xFFFE and U+xFFFF? Does any text in Unicode talk
    about that? How should we state that in the UTF-8 section? At least in
    the Unicode 3.2 text, it is not clear that U+FFFE, U+FFFF, U+xFFFE and
    U+xFFFF should be treated as illegal UTF-8. It could be a legal UTF-8
    sequence which encodes an illegal Unicode code point. (Just as you may
    have a valid Base64-encoded file which encodes an illegal GIF file:
    your Base64 is legal, fully conforms to the Base64 decoding logic and
    can be decoded, but the decoded file is not a legal GIF file which
    conforms to the GIF file specification.) Where is the boundary between
    legal UTF-8 and legal Unicode?
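
    One way to picture the two layers is the sketch below (Python; the
    helper names are hypothetical). It relies on Python's built-in
    "utf-8" codec for the encoding layer, which is only one decoder's
    choice of where to draw the line and not something the Unicode 3.2
    text settles. The U+xFFFE / U+xFFFF test is kept as a separate step
    on the decoded code points.

```python
def is_xfffe_or_xffff(cp):
    """True for U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... in every plane."""
    return (cp & 0xFFFE) == 0xFFFE


def check(data):
    """Report which layer, if any, objects to the given octets."""
    try:
        text = data.decode("utf-8")        # encoding layer
    except UnicodeDecodeError:
        return "ill-formed UTF-8"
    if any(is_xfffe_or_xffff(ord(ch)) for ch in text):  # code point layer
        return "well-formed UTF-8, but encodes U+xFFFE or U+xFFFF"
    return "ok"


print(check(b"\xef\xbf\xbe"))  # well-formed UTF-8, but encodes U+xFFFE or U+xFFFF
print(check(b"\xc0\x80"))      # ill-formed UTF-8
```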

    >
    >Then the only two failure conditions are (1) invalid sequence and (2)
    >character out of range.
    >
    >-Doug Ewell
    > Fullerton, California
    > http://users.adelphia.net/~dewell/
    >
    >
    >
    >
    In order to decode all the sequences, you still need a state machine
    anyway. It is just that you may use a different state machine and add
    some more actions to it. The problem is this: if we receive a stream of
    UTF-8 text data with some illegal bytes in it, and the environment
    does not allow you to REJECT such input so that the only choice is to
    skip the illegal part, how many bytes will you skip? Different state
    machines will come out with different results.
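
    A small sketch of the skipping problem (Python; the function names
    and both policies are hypothetical, chosen only to show the
    divergence). One policy drops a single byte on error and tries again;
    the other drops as many bytes as the lead byte announced.

```python
def announced_length(lead):
    """Sequence length a lead byte announces (0 if it is not a lead byte)."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_skipping(data, skip_whole_sequence):
    """Decode, skipping bad input; return (text, number of bytes skipped)."""
    out, skipped, i = [], 0, 0
    while i < len(data):
        n = announced_length(data[i])
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("stray or truncated byte")
            out.append(chunk.decode("utf-8"))
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            step = min(max(n, 1), len(data) - i) if skip_whole_sequence else 1
            skipped += step
            i += step
    return "".join(out), skipped


data = b"A\xe4\xb8B"  # "A", a truncated 3-byte sequence, then "B"
print(decode_skipping(data, skip_whole_sequence=False))  # ('AB', 2)
print(decode_skipping(data, skip_whole_sequence=True))   # ('A', 3)
```

    The second policy swallows the "B" that follows the damaged sequence,
    so the same input yields different text depending on the machine.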


