Re: please review the paper for me

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Feb 25 2003 - 01:14:53 EST


    Yung-Fong Tang <ftang@netscape.com> wrote:

    > I am working on several projects which need to validate UTF-8 text.
    > Some people outside my company also ask me to update the UTF-8
    > validation code to reflect the changes introduced in Unicode 3.1
    > and 3.2.

    I am still puzzled by claims that there have been substantial "changes"
    to UTF-8, especially the claim that the restriction against
    non-shortest-form UTF-8 is something new. Even the original description
    of "FSS-UTF" in Unicode 1.1 (1993) stated, "When there are multiple ways
    to encode a value, for example U+0000, only the shortest encoding is
    legal."

    Likewise, ever since the surrogate code point range was designated in
    Unicode 2.0, it has been invalid (or at least nonsensical) to encode
    values from U+D800 through U+DFFF directly in UTF-8. And the
    restriction against "5- and 6-byte encodings" is just an artifact of the
    code points above U+10FFFF being permanently reserved. It's never been
    allowable to encode invalid code points.

    The pre-3.2 distinction between "illegal" and "irregular" UTF-8
    sequences was a strange bird. Basically it was forbidden to create
    "irregular" sequences, but OK to interpret them if you found any. This
    is like saying it's illegal to sell drugs but legal to buy them.
    Unicode 3.1 and 3.2 simply closed this odd loophole.

    I found Frank's state machine interesting, but I generally find it
    easier to check for valid UTF-8 by decoding all valid sequences and
    checking each character thus decoded to ensure it falls in the
    appropriate range:

    1 byte: U+0000 through U+007F
    2 bytes: U+0080 through U+07FF
    3 bytes: U+0800 through U+D7FF, U+E000 through U+FFFD
    4 bytes: U+10000 through U+10FFFD (excluding all U+xFFFE and U+xFFFF)

    Then the only two failure conditions are (1) invalid sequence and (2)
    character out of range.
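
    A minimal Python sketch of the decode-then-range-check approach
    described above. The names RANGES, in_range and utf8_failure are
    hypothetical, chosen only for illustration; the ranges are copied from
    the table in this message, not from any particular library.

```python
# Allowed character ranges per encoded length, copied from the table above.
RANGES = {
    1: [(0x0000, 0x007F)],
    2: [(0x0080, 0x07FF)],
    3: [(0x0800, 0xD7FF), (0xE000, 0xFFFD)],
    4: [(0x10000, 0x10FFFD)],
}


def in_range(length, cp):
    """Return True if cp is allowed for a sequence of the given length."""
    # The table excludes U+xFFFE and U+xFFFF in the 4-byte range.
    if length == 4 and (cp & 0xFFFF) >= 0xFFFE:
        return False
    return any(lo <= cp <= hi for lo, hi in RANGES[length])


def utf8_failure(data):
    """Return None if data passes, else one of the two failure conditions."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # The lead byte fixes the sequence length and the first payload bits.
        if b0 < 0x80:
            length, cp = 1, b0
        elif 0xC0 <= b0 <= 0xDF:
            length, cp = 2, b0 & 0x1F
        elif 0xE0 <= b0 <= 0xEF:
            length, cp = 3, b0 & 0x0F
        elif 0xF0 <= b0 <= 0xF7:
            length, cp = 4, b0 & 0x07
        else:
            # Stray trail byte, or a lead byte for a 5- or 6-byte form.
            return "invalid sequence"

        trail = data[i + 1:i + length]
        # Every trail byte must be present and look like 10xxxxxx.
        if len(trail) != length - 1 or any((b & 0xC0) != 0x80 for b in trail):
            return "invalid sequence"
        for b in trail:
            cp = (cp << 6) | (b & 0x3F)

        # Non-shortest forms, surrogates and values above U+10FFFD all
        # decode to a value outside the range allowed for this length.
        if not in_range(length, cp):
            return "character out of range"
        i += length
    return None
```

    With this structure, a non-shortest form such as C0 80, a directly
    encoded surrogate such as ED A0 80, and a value past the end of the
    code space such as F4 90 80 80 need no special cases: each decodes
    cleanly and is then rejected by the range check.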

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


