RE: Invalid code points

From: Mark Crispin (mrc+unicode@panda.com)
Date: Mon Jun 01 2009 - 10:46:37 CDT

    On Mon, 1 Jun 2009, Phillips, Addison wrote:
    > Uh... the IETF does not define UTF-8. The Unicode Consortium does. But
    > even if you want to build on the IETF documents, RFC 3629 was published
    > six years ago. Basing a new implementation on something published 11
    > years ago and obsolete the last six years? Not a good idea.

    This is true, but within the IETF, specifications are generally upwards
    compatible.

    I think that there are two obvious implementation choices:

    [1] Recognize the sequences for the 0x110000 - 0x7fffffff ranges, never
    generate them, and if a value in that range is encountered treat it as an
    "error" or "not in Unicode" value. This is the traditional IETF
    philosophy.

    [2] Strictly enforce the rules for "well formed UTF-8 byte sequences" on
    page 104 of Unicode 5.0, and reject any string which fails to comply (note
    in particular the requirements for the second byte).
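
    As a rough illustration of choice [2], here is a minimal Python sketch
    of a validator driven by the well-formed byte ranges. The function names
    are hypothetical, and the ranges are written out from the Unicode table
    as I understand it, so check them against the standard before relying on
    this. Note how the lead bytes E0, ED, F0, and F4 narrow the range allowed
    for the second byte.

```python
def _second_byte_rule(lead):
    """Return (trail_count, lo, hi) for a multi-byte lead, or None.

    lo..hi is the range permitted for the SECOND byte only; any further
    trail bytes must be in 0x80..0xBF. (Hypothetical helper name.)
    """
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if lead == 0xE0:
        return 2, 0xA0, 0xBF          # excludes overlong 3-byte forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return 2, 0x80, 0xBF
    if lead == 0xED:
        return 2, 0x80, 0x9F          # excludes surrogates D800..DFFF
    if lead == 0xF0:
        return 3, 0x90, 0xBF          # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return 3, 0x80, 0xBF
    if lead == 0xF4:
        return 3, 0x80, 0x8F          # excludes values above 10FFFF
    return None                       # 80..C1 and F5..FF are never leads


def is_well_formed_utf8(data):
    """Return True only if every sequence in `data` is well formed."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:              # single-byte (ASCII) sequence
            i += 1
            continue
        rule = _second_byte_rule(lead)
        if rule is None:
            return False
        count, lo, hi = rule
        if i + count >= n:            # sequence is truncated
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for b in data[i + 2 : i + 1 + count]:
            if not 0x80 <= b <= 0xBF:
                return False
        i += count + 1
    return True
```

    Under this sketch, a receiver following choice [2] would call
    is_well_formed_utf8() on the incoming bytes and reject the whole string
    on a False result.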

    In all cases, what is generated must strictly comply with "well formed
    UTF-8 byte sequences".

    I have little doubt that Unicode would tend to advocate choice [2], but as
    noted above the "IETF way" would be choice [1].
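
    For comparison, choice [1] might look something like the following
    Python sketch. It recognizes the older one- to six-byte forms, and maps
    any value above 0x10FFFF to a "not in Unicode" marker instead of failing.
    The names decode_lenient and NOT_IN_UNICODE are hypothetical, and the
    handling of overlong forms (rejected) and surrogates (mapped to the
    marker) is my own assumption, not something either specification
    dictates for this approach.

```python
NOT_IN_UNICODE = None  # hypothetical marker for out-of-range values

# (lead-byte mask, lead-byte pattern, trail-byte count, smallest value
#  that legitimately needs this length)
_FORMS = [
    (0x80, 0x00, 0, 0x0),
    (0xE0, 0xC0, 1, 0x80),
    (0xF0, 0xE0, 2, 0x800),
    (0xF8, 0xF0, 3, 0x10000),
    (0xFC, 0xF8, 4, 0x200000),
    (0xFE, 0xFC, 5, 0x4000000),
]


def decode_lenient(data):
    """Decode bytes to a list of code points, recognizing 1..6 byte forms.

    Values above 0x10FFFF (and, by assumption, surrogates) become
    NOT_IN_UNICODE. Structurally broken input raises ValueError.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, pattern, trail, minimum in _FORMS:
            if (lead & mask) == pattern:
                break
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        tail = data[i + 1 : i + 1 + trail]
        if len(tail) != trail or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"bad or missing trail byte after offset {i}")
        value = lead & ~mask & 0xFF   # payload bits of the lead byte
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        if value < minimum:
            raise ValueError(f"overlong sequence at offset {i}")
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            out.append(NOT_IN_UNICODE)
        else:
            out.append(value)
        i += trail + 1
    return out
```

    A decoder like this only ever reads the longer forms; it says nothing
    about what may be generated.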

    As a practical matter, it should not make any difference. You should
    never expect anything other than a well-formed sequence to work.

    -- Mark --

    http://panda.com/mrc
    Democracy is two wolves and a sheep deciding what to eat for lunch.
    Liberty is a well-armed sheep contesting the vote.