Re: Invalid code points

From: Asmus Freytag
Date: Mon Jun 01 2009 - 11:31:48 CDT

    The reason for the strict enforcement has to do with security: by
    adhering to [2] you deny certain types of "bad UTF-8" attacks
    that are possible under [1].

    Not a minor "practical" concern.
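
    To make the kind of attack concrete (this example is an illustration,
    not something named in the thread): the classic case is an overlong
    encoding, where "/" (U+002F) is written as the two bytes C0 AF. A
    byte-level filter that looks only for 2F lets it through, and a lenient
    decoder then turns it back into "/". The helper names below
    (lenient_decode_2byte, filter_passes, strict_rejects) are hypothetical
    and exist only for this sketch; Python's built-in decoder stands in for
    a strict implementation.

```python
def lenient_decode_2byte(data: bytes) -> str:
    """Hypothetical lenient decoder: accepts any 110xxxxx 10xxxxxx pair
    without checking that the value actually needed two bytes."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            # No overlong check here: C0 AF decodes to U+002F.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError(f"unsupported byte 0x{b:02X} at offset {i}")
    return "".join(out)


def filter_passes(raw: bytes) -> bool:
    """Hypothetical filter that only looks for a literal '/' byte."""
    return b"/" not in raw


def strict_rejects(raw: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the input."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


payload = b"..\xc0\xaf..\xc0\xafetc"
print(filter_passes(payload))          # the filter sees no 0x2F byte
print(lenient_decode_2byte(payload))   # the lenient decoder restores the slashes
print(strict_rejects(payload))         # a strict decoder refuses the input
```

    With strict checking the payload never reaches the code that interprets
    the path, which is the point of choosing [2].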


    On 6/1/2009 8:46 AM, Mark Crispin wrote:
    > On Mon, 1 Jun 2009, Phillips, Addison wrote:
    >> Uh... the IETF does not define UTF-8. The Unicode Consortium does.
    >> But even if you want to build on the IETF documents, RFC 3629 was
    >> published six years ago. Basing a new implementation on something
    >> published 11 years ago and obsolete the last six years? Not a good idea.
    > This is true; but generally within the IETF specifications are upwards
    > compatible.
    > I think that there are two obvious implementation choices:
    > [1] Recognize the sequences for the 0x110000 - 0x7fffffff ranges,
    > never generate them, and if a value in that range is encountered treat
    > it as an "error" or "not in Unicode" value. This is the traditional
    > IETF philosophy.
    > [2] Strictly enforce the rules for "well formed UTF-8 byte sequences"
    > on page 104 of Unicode 5.0, and reject any string which fails to
    > comply (note in particular the requirements of the second byte).
    > In all cases, what is generated must strictly comply with "well formed
    > UTF-8 byte sequences".
    > I have little doubt that Unicode would tend to advocate choice [2],
    > but as noted above the "IETF way" would be choice [1].
    > As a practical matter, it should not make any difference. You should
    > never expect anything other than a well-formed sequence to work.
    > -- Mark --
    > Democracy is two wolves and a sheep deciding what to eat for lunch.
    > Liberty is a well-armed sheep contesting the vote.
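
    For reference, choice [2] can be sketched as a table-driven check. The
    byte ranges below are the "well formed UTF-8 byte sequences" table as I
    understand it (each first-byte range fixes the allowed range of the
    second byte; any further bytes are 80..BF). The names _RANGES and
    is_well_formed_utf8 are made up for this sketch, and it only validates;
    it does not decode.

```python
# (first lo, first hi, second lo, second hi, sequence length)
_RANGES = [
    (0x00, 0x7F, None, None, 1),
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),  # excludes overlong 3-byte forms
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),  # excludes surrogates D800..DFFF
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),  # excludes overlong 4-byte forms
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),  # excludes values above 10FFFF
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if every byte sequence in data matches the table."""
    i = 0
    n = len(data)
    while i < n:
        first = data[i]
        for lo, hi, lo2, hi2, length in _RANGES:
            if lo <= first <= hi:
                break
        else:
            # C0, C1, F5..FF, or a stray continuation byte 80..BF
            return False
        if i + length > n:
            return False  # sequence is truncated
        if length > 1:
            if not (lo2 <= data[i + 1] <= hi2):
                return False  # second byte outside its restricted range
            for b in data[i + 2 : i + length]:
                if not (0x80 <= b <= 0xBF):
                    return False
        i += length
    return True
```

    A decoder following choice [1] would differ mainly in the last rows: it
    would also parse lead bytes above F4 (and F4 followed by 90..BF) and map
    the result to an error value instead of rejecting the whole string.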