Re: Invalid code points

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jun 04 2009 - 02:51:00 CDT


    On 1 Jun 2009, at 17:46, Mark Crispin wrote:

    > I think there are two obvious implementation choices:
    >
    > [1] Recognize the sequences for the 0x110000 - 0x7fffffff ranges,
    > never generate them, and if a value in that range is encountered
    > treat it as an "error" or "not in Unicode" value. This is the
    > traditional IETF philosophy.
    >
    > [2] Strictly enforce the rules for "well formed UTF-8 byte
    > sequences" on page 104 of Unicode 5.0, and reject any string which
    > fails to comply (note in particular the requirements of the second
    > byte).
    >
    > In all cases, what is generated must strictly comply with "well
    > formed UTF-8 byte sequences".
    >
    > I have little doubt that Unicode would tend to advocate choice [2],
    > but as noted above the "IETF way" would be choice [1].
    >
    > As a practical matter, it should not make any difference. You
    > should never expect anything other than a well-formed sequence to
    > work.
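
    To make the two choices concrete, here is a minimal sketch of a decoder,
    assuming the original 1-6 byte UTF-8 scheme for the sequences above
    0x10FFFF. The names decode_one and classify are hypothetical and appear
    neither in the message above nor in any library. Surrogates are rejected
    in both modes here, which is an assumption of the sketch.

```python
UNICODE_MAX = 0x10FFFF

# (lead mask, lead pattern, total length, smallest value allowed at that length)
_FORMS = (
    (0xE0, 0xC0, 2, 0x80),
    (0xF0, 0xE0, 3, 0x800),
    (0xF8, 0xF0, 4, 0x10000),
    (0xFC, 0xF8, 5, 0x200000),
    (0xFE, 0xFC, 6, 0x4000000),
)


def decode_one(data: bytes, i: int = 0):
    """Decode one sequence at data[i]; return (value, length).

    Accepts the extended range up to 0x7FFFFFFF, but raises ValueError
    for bad lead bytes, bad continuation bytes, and overlong forms.
    """
    b0 = data[i]
    if b0 < 0x80:
        return b0, 1
    for mask, pattern, length, minimum in _FORMS:
        if (b0 & mask) == pattern:
            break
    else:
        raise ValueError(f"invalid lead byte 0x{b0:02X}")
    tail = data[i + 1 : i + length]
    if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
        raise ValueError("truncated sequence or bad continuation byte")
    value = b0 & ~mask & 0xFF  # payload bits of the lead byte
    for b in tail:
        value = (value << 6) | (b & 0x3F)
    if value < minimum:
        raise ValueError("overlong encoding")
    return value, length


def classify(data: bytes, strict: bool):
    """Apply choice [2] if strict is true, otherwise choice [1]."""
    value, _ = decode_one(data)
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError("surrogate code point")
    if value > UNICODE_MAX:
        if strict:
            raise ValueError("not a well-formed UTF-8 byte sequence")
        return ("not-in-unicode", value)  # recognized, but flagged
    return ("scalar", value)
```

    Under choice [1], classify(b"\xF4\x90\x80\x80", strict=False) reports the
    value 0x110000 as "not-in-unicode"; under choice [2] the same bytes raise
    an error. Neither mode ever generates such a sequence.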

    In the end, I decided to make my own integer-to-byte encoding, because
    I wanted to cover negative and larger integers while keeping some
    fundamental UTF-8 properties: the range 1-127 is encoded unchanged, and
    the sets of leading and trailing bytes are disjoint, which admits
    resynchronization. (And 0 is not mapped to '\0', though it could be.)
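
    The resynchronization property can be illustrated with ordinary UTF-8,
    not with my own encoding, whose details are not given here. In this
    sketch the helper names is_continuation and resync are hypothetical.
    Because trailing bytes always have the form 10xxxxxx and leading bytes
    never do, a reader dropped at an arbitrary offset can skip forward to
    the next sequence boundary:

```python
def is_continuation(b: int) -> bool:
    """True for a UTF-8 trailing byte (bit pattern 10xxxxxx)."""
    return (b & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a trailing byte.

    If pos lands in the middle of a multi-byte sequence, the rest of that
    sequence is skipped. Returns len(data) if no boundary follows.
    """
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "a\u00e9\u20ac".encode("utf-8")  # bytes: 61 C3 A9 E2 82 AC
start = resync(data, 2)                 # offset 2 is inside the 2-byte sequence
```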

    But if one makes a byte code by first making an integer code and
    translating it using the UTF-8 method, then it would have the
    property that embedded strings appear as normal UTF-8 (assuming
    their integer representation is by their code points). If, further, an
    editor can show all the invalid code points (which are still
    structurally legal UTF-8), say by escape sequences, then all of the
    byte code can be seen, which is perhaps useful for debugging.
    (Editors may instead simply report that they cannot parse the code as
    UTF-8 and show nothing.)
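
    A rough sketch of this idea, assuming the original 1-6 byte UTF-8
    scheme for integers up to 0x7FFFFFFF: encode_ext, decode_ext and show
    are hypothetical names, and the \x{...} escape format is just one
    possible choice for display. The decoder does no validation and
    assumes its input came from encode_ext.

```python
def encode_ext(n: int) -> bytes:
    """Encode 0 <= n <= 0x7FFFFFFF with the UTF-8 method (1 to 6 bytes)."""
    if not 0 <= n <= 0x7FFFFFFF:
        raise ValueError("value out of range")
    if n < 0x80:
        return bytes([n])
    # (exclusive upper limit, lead byte marker, number of trailing bytes)
    for limit, lead, ntail in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if n < limit:
            break
    out = [lead | (n >> (6 * ntail))]
    for k in range(ntail - 1, -1, -1):
        out.append(0x80 | ((n >> (6 * k)) & 0x3F))
    return bytes(out)


def decode_ext(data: bytes):
    """Yield the integers in data; no validation is performed."""
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            value, ntail = b0, 0
        else:
            ntail = 1
            while b0 & (0x40 >> ntail):  # count further leading 1 bits
                ntail += 1
            value = b0 & (0x3F >> ntail)
        for b in data[i + 1 : i + 1 + ntail]:
            value = (value << 6) | (b & 0x3F)
        yield value
        i += 1 + ntail


def show(data: bytes) -> str:
    """Render data as text, escaping values that are not Unicode scalars."""
    parts = []
    for v in decode_ext(data):
        if v <= 0x10FFFF and not 0xD800 <= v <= 0xDFFF:
            parts.append(chr(v))
        else:
            parts.append(f"\\x{{{v:X}}}")
    return "".join(parts)
```

    For values that are code points, encode_ext produces the same bytes as
    the usual UTF-8 encoder, so an embedded string reads as normal text,
    while show(encode_ext(0x110000)) gives the escape \x{110000}.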

    So that is one possible use for an extended UTF-8 format, beyond the
    Unicode range.

       Hans


