Re: Invalid code points

From: Doug Ewell (doug@ewellic.org)
Date: Mon Jun 01 2009 - 08:48:32 CDT


    Hans Aberg <haberg at math dot su dot se> wrote:

    > I was just reading the successor sequence of RFCs:
    > http://tools.ietf.org/html/rfc2044
    > http://tools.ietf.org/html/rfc2279
    > http://tools.ietf.org/html/rfc3629
    >
    > The last one restricts UTF-8 to the Unicode range, that is, to the
    > limits of UTF-16, but the others do not.

    That's one of the main reasons RFC 3629 was written to replace 2279.
    (Another reason was to be more conclusive about disallowing non-shortest
    sequences.)
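
    As a rough illustration of the two restrictions just mentioned, here
    is a minimal sketch of a checker that rejects non-shortest sequences
    and anything outside the Unicode range (above U+10FFFF, or in the
    surrogate block). The name utf8_is_valid is made up for this example
    and does not come from any RFC or library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (name invented for this sketch): returns true if
 * buf[0..len) is well-formed UTF-8 in the RFC 3629 sense. */
bool utf8_is_valid(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char lead = buf[i];
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point that needs this length */
        size_t need;        /* number of continuation bytes expected */

        if (lead < 0x80) {
            i++;            /* single-byte (ASCII) sequence */
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; need = 1; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; need = 2; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; need = 3; min = 0x10000;
        } else {
            /* Stray continuation byte, or a 5/6-byte lead byte that
             * RFC 2279 allowed and RFC 3629 does not. */
            return false;
        }

        if (len - i - 1 < need)
            return false;   /* sequence is truncated */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return false;   /* non-shortest (overlong) form */
        if (cp > 0x10FFFF)
            return false;   /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;   /* UTF-16 surrogate, not a character */

        i += need + 1;
    }
    return true;
}
```

    With this sketch, the two-byte sequence C0 80 (a non-shortest form of
    zero) and the four-byte sequence F4 90 80 80 (which would decode to
    0x110000) are both rejected, while F4 8F BF BF (U+10FFFF) is accepted.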

    > If one wants an integer-to-byte sequence encoding, then it might be
    > better to design it differently than UTF-8, anyhow. If programs just
    > forward the byte sequences, there should be no problem. But if some
    > intermediate program would check for UTF-8 validity, that could cause
    > problems.

    You will already run into problems pretending binary data is a character
    string, and passing it around in a UTF-8-derived format, if the data
    contains zero values. These are more common in binary data than any
    other single value, and they will terminate your "string" prematurely.
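
    The truncation is easy to demonstrate. The sketch below reports how
    many bytes of a buffer a NUL-terminated string API would actually
    see; bytes_seen_as_c_string is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name invented for this sketch): of the len bytes
 * in data, how many come before the first zero byte? That prefix is all
 * that strlen(), strcpy(), printf("%s") and friends would process. */
size_t bytes_seen_as_c_string(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);

    /* Guard the empty case so memchr is never called on a null pointer. */
    const unsigned char *nul = (len > 0) ? memchr(data, 0, len) : NULL;

    return nul ? (size_t)(nul - data) : len;
}
```

    For the four bytes 12 34 00 56 this returns 2: the zero byte and
    everything after it are invisible to code that treats the buffer as a
    string.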

    > In the situation I had in mind, a byte sequence with no ties to C
    > strings or UTF-8 would be preferred, but the latter is forced by the
    > context (argument passing on a Unix computer). But there is an
    > interesting idea: rather than a byte code, make an integer code, and
    > then use an integer-to-byte encoding, which then can be changed
    > according to context.

    There's no reason you can't pass a pointer to an array of integer types,
    plus an integer indicating the length of the array, as arguments on a
    Unix computer.
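
    A minimal sketch of that calling convention, assuming the caller and
    callee share an address space (an ordinary C function call). The
    function sum_values and the choice of uint32_t are made up for
    illustration; any integer type would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical consumer (name invented for this sketch): receives the
 * data as a pointer plus an explicit element count, so no value inside
 * the array has to act as a terminator. */
uint64_t sum_values(const uint32_t *values, size_t count)
{
    assert(values != NULL || count == 0);

    uint64_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];   /* zeros and large values are just data */
    return total;
}
```

    Because the length travels separately, an array such as
    { 7, 0, 0x110000, 5 } passes through intact, including the zero and
    the value above the Unicode range.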

    If you still prefer to convert the data to a string for some reason,
    there are plenty of binary-to-text conversion algorithms out there.
    Base64 comes to mind, though you may want more efficiency. Or if the
    strings are not going to leak out of your system, you can always develop
    your own algorithm to suit your needs.
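
    For reference, a minimal Base64 encoder might look like the sketch
    below. It uses the standard alphabet with '=' padding; the function
    names are invented for this example. The efficiency cost is visible
    in the size calculation: every 3 input bytes become 4 output
    characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: buffer size needed to encode n bytes, including
 * the terminating NUL. */
size_t base64_encoded_size(size_t n)
{
    return 4 * ((n + 2) / 3) + 1;
}

/* Hypothetical helper: encodes in[0..n) into out as a NUL-terminated
 * string and returns the number of characters written, not counting
 * the NUL. out must hold at least base64_encoded_size(n) bytes. */
size_t base64_encode(const unsigned char *in, size_t n, char *out)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789+/";

    assert(out != NULL && (in != NULL || n == 0));

    size_t o = 0;
    for (size_t i = 0; i < n; i += 3) {
        /* Pack up to three bytes into a 24-bit group. */
        unsigned long group = (unsigned long)in[i] << 16;
        if (i + 1 < n)
            group |= (unsigned long)in[i + 1] << 8;
        if (i + 2 < n)
            group |= in[i + 2];

        /* Emit four 6-bit digits, padding a short final group. */
        out[o++] = alphabet[(group >> 18) & 0x3F];
        out[o++] = alphabet[(group >> 12) & 0x3F];
        out[o++] = (i + 1 < n) ? alphabet[(group >> 6) & 0x3F] : '=';
        out[o++] = (i + 2 < n) ? alphabet[group & 0x3F] : '=';
    }
    out[o] = '\0';
    return o;
}
```

    The output contains only printable ASCII and never a zero byte, so
    three zero bytes of input come out as the string "AAAA" and survive
    any API that expects a C string.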

    --
    Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

