Re: Invalid code points

From: Hans Aberg
Date: Mon Jun 01 2009 - 09:17:14 CDT


    On 1 Jun 2009, at 15:48, Doug Ewell wrote:

    >> I was just reading the successor sequence of RFCs:
    >> The last one restricts UTF-8 to the Unicode range (that is, to the
    >> limitations of UTF-16), but the others do not.
    > That's one of the main reasons RFC 3629 was written to replace 2279.
    > (Another reason was to be more conclusive about disallowing non-
    > shortest sequences.)

    I wasn't aware of this last one with the restrictions.

    >> If one wants an integer-to-byte sequence encoding, then it might be
    >> better to design it differently from UTF-8 anyhow. If programs
    >> just forward the byte sequences, there should be no problem. But if
    >> some intermediate program checks for UTF-8 validity, that
    >> could cause problems.
    > You will already run into problems pretending binary data is a
    > character string, and passing it around in a UTF-8-derived format,
    > if the data contains zero values. These are more common in binary
    > data than any other single value, and they will terminate your
    > "string" prematurely.

    Of course, but one easy way around that is to use the excluded
    duplicate numbers (the non-shortest sequences).
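
A minimal sketch of that idea, under one possible reading: "excluded duplicate numbers" is taken to mean the non-shortest UTF-8 sequences, and zero is written as the two-byte form C0 80 (the same trick Java's "modified UTF-8" uses). The function names are hypothetical, and each input byte is simply treated as an integer in the range 0..255.

```python
def encode_with_overlong_nul(data: bytes) -> bytes:
    """Encode each byte as a UTF-8-style sequence, writing 0 as C0 80."""
    out = bytearray()
    for b in data:
        if b == 0:
            out += b"\xc0\x80"               # non-shortest form of U+0000
        elif b < 0x80:
            out.append(b)                    # one byte, as in UTF-8
        else:
            out.append(0xC0 | (b >> 6))      # lead byte (C2 or C3)
            out.append(0x80 | (b & 0x3F))    # continuation byte
    return bytes(out)


def decode_with_overlong_nul(encoded: bytes) -> bytes:
    """Inverse of encode_with_overlong_nul; rejects anything else."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        b = encoded[i]
        if b == 0:
            raise ValueError("raw zero byte is not allowed")
        if b < 0x80:
            out.append(b)
            i += 1
            continue
        if (b & 0xE0) != 0xC0 or i + 1 >= len(encoded):
            raise ValueError("expected a two-byte sequence")
        c = encoded[i + 1]
        if (c & 0xC0) != 0x80:
            raise ValueError("bad continuation byte")
        value = ((b & 0x1F) << 6) | (c & 0x3F)
        # Only zero may use a non-shortest form; values above 255 are out of range.
        if value > 0xFF or 0 < value < 0x80:
            raise ValueError("sequence outside this scheme")
        out.append(value)
        i += 2
    return bytes(out)
```

The encoded result never contains a zero byte, so it survives C-string handling. The cost is the one raised earlier in the thread: a strict UTF-8 validator (Python's own `bytes.decode("utf-8")`, for instance) rejects C0 80.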

    >> In the situation I had in mind, a byte sequence with no ties to C
    >> strings or UTF-8 would be preferred, but the latter is forced by
    >> the context (argument passing on a Unix computer). But there is an
    >> interesting idea: rather than a byte code, make an integer code, and
    >> then use an integer-to-byte encoding, which can then be changed
    >> according to context.
    > There's no reason you can't pass a pointer to an array of integer
    > types, plus an integer indicating the length of the array, as
    > arguments on a Unix computer.

    But the pointer values may change (I have checked that), so the array
    may have been copied along the way; at least the specs allow that. If
    that happens, anything beyond the '\0' is lost. Otherwise, it is a good
    idea for passing extra info that only some programs may use.
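
The loss beyond '\0' can be illustrated with a small sketch. It uses Python's `ctypes` buffers only as a stand-in for a C-string copy; it is not a model of how any particular Unix passes arguments, and the helper names are hypothetical.

```python
import ctypes


def copied_as_c_string(data: bytes) -> bytes:
    """What survives when the buffer is read as a NUL-terminated string."""
    buf = ctypes.create_string_buffer(data)
    return buf.value              # stops at the first zero byte


def copied_with_length(data: bytes) -> bytes:
    """What survives when the length is carried alongside the buffer."""
    buf = ctypes.create_string_buffer(data)
    return buf.raw[:len(data)]    # explicit length, zero bytes included


payload = b"ab\x00cd"
print(copied_as_c_string(payload))    # b'ab'
print(copied_with_length(payload))    # b'ab\x00cd'
```

The second form only helps if every program that copies the data also carries the length along, which is the point in doubt above.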

    > If you still prefer to convert the data to a string for some reason,
    > there are plenty of binary-to-text conversion algorithms out there.
    > Base64 comes to mind, though you may want more efficiency. Or if
    > the strings are not going to leak out of your system, you can always
    > develop your own algorithm to suit your needs.
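
The Base64 route quoted above can be sketched with Python's standard `base64` module. The two function names are hypothetical. The encoded text is pure ASCII with no zero bytes, at the cost of growing the data by about one third.

```python
import base64


def to_argument(payload: bytes) -> str:
    """Turn arbitrary bytes into ASCII text that is safe as a C string."""
    return base64.b64encode(payload).decode("ascii")


def from_argument(argument: str) -> bytes:
    """Recover the original bytes; rejects characters outside the alphabet."""
    return base64.b64decode(argument, validate=True)


payload = bytes([0, 1, 2, 255])
text = to_argument(payload)
print(text)                           # AAEC/w==
print(from_argument(text) == payload) # True
```

If the '/' and '+' characters are awkward in the receiving context, `base64.urlsafe_b64encode` uses '-' and '_' instead.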

    Thanks for the input. It depends much on what might work nicely,
    and I'm not sure about that. The ideal would be to have as few
    changes as possible for traditional arguments, and to have typing info
    only when needed. For example, "123" is not the same as 123, but the
    former with quotes is likely to cause problems. So pass the string
    without quotes, and likewise the number, and some programs would want
    to distinguish the two.

