Re: Invalid code points

From: Hans Aberg (haberg@math.su.se)
Date: Mon Jun 01 2009 - 09:17:14 CDT

Next message: Michael Everson: "Re: IPA transformation (was "Dozenal chars in music")"

Previous message: Curtis Clark: "Re: IPA transformation (was "Dozenal chars in music")"
In reply to: Doug Ewell: "Re: Invalid code points"
Next in thread: Phillips, Addison: "RE: Invalid code points"

On 1 Jun 2009, at 15:48, Doug Ewell wrote:

>> I was just reading the successor sequence of RFCs:
>> http://tools.ietf.org/html/rfc2044
>> http://tools.ietf.org/html/rfc2279
>> http://tools.ietf.org/html/rfc3629
>>
>> The last one restricts UTF-8 to the Unicode range, which is the limit
>> imposed by UTF-16, but the others do not.
>
> That's one of the main reasons RFC 3629 was written to replace 2279.
> (Another reason was to be more conclusive about disallowing non-
> shortest sequences.)

I wasn't aware of this last one with the restrictions.
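
To make the difference concrete, here is a minimal sketch in Python. It assumes only that Python's built-in "utf-8" codec is strict in the RFC 3629 sense (Unicode range only, no non-shortest forms); the helper name is_strict_utf8 is made up for illustration.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+10FFFF, the last code point reachable through UTF-16.
last_unicode = b"\xf4\x8f\xbf\xbf"
# 0x110000, one past the Unicode range, written in the same four-byte layout.
past_unicode = b"\xf4\x90\x80\x80"
# 0x200000 in the five-byte layout that the older RFCs still described.
five_bytes = b"\xf8\x88\x80\x80\x80"
# Non-shortest (two-byte) form of the value 0.
overlong_zero = b"\xc0\x80"

for name, seq in [("last_unicode", last_unicode),
                  ("past_unicode", past_unicode),
                  ("five_bytes", five_bytes),
                  ("overlong_zero", overlong_zero)]:
    print(name, is_strict_utf8(seq))
```

Only the first sequence passes; the other three are the kind of data an intermediate validity check would reject.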

>> If one wants an integer-to-byte sequence encoding, then it might be
>> better to design it differently than UTF-8, anyhow. If programs
>> just forward the byte sequences, there should be no problem. But if
>> some intermediate program were to check for UTF-8 validity, that
>> could cause problems.
>
> You will already run into problems pretending binary data is a
> character string, and passing it around in a UTF-8-derived format,
> if the data contains zero values. These are more common in binary
> data than any other single value, and they will terminate your
> "string" prematurely.

Of course, but one easy way around that is to use the excluded
duplicate (non-shortest) encodings of the numbers 0-127.
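
A rough sketch of that idea in Python follows. It is an assumption-laden illustration, not a proposal: the function names are invented, only values 0..0x7FF are handled, and only zero is moved to its two-byte form C0 80 (the same trick Java's "modified UTF-8" uses), since zero is the only value that terminates a C string.

```python
def encode_avoiding_nul(values):
    """Encode ints in 0..0x7FF so that the output never contains a 0x00 byte.

    Zero is written in its non-shortest two-byte form C0 80; all other
    values use the ordinary one- or two-byte UTF-8 bit layout.
    """
    out = bytearray()
    for v in values:
        if not 0 <= v <= 0x7FF:
            raise ValueError("this sketch handles 0..0x7FF only")
        if 1 <= v <= 0x7F:
            out.append(v)                      # plain single byte
        else:                                  # v == 0 or v >= 0x80
            out.append(0xC0 | (v >> 6))        # lead byte, top 5 bits
            out.append(0x80 | (v & 0x3F))      # continuation byte, low 6 bits
    return bytes(out)


def decode_lenient(data):
    """Inverse of encode_avoiding_nul; accepts the non-shortest form of zero."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and data[i + 1] & 0xC0 == 0x80:
            out.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            raise ValueError("unexpected byte at offset %d" % i)
    return out


encoded = encode_avoiding_nul([0, 65, 0, 255])
print(encoded)                   # contains no zero byte
print(decode_lenient(encoded))   # round-trips to the original values
```

The catch is the one already raised in this thread: a strict RFC 3629 validator sitting between sender and receiver will reject the C0 80 pairs.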

>> In the situation I had in mind, a byte sequence with no ties to C
>> strings or UTF-8 would be preferred, but the latter is forced by
>> the context (argument passing on a Unix computer). But there is an
>> interesting idea: rather than a byte code, make an integer code, and
>> then use an integer-to-byte encoding, which can then be changed
>> according to context.
>
> There's no reason you can't pass a pointer to an array of integer
> types, plus an integer indicating the length of the array, as
> arguments on a Unix computer.

But the pointer values may change (I have checked that), so the array
may have been copied along the way; at least the specs allow that. If
it is copied as a C string, anything beyond '\0' is lost. Otherwise,
it is a good way to pass extra info that only some programs may use.
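
The loss can be imitated in a few lines of Python with ctypes. This is only a stand-in for what happens to an argument: it does not show the operating system copying argv, just the difference between reading a buffer as a NUL-terminated C string and reading it with an explicit length.

```python
import ctypes

payload = b"abc\x00def"

# create_string_buffer copies the bytes and appends a terminating NUL.
buf = ctypes.create_string_buffer(payload)

# Read as a C string: stops at the first '\0'.
as_c_string = buf.value

# Read with an explicit length: the whole payload survives.
with_length = buf.raw[:len(payload)]

print(as_c_string)
print(with_length)
```

Any copy made with C-string semantics behaves like the first read, which is why a separate length argument does not help once such a copy has happened.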

> If you still prefer to convert the data to a string for some reason,
> there are plenty of binary-to-text conversion algorithms out there.
> Base64 comes to mind, though you may want more efficiency. Or if
> the strings are not going to leak out of your system, you can always
> develop your own algorithm to suit your needs.
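
For reference, a minimal Base64 round trip using Python's standard base64 module. The payload is an arbitrary example; the point is only that the encoded form contains no zero bytes and so survives being handled as a C string.

```python
import base64

payload = bytes([0, 1, 2, 0, 255])       # binary data with embedded zeros

# Encode to plain ASCII text that is safe to pass as a program argument.
arg = base64.b64encode(payload).decode("ascii")

# The receiving program reverses the step.
restored = base64.b64decode(arg)

print(arg)
print(restored == payload)
```

The cost is size: every 3 input bytes become 4 output characters.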

Thanks for the input. It depends very much on what might work nicely,
and I'm not sure about that. The ideal would be to have as few changes
as possible for traditional arguments, and to have typing info only
when needed. For example, "123" is not the same as 123, but the
former with quotes is likely to cause problems. So pass the string
without quotes, and the number the same way, and some programs would
want to distinguish the two.
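
A small illustration of why the distinction is not available today, using Python's shlex module as a stand-in for POSIX-style shell word splitting (the command line itself is made up):

```python
import shlex

# What a POSIX-style shell would hand to the program as its argument list.
argv = shlex.split('prog "123" 123')

print(argv)
print(argv[1] == argv[2])
```

Both arguments arrive as the same three characters, so any typing info would have to travel by some other route.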

Hans

