Re: Invalid code points

From: Hans Aberg (haberg@math.su.se)
Date: Mon Jun 01 2009 - 02:33:07 CDT

Next message: Andrew West: "Re: Old Italic in RTL ??"

Previous message: Hans Aberg: "Re: Invalid code points"
Maybe in reply to: Hans Aberg: "Re: Invalid code points"
Next in thread: Doug Ewell: "Re: Invalid code points"

On 1 Jun 2009, at 03:50, Doug Ewell wrote:

>> If I understand Hans Aberg's point, he means that one can abstract
>> the mapping from the non-negative integers to byte sequences used
>> by UTF-8 away from Unicode and use it for other purposes. One
>> could, for example, have a "UTF-8" encoding of the TRON indexed
>> character set, or of Nelson numbers. In this sense, there is
>> "UTF-8", the integer->byte sequence mapping, and UTF-8, the Unicode
>> transformation format that uses this mapping. This seems to me to
>> be a perfectly valid point. However, so as to avoid confusion, we
>> ought to call them different things, and since the "U" of "UTF-8"
>> stands for "Unicode", it is the mapping in the abstract that ought
>> to be given another name, perhaps the "Thompson mapping" or "diner
>> encoding".
>
> Oh, absolutely. You can use the transformation for anything you
> like, and modify it to suit your needs. You can extend it to cover
> the original 31-bit range, and to encode the values 0xD800 through
> 0xDFFF. You can even explain that it is derived from UTF-8.
>
> What you must not do, though, is call the resulting transformation
> "UTF-8," or anything that people will have a reasonable chance of
> confusing with the real UTF-8, such as "UTF-8X."

If one wants an integer-to-byte sequence encoding, then it might be
better to design it differently from UTF-8 anyhow. If programs just
forward the byte sequences, there should be no problem. But if some
intermediate program checks for UTF-8 validity, that could cause
problems.
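
To make the point about validity checks concrete, here is a minimal
sketch in Python of the abstract integer-to-byte mapping, using the
original bit layout of up to six bytes for values below 2**31. The
names thompson_encode and is_valid_utf8 are my own, chosen for
illustration, and Python's strict "utf-8" decoder stands in for the
hypothetical intermediate program that checks validity.

```python
def thompson_encode(n: int) -> bytes:
    """Encode 0 <= n < 2**31 using the original (up to 6-byte) UTF-8 bit layout."""
    if not 0 <= n < 1 << 31:
        raise ValueError("value outside the 31-bit range")
    if n < 0x80:
        return bytes([n])
    # (payload bits, total length in bytes, lead-byte marker)
    layouts = ((11, 2, 0xC0), (16, 3, 0xE0), (21, 4, 0xF0),
               (26, 5, 0xF8), (31, 6, 0xFC))
    for bits, length, marker in layouts:
        if n < 1 << bits:
            break
    tail = []
    for _ in range(length - 1):
        # Each continuation byte carries six payload bits: 10xxxxxx.
        tail.append(0x80 | (n & 0x3F))
        n >>= 6
    return bytes([marker | n] + tail[::-1])


def is_valid_utf8(data: bytes) -> bool:
    """Stand-in for an intermediate program that insists on real UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Agrees with UTF-8 on an ordinary code point ...
print(thompson_encode(0x20AC) == "\u20ac".encode("utf-8"))
# ... but a surrogate value or a value above 0x10FFFF is rejected.
print(is_valid_utf8(thompson_encode(0xD800)))
print(is_valid_utf8(thompson_encode(0x110000)))
```

A program that only forwards these bytes never notices the difference.
One that validates will reject the last two sequences, which is the
problem described above.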

In the situation I had in mind, a byte sequence with no ties to C
strings or UTF-8 would be preferred, but the latter is forced by the
context (argument passing on a Unix computer). But there is an
interesting idea here: rather than a byte code, make an integer code,
and then use an integer-to-byte encoding, which can then be changed
according to context.
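
As a rough sketch of that idea, assume the data is first expressed as
a sequence of non-negative integers, and the byte form is chosen
separately. The two encodings below (fixed 32-bit big-endian and a
7-bits-per-byte variable-length form) are only examples I picked for
illustration; all function names are hypothetical.

```python
from typing import Iterable, List


def encode_fixed32(values: Iterable[int]) -> bytes:
    """Four bytes per integer, big-endian; requires 0 <= v < 2**32."""
    return b"".join(v.to_bytes(4, "big") for v in values)


def decode_fixed32(data: bytes) -> List[int]:
    return [int.from_bytes(data[i:i + 4], "big")
            for i in range(0, len(data), 4)]


def encode_varint(values: Iterable[int]) -> bytes:
    """Seven payload bits per byte, low bits first; high bit means 'more follows'."""
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("only non-negative integers are supported")
        while True:
            low = v & 0x7F
            v >>= 7
            if v:
                out.append(low | 0x80)
            else:
                out.append(low)
                break
    return bytes(out)


def decode_varint(data: bytes) -> List[int]:
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


# The integer code is fixed; only the byte encoding varies with context.
message = [72, 0x20AC, 0xD800, 0x7FFFFFFF]
for enc, dec in ((encode_fixed32, decode_fixed32),
                 (encode_varint, decode_varint)):
    print(enc.__name__, dec(enc(message)) == message)
```

Neither of these byte forms is safe for argument passing as it stands,
since both can produce zero bytes. A context like that would need yet
another encoding, which is the reason for keeping the integer code
separate.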

Hans

This archive was generated by hypermail 2.1.5 : Mon Jun 01 2009 - 02:35:53 CDT