Re: Invalid code points

From: Hans Aberg (haberg@math.su.se)
Date: Mon Jun 01 2009 - 02:33:07 CDT

    On 1 Jun 2009, at 03:50, Doug Ewell wrote:

    >> If I understand Hans Aberg's point, he means that one can abstract
    >> the mapping from the non-negative integers to byte sequences used
    >> by UTF-8 away from Unicode and use it for other purposes. One
    >> could, for example, have a "UTF-8" encoding of the TRON indexed
    >> character set, or of Nelson numbers. In this sense, there is
    >> "UTF-8", the integer->byte sequence mapping, and UTF-8, the Unicode
    >> transformation format that uses this mapping. This seems to me to
    >> be a perfectly valid point. However, so as to avoid confusion, we
    >> ought to call them different things, and since the "U" of "UTF-8"
    >> stands for "Unicode", it is the mapping in the abstract that ought
    >> to be given another name, perhaps the "Thompson mapping" or "diner
    >> encoding".
    >
    > Oh, absolutely. You can use the transformation for anything you
    > like, and modify it to suit your needs. You can extend it to cover
    > the original 31-bit range, and to encode the values 0xD800 through
    > 0xDFFF. You can even explain that it is derived from UTF-8.
    >
    > What you must not do, though, is call the resulting transformation
    > "UTF-8," or anything that people will have a reasonable chance of
    > confusing with the real UTF-8, such as "UTF-8X."

    If one wants an integer-to-byte sequence encoding, then it might be
    better to design it differently from UTF-8 anyhow. If programs just
    forward the byte sequences, there should be no problem. But if some
    intermediate program checks for UTF-8 validity, that could cause
    problems.
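
    To illustrate the validity concern, here is a minimal sketch in
    Python. The function names thompson_encode and is_valid_utf8 are
    hypothetical, chosen for this example. The encoder uses the original
    31-bit UTF-8 bit layout with no Unicode restrictions, and Python's
    strict "utf-8" decoder stands in for the intermediate program that
    checks validity.

```python
def thompson_encode(n: int) -> bytes:
    """Encode 0 <= n < 2**31 with the original UTF-8 bit layout.

    Hypothetical helper: unlike real UTF-8, it accepts surrogate values
    and values above 0x10FFFF.
    """
    if n < 0 or n >= 1 << 31:
        raise ValueError("value outside the 31-bit range")
    if n < 0x80:
        return bytes([n])
    # (exclusive upper limit, lead-byte prefix, continuation byte count)
    layouts = (
        (1 << 11, 0xC0, 1),
        (1 << 16, 0xE0, 2),
        (1 << 21, 0xF0, 3),
        (1 << 26, 0xF8, 4),
        (1 << 31, 0xFC, 5),
    )
    for limit, prefix, count in layouts:
        if n < limit:
            lead = prefix | (n >> (6 * count))
            # Each continuation byte carries six payload bits.
            tail = [0x80 | ((n >> (6 * i)) & 0x3F)
                    for i in range(count - 1, -1, -1)]
            return bytes([lead] + tail)
    raise AssertionError("unreachable")


def is_valid_utf8(data: bytes) -> bool:
    """Stand-in for an intermediate program that validates UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(thompson_encode(0x20AC)))    # ordinary code point
print(is_valid_utf8(thompson_encode(0xD800)))    # surrogate value
print(is_valid_utf8(thompson_encode(0x110000)))  # beyond the Unicode range
```

    Only the first value survives the check. Note also that the value 0
    encodes to a single zero byte, which a C string cannot carry.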

    In the situation I had in mind, a byte sequence with no ties to C
    strings or UTF-8 would be preferred, but the latter is forced by the
    context (argument passing on a Unix computer). But there is an
    interesting idea here: rather than a byte code, make an integer code,
    and then use an integer-to-byte encoding, which can be changed
    according to context.
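
    A minimal sketch of that separation, in Python. All names here
    (leb128, hex_text, serialize) are hypothetical, and the two encodings
    are only examples picked for illustration: LEB128 is a compact binary
    form, while the hexadecimal text form is assumed to suit argument
    passing because it never produces a zero byte.

```python
from typing import Callable, Iterable


def leb128(n: int) -> bytes:
    """Unsigned LEB128: seven payload bits per byte, low bits first."""
    if n < 0:
        raise ValueError("negative value")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low)
            return bytes(out)


def hex_text(n: int) -> bytes:
    """ASCII hexadecimal digits ended by ';'. Never emits a zero byte."""
    if n < 0:
        raise ValueError("negative value")
    return format(n, "x").encode("ascii") + b";"


def serialize(codes: Iterable[int],
              encode: Callable[[int], bytes]) -> bytes:
    """Turn an integer code into bytes with whichever encoding fits."""
    return b"".join(encode(n) for n in codes)


codes = [0, 300, 0xD800]          # the integer code itself stays the same
binary = serialize(codes, leb128)
argv_safe = serialize(codes, hex_text)
print(binary)
print(argv_safe)
```

    The integer sequence is the stable part; only the function passed to
    serialize changes with the context.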

       Hans


