Re: Invalid code points

From: William J Poser
Date: Sun May 31 2009 - 19:26:41 CDT

    > There is only one UTF-8, the one defined by Unicode and ISO/IEC 10646,
    > which maps valid Unicode/10646 scalar values to sequences of bytes.
    > Anything else is not UTF-8. Keep repeating this to yourself.

    If I understand Hans Aberg's point, he means that one can abstract
    the mapping from the non-negative integers to byte sequences used by
    UTF-8 away from Unicode and use it for other purposes. One could,
    for example, have a "UTF-8" encoding of the TRON indexed character
    set, or of Nelson numbers. In this sense, there is "UTF-8", the
    integer->byte sequence mapping, and UTF-8, the Unicode transformation
    format that uses this mapping. This seems to me to be a perfectly valid point.
    However, so as to avoid confusion, we ought to call them different
    things, and since the "U" of "UTF-8" stands for "Unicode", it is the
    mapping in the abstract that ought to be given another name, perhaps
    the "Thompson mapping" or "diner encoding".
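
    To make the distinction concrete, here is a rough sketch of the
    mapping taken in the abstract, with no reference to Unicode at all.
    It assumes the original 1-to-6-byte layout, which covers integers
    below 2**31; the function name thompson_encode is made up for this
    illustration. It performs none of the checks that UTF-8 proper
    requires (no rejection of surrogates or of values above 0x10FFFF).

```python
def thompson_encode(n: int) -> bytes:
    """Map a non-negative integer below 2**31 to bytes using the UTF-8 bit layout.

    Illustrative only: this is the bare integer->byte-sequence mapping,
    not a conforming UTF-8 encoder.
    """
    if n < 0 or n >= 1 << 31:
        raise ValueError("value out of range for the 1-to-6-byte layout")
    if n < 0x80:
        return bytes([n])  # single byte, 0xxxxxxx

    # (payload bits available, total bytes, lead-byte prefix)
    layouts = (
        (11, 2, 0xC0),  # 110xxxxx
        (16, 3, 0xE0),  # 1110xxxx
        (21, 4, 0xF0),  # 11110xxx
        (26, 5, 0xF8),  # 111110xx
        (31, 6, 0xFC),  # 1111110x
    )
    for bits, length, prefix in layouts:
        if n < 1 << bits:
            break

    out = bytearray()
    # Emit continuation bytes (10xxxxxx) from the low end, six bits at a time.
    for _ in range(length - 1):
        out.append(0x80 | (n & 0x3F))
        n >>= 6
    # What remains fits in the lead byte's payload bits.
    out.append(prefix | n)
    return bytes(reversed(out))


# Agrees with real UTF-8 on a valid scalar value (U+20AC):
print(thompson_encode(0x20AC).hex())    # e282ac
# But also accepts integers that UTF-8 proper must reject:
print(thompson_encode(0xD800).hex())    # eda080 (a surrogate)
print(thompson_encode(0x110000).hex())  # f4908080 (beyond the Unicode range)
```

    On valid scalar values the output coincides with UTF-8; on anything
    else it is only the abstract mapping at work, which is the case that
    wants a different name.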

