Re: Invalid code points

From: Hans Aberg
Date: Sun May 31 2009 - 15:39:28 CDT

    On 31 May 2009, at 22:18, Doug Ewell wrote:

    >>>> In particular, it would be great to know if the range U+0080, ...,
    >>>> U+009F is invalid.
    >>> That bit is especially wrong. I can at least imagine why there
    >>> might be confusion about the noncharacters and surrogate code
    >>> points, but not the C1 controls.
    >> It is a bit disappointing: I was looking for a beginning (escape)
    >> byte sequence to tell that string isn't UTF-8, among other valid
    >> strings. But perhaps it does not matter.
    > If you're thinking about inventing one, for your own use, then any
    > byte sequence that is not valid UTF-8 should do the job. One
    > possibility would be {0xA0}.

    Thank you for the suggestion.
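
A minimal sketch of the marker idea above, assuming Python's strict UTF-8 decoder as the validity check. The names `MARKER`, `is_valid_utf8` and `classify` are made up for illustration. The byte 0xA0 is a continuation byte, so it can never begin a valid UTF-8 string, which is what makes it usable as a leading escape.

```python
MARKER = b"\xa0"  # lone continuation byte: cannot start a valid UTF-8 string


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def classify(data: bytes) -> str:
    """Label a byte string as marked payload, UTF-8 text, or other bytes."""
    if data.startswith(MARKER):
        return "marked"
    return "utf-8" if is_valid_utf8(data) else "other"
```

Because no valid UTF-8 string starts with the marker, a reader that checks for it first cannot mistake ordinary text for a marked payload.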

    > Be sure you understand the difference between an invalid *byte
    > sequence* and an invalid *code point*. There are many invalid byte
    > sequences in UTF-8. As Mark pointed out, the only invalid code
    > points are the surrogates.

    Yes, I am thinking about both possibilities. The idea is, in an
    environment of '\0'-terminated C strings, to also pass some byte-code
    objects to those programs that can parse them.
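
The difference between an invalid byte sequence and an invalid code point can be checked directly. This is a small sketch that relies on Python's built-in strict UTF-8 codec; the helper names `encodes` and `decodes` are hypothetical.

```python
def encodes(code_point: int) -> bool:
    """True if the code point can be encoded as strict UTF-8."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes(data: bytes) -> bool:
    """True if the byte sequence is well-formed strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The C1 controls U+0080..U+009F are ordinary code points and encode fine.
c1_ok = all(encodes(cp) for cp in range(0x80, 0xA0))

# Surrogate code points are rejected by the encoder.
surrogate_ok = encodes(0xD800)

# Invalid byte sequences: a lone continuation byte, an overlong form of
# U+0000, and the three bytes a surrogate would have produced.
bad_sequences = [b"\x80", b"\xc0\x80", b"\xed\xa0\x80"]
```

Here the C1 range passes, the surrogate fails at the code point level, and the listed byte sequences fail at the byte level regardless of which number they might seem to represent.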

    > The section of the Wikipedia article you cited actually contains
    > quite a concentration of misleading information:

    Yes, that is quite a mess. I also think that, strictly speaking, there
    are two UTF-8s, one of which does not have the integer limitations
    used in Unicode. It could be used to convert integer sequences into
    byte sequences that then carry no Unicode character interpretation.
    So I like to think of Unicode UTF-8 as composed of two parts: a
    natural number to byte-sequence conversion, which is the real UTF-8,
    and, on top of that, an interpretation of the natural numbers as
    Unicode characters, which as such has nothing to do with the
    number-to-byte conversion.
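
The number-to-byte layer on its own might look like the sketch below. It assumes the 1 to 6 byte pattern from the older RFC 2279 definition of UTF-8, which covers all natural numbers below 2**31; the function name `encode_nat` is hypothetical. No Unicode restrictions are applied, so surrogates and values above 0x10FFFF are encoded like any other number.

```python
def encode_nat(n: int) -> bytes:
    """Encode a natural number below 2**31 using the 1-6 byte UTF-8 pattern."""
    if n < 0 or n > 0x7FFFFFFF:
        raise ValueError("out of range for the 6-byte scheme")
    if n < 0x80:
        return bytes([n])  # single byte, same as ASCII
    # (total length, exclusive upper limit, lead-byte prefix)
    forms = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for length, limit, prefix in forms:
        if n < limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (n & 0x3F))  # low 6 bits go into a continuation byte
        n >>= 6
    # The remaining high bits fit in the lead byte after its length prefix.
    return bytes([prefix | n] + tail[::-1])
```

For numbers that are also encodable Unicode code points the result agrees with the standard codec, for example `encode_nat(0x20AC) == "\u20ac".encode("utf-8")`. The Unicode layer then only adds the restriction on which numbers are allowed.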

