Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Mon Jan 17 2005 - 18:57:08 CST

    [Warning: Your letter scored high on spam points for unknown reasons, and
    my mail filter classified it as spam.]

    At 20:36 +0100 2005/01/17, Philippe VERDY wrote:
    >The standard version of the UTF-8 encoding scheme does not encode more than
    21 bits. In fact it only encodes codepoints from U+0000 to U+10FFFF inclusive.
    >
    >(Note that this is exactly the same codespace as in the standard UTF-16 and
    UTF-32 encoding schemes).

    I am aware of that. See more details in my reply to Kenneth Whistler.
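
    To make the 21-bit figure concrete, here is a small Python sketch (the
    names MAX_CODE_POINT, bits_needed and encoded are only illustrative):
    it checks how many bits the largest code point needs and how many bytes
    standard UTF-8 uses for it.

```python
# Largest code point in the Unicode/ISO/IEC 10646 codespace.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent that value.
bits_needed = MAX_CODE_POINT.bit_length()

# Standard UTF-8 encoding of that code point.
encoded = chr(MAX_CODE_POINT).encode("utf-8")

print(bits_needed)    # 21
print(encoded.hex())  # f48fbfbf
print(len(encoded))   # 4
```

    Python's own chr() raises ValueError for anything above 0x10FFFF, which
    is the same restriction seen from an application's side.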

    >Everything else is out of the Unicode/ISO/IEC 10646 codespace, so it is
    excluded from the encoding scheme.
    >You are referring to the UTF-8 transformation algorithm published in an old
    RFC which has long been obsolete. It's true that it is limited to
    transforming at most 31 bits (i.e. only non-negative values for signed
    32-bit numbers).

    Right.
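
    For reference, a rough sketch of that older 31-bit transformation (the
    one- to six-byte form, as in RFC 2279) might look like the following in
    Python. The function name encode_31bit is made up for illustration, and
    this is the raw transformation only: it makes no attempt to restrict
    values to the Unicode codespace or to exclude surrogates.

```python
def encode_31bit(value: int) -> bytes:
    """Encode a non-negative value below 2**31 in 1 to 6 bytes."""
    if not 0 <= value < 2**31:
        raise ValueError("value out of 31-bit range")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, lead byte marker, continuation byte count)
    for limit, lead, ncont in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if value < limit:
            break
    out = []
    # Emit continuation bytes from the least significant 6 bits upward.
    for _ in range(ncont):
        out.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever bits remain go into the lead byte.
    out.append(lead | value)
    return bytes(reversed(out))


print(encode_31bit(0x10FFFF).hex())    # f48fbfbf
print(encode_31bit(0x110000).hex())    # f4908080
print(encode_31bit(0x7FFFFFFF).hex())  # fdbfbfbfbfbf
```

    Up to U+10FFFF the output coincides with standard UTF-8 (surrogates
    aside); beyond that it produces sequences that standard UTF-8 does not
    allow.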

    >An application that would use codepoints above U+10FFFF (or code units below 0
    or above 0x10FFFF) within strings would not be conforming to Unicode/ISO/IEC
    10646. These are not used to refer to any present or future
    Unicode/ISO/IEC-10646 characters. Forget it.

    It turns out that one can't simply forget them, because each specific
    lexer needs to handle such values as an error. A lexer generator such as
    Flex can't know what each specific lexer will do with them, so it has to
    implement some kind of general mechanism anyhow.

    >If you need to encode such data, don't label it as UTF-8, UTF-16, or UTF-32,
    but create your own encoding scheme, and don't expect interoperability for
    something that has no meaning in Unicode/ISO/IEC 10646...

    Right. That is one approach. But in a lexer generator it seems convenient to
    write
       [\u110000-\uffffffff] error ...
    And it might be good to have some conformance on that.

    >In UTF-8, the longest byte sequence is 4 bytes; there are no 5-byte or
    6-byte sequences. Reread the Unicode standard, in the "conformance"
    section.

    I am well aware of that. But the other values will still have to be handled
    by the lexer generator, otherwise one cannot write proper error handling.
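
    As a sketch of what I mean (in Python rather than Flex, with the made-up
    names decode_extended and in_unicode_codespace), the decoding layer can
    accept the longer sequences and hand back the raw value, so that a
    separate step decides whether it is a valid character or an error.
    Surrogates are not treated specially here.

```python
def decode_extended(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one 1- to 6-byte sequence; return (value, bytes consumed)."""
    first = data[pos]
    if first < 0x80:
        return first, 1
    # (lead mask, lead marker, continuation count, smallest legal value)
    for mask, marker, ncont, minimum in (
        (0xE0, 0xC0, 1, 0x80),
        (0xF0, 0xE0, 2, 0x800),
        (0xF8, 0xF0, 3, 0x10000),
        (0xFC, 0xF8, 4, 0x200000),
        (0xFE, 0xFC, 5, 0x4000000),
    ):
        if (first & mask) == marker:
            break
    else:
        raise ValueError("invalid lead byte")
    # Payload bits of the lead byte are those outside the mask.
    value = first & ~mask & 0xFF
    tail = data[pos + 1 : pos + 1 + ncont]
    if len(tail) != ncont or any((b & 0xC0) != 0x80 for b in tail):
        raise ValueError("truncated or malformed sequence")
    for b in tail:
        value = (value << 6) | (b & 0x3F)
    if value < minimum:
        raise ValueError("overlong sequence")
    return value, ncont + 1


def in_unicode_codespace(value: int) -> bool:
    """True if the value lies in U+0000..U+10FFFF."""
    return 0 <= value <= 0x10FFFF


value, length = decode_extended(b"\xf4\x90\x80\x80")
print(hex(value), length, in_unicode_codespace(value))  # 0x110000 4 False
```

    The point of the split is that the second step can report "value out of
    range" as an ordinary, well-defined error instead of the decoder failing
    on bytes it does not recognize.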

    >The old RFC you're referring to is not designating UTF-8, but UTF-BSS, which is
    >a transformation format,

    OK. Fine, so we have a name for it.

    >...but not an encoding scheme (an encoding scheme is the combination of an
    encoded character set, and a transformation format for transmission of
    arbitrary codes on streams of bytes; the encoding scheme needs to be
    reversible so that when decoding, it will return code points or code units
    within the codespace defined in the encoded charset; as the encoded charset
    in ISO-10646 is bounded to codepoints between 0 and 0x10FFFF, an encoding
    scheme restricts the transformation only to the code space used in the
    encoded charset, and so that's what the UTF-8 encoding scheme does).

    So, using this terminology, I want the underlying UTF-BSS to handle all 32
    bits, without speculating on its use. UTF-8 itself will still be restricted
    as the Unicode standard specifies.


