From: Hans Aberg (email@example.com)
Date: Mon Jan 17 2005 - 18:57:08 CST
[Warning: Your letter got high spam points for unknown reasons, and was
classified as spam on my end.]
At 20:36 +0100 2005/01/17, Philippe VERDY wrote:
>The standard UTF-8 encoding scheme version does not encode more than 21 bits.
In fact it only encodes codepoints from U+0000 to U+10FFFF inclusive.
>(Note that this is exactly the same codespace as in the standard UTF-16 and
UTF-32 encoding schemes).
I am aware of that. See more details in my reply to Kenneth Whistler.
>Everything else is out of the Unicode/ISO/IEC 10646 codespace, so it is
excluded from the encoding scheme.
>You are referring to the UTF-8 transformation algorithm published in an old RFC
which has long been obsoleted. It's true that it is limited to
transform only 31 bits at most (i.e. only non-negative values for signed
>An application that would use codepoints above U+10FFFF (or code units below 0
or above 0x10FFFF) within strings would not be conforming to Unicode/ISO/IEC
10646. These are not used to refer to any present or future
Unicode/ISO/IEC-10646 characters. Forget it.
It turns out that one can't do that, because each specific lexer needs to
handle that as an error. And a lexer generator such as Flex can't know what
will happen for each specific lexer. So it will have to implement some kind
of general mechanism anyhow.
>If you need to encode such data, don't label it as UTF-8, UTF-16, or UTF-32,
but create your own encoding scheme, and don't expect interoperability for
something that has no meaning in Unicode/ISO/IEC 10646...
Right. That is one approach. But in a lexer generator it seems convenient to
[\u110000-\uffffffff] error ...
And it might be good to have some conformance on that.
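As a rough illustration of what such a catch-all rule amounts to, here is a
minimal Python sketch of the range test. The names UNICODE_MAX, VALUE_MAX and
classify are hypothetical and are not part of Flex or any other tool; the
sketch also ignores the surrogate range, which a conforming checker would
have to treat separately.

```python
# Hypothetical sketch: classify a decoded 32-bit value the way the rule
# [\u110000-\uffffffff] error would, assuming the lexer sees plain integers.
UNICODE_MAX = 0x10FFFF      # last code point of the Unicode/ISO 10646 codespace
VALUE_MAX = 0xFFFFFFFF      # largest value an unsigned 32-bit unit can hold


def classify(value: int) -> str:
    """Return "char" for values inside the codespace, "error" for the rest."""
    if not 0 <= value <= VALUE_MAX:
        raise ValueError("value does not fit in 32 bits")
    if value <= UNICODE_MAX:
        return "char"
    return "error"          # the range 0x110000..0xFFFFFFFF
```

Each specific lexer can then decide what "error" means for it; the generator
only has to make the range expressible.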
>In UTF-8, the longest byte sequence is 4 bytes; there are no 5-byte or
6-byte sequences. Reread the Unicode standard, in the "conformance"
I am well aware of that. But the other values will still have to be handled
by the lexer generator, otherwise one cannot write proper error handling.
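To make the point concrete, here is a small Python sketch, under stated
assumptions, of a decoder for the older 1-to-6-byte form that hands every
decoded value to the caller, so that values above 0x10FFFF can be reported
as errors rather than silently dropped. The names decode_extended and scan
are hypothetical, and the sketch does not reject overlong forms or
surrogates, which a strict UTF-8 decoder must do.

```python
# Hypothetical sketch: decode the older 1-to-6-byte transformation so that
# out-of-range values reach the error handling instead of vanishing.
UNICODE_MAX = 0x10FFFF


def decode_extended(data: bytes):
    """Yield the integer value of each byte sequence in data."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            length, value = 1, b0
        elif 0xC0 <= b0 < 0xE0:
            length, value = 2, b0 & 0x1F
        elif 0xE0 <= b0 < 0xF0:
            length, value = 3, b0 & 0x0F
        elif 0xF0 <= b0 < 0xF8:
            length, value = 4, b0 & 0x07
        elif 0xF8 <= b0 < 0xFC:
            length, value = 5, b0 & 0x03
        elif 0xFC <= b0 < 0xFE:
            length, value = 6, b0 & 0x01
        else:
            # stray continuation byte, or 0xFE/0xFF
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i}")
            value = (value << 6) | (b & 0x3F)
        yield value
        i += length


def scan(data: bytes):
    """Tag each decoded value as "char" or "error" for the lexer."""
    return [("char" if v <= UNICODE_MAX else "error", v)
            for v in decode_extended(data)]
```

With this shape, the decision about what to do with an "error" token stays
in the individual lexer, which is where it belongs.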
>The old RFC you're referring to does not designate UTF-8, but UTF-BSS, which is
>a transformation format,
OK. Fine, so we have a name for it.
>...but not an encoding scheme (an encoding scheme is the combination of an
encoded character set, and a transformation format for transmission of
arbitrary codes on streams of bytes; the encoding scheme needs to be
reversible so that when decoding, it will return code points or code units
within the codespace defined in the encoded charset; as the encoded charset
in ISO-10646 is bounded to codepoints between 0 and 0x10FFFF, an encoding
scheme restricts the transformation only to the code space used in the
encoded charset, and so that's what the UTF-8 encoding scheme does).
So, using this terminology, I want the underlying UTF-BSS to handle all 32
bits, without speculating on how they are used. UTF-8 itself will still be
restricted as the Unicode standard specifies.
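The split between a general transformation and a restricted encoding scheme
can be sketched in Python as two layers. This is only an illustration: the
names encode_general and encode_utf8_strict are hypothetical, and the general
layer here covers the 31 bits of the older 1-to-6-byte pattern. Carrying a
full 32 bits would need a further extension that this sketch does not attempt.

```python
# Hypothetical sketch: a general byte transformation, with the UTF-8
# restriction applied as a separate layer on top of it.
UNICODE_MAX = 0x10FFFF

# (exclusive upper limit, lead-byte marker, number of continuation bytes)
_FORMS = (
    (0x800, 0xC0, 1),
    (0x10000, 0xE0, 2),
    (0x200000, 0xF0, 3),
    (0x4000000, 0xF8, 4),
    (0x80000000, 0xFC, 5),
)


def encode_general(value: int) -> bytes:
    """Encode any value in 0..0x7FFFFFFF using the 1-to-6-byte pattern."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range of this sketch")
    if value < 0x80:
        return bytes([value])
    for limit, lead, ncont in _FORMS:
        if value < limit:
            out = [lead | (value >> (6 * ncont))]
            for k in range(ncont - 1, -1, -1):
                out.append(0x80 | ((value >> (6 * k)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable")


def encode_utf8_strict(value: int) -> bytes:
    """Apply the Unicode restriction, then reuse the general transformation."""
    if value > UNICODE_MAX or 0xD800 <= value <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return encode_general(value)
```

Only the strict layer would be labelled UTF-8; the general layer is what a
lexer generator would use internally to describe and reject everything else.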