Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Mon Jan 17 2005 - 18:57:08 CST

Next message: Asmus Freytag: "Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: 32'nd bit & UTF-8"
Maybe in reply to: Hans Aberg: "32'nd bit & UTF-8"
Next in thread: Philippe Verdy: "Re: 32'nd bit & UTF-8"
Reply: Philippe Verdy: "Re: 32'nd bit & UTF-8"

[Warning: Your letter got a high spam score for unknown reasons, and was
classified as spam on my end.]

At 20:36 +0100 2005/01/17, Philippe VERDY wrote:
>The standard version of the UTF-8 encoding scheme does not encode more than
>21 bits. In fact it only encodes codepoints from U+0000 to U+10FFFF inclusive.
>
>(Note that this is exactly the same codespace as in the standard UTF-16 and
>UTF-32 encoding schemes).

I am aware of that. See more details in my reply to Kenneth Whistler.

>Everything else is outside the Unicode/ISO/IEC 10646 codespace, so it is
>excluded from the encoding scheme.
>You are referring to the UTF-8 transformation algorithm published in an old
>RFC that has long been obsolete. It's true that it can transform at most
>31 bits (i.e. only the non-negative values of signed 32-bit numbers).

Right.
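
To make the 31-bit limit concrete, here is a minimal Python sketch of the older
1-to-6-byte transformation discussed above. The name encode_extended is
hypothetical, and the bit layout is assumed to be the one from the obsolete RFC;
anything it produces for values above U+10FFFF is not conforming UTF-8.

```python
def encode_extended(value: int) -> bytes:
    """Encode 0 <= value < 2**31 with the old 1-to-6-byte layout.

    Hypothetical helper for illustration only. It does not reject
    surrogates, and output above U+10FFFF is NOT conforming UTF-8.
    """
    if not 0 <= value < 2**31:
        raise ValueError("value outside the 31-bit range")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, sequence length, lead-byte marker)
    for limit, length, marker in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if value < limit:
            break
    out = bytearray(length)
    # Fill continuation bytes from the end, six payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (value & 0x3F)
        value >>= 6
    # Whatever bits remain go into the lead byte.
    out[0] = marker | value
    return bytes(out)
```

Up to U+10FFFF the result agrees with standard UTF-8; the 5-byte and 6-byte
forms only appear for values from 0x200000 upward.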

>An application that would use codepoints above U+10FFFF (or code units below 0
>or above 0x10FFFF) within strings would not be conforming to Unicode/ISO/IEC
>10646. These are not used to refer to any present or future
>Unicode/ISO/IEC-10646 characters. Forget it.

It turns out that one can't do that, because each specific lexer needs to
handle that as an error. And a lexer generator such as Flex can't know what
will happen for each specific lexer. So it will have to implement some kind
of general mechanism anyhow.

>If you need to encode such data, don't label it as UTF-8, UTF-16, or UTF-32,
>but create your own encoding scheme, and don't expect interoperability for
>something that has no meaning in Unicode/ISO/IEC 10646...

Right. That is one approach. But in a lexer generator it seems convenient to
write
[\u110000-\uffffffff] error ...
And it might be good to have some conformance on that.
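
The effect of a rule like the one above can be sketched in Python. This is only
an illustration of the range test, not Flex syntax or Flex behavior; the names
classify, UNICODE_MAX and VALUE_MAX are hypothetical.

```python
UNICODE_MAX = 0x10FFFF    # last codepoint in the Unicode codespace
VALUE_MAX = 0xFFFFFFFF    # largest value the rule above mentions


def classify(value: int) -> str:
    """Mimic the rule: values in [0x110000, 0xFFFFFFFF] are errors."""
    if 0 <= value <= UNICODE_MAX:
        return "codepoint"
    if UNICODE_MAX < value <= VALUE_MAX:
        return "error"
    raise ValueError("value does not fit in 32 bits")
```

For example, classify(0x10FFFF) gives "codepoint" while classify(0x110000)
gives "error", so a specific lexer can attach its own error action to the
second case.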

>In UTF-8, the longest byte sequence is 4 bytes; there are no 5-byte or
>6-byte sequences. Reread the Unicode standard, in the "conformance"
>section.

I am well aware of that. But the other values will still have to be handled
by the lexer generator, otherwise one cannot write proper error handling.
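
One way to picture this: the generated scanner decodes the longer forms instead
of choking on them, so that the specific lexer can report them. Below is a
minimal Python sketch under that assumption. The name decode_extended is
hypothetical, the 5-byte and 6-byte layouts are those of the old transformation
format, and overlong forms are deliberately not checked.

```python
def decode_extended(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one value at pos; return (value, next position).

    Hypothetical helper: accepts 1-to-6-byte sequences so that values
    above U+10FFFF can be surfaced as errors by the caller. Overlong
    encodings and surrogates are not rejected in this sketch.
    """
    lead = data[pos]
    if lead < 0x80:
        return lead, pos + 1
    # (mask selecting the marker bits, expected marker, sequence length)
    for mask, marker, length in (
        (0xE0, 0xC0, 2),
        (0xF0, 0xE0, 3),
        (0xF8, 0xF0, 4),
        (0xFC, 0xF8, 5),
        (0xFE, 0xFC, 6),
    ):
        if (lead & mask) == marker:
            break
    else:
        raise ValueError(f"invalid lead byte 0x{lead:02X}")
    tail = data[pos + 1 : pos + length]
    if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
        raise ValueError("truncated or malformed sequence")
    value = lead & ~mask & 0xFF   # payload bits of the lead byte
    for b in tail:
        value = (value << 6) | (b & 0x3F)
    return value, pos + length
```

A caller can then compare the returned value with 0x10FFFF and run its own
error action when the value is larger.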

>The old RFC you're referring to does not describe UTF-8, but UTF-BSS, which is
>a transformation format,

OK. Fine, so we have a name for it.

>...but not an encoding scheme (an encoding scheme is the combination of an
>encoded character set, and a transformation format for transmission of
>arbitrary codes on streams of bytes; the encoding scheme needs to be
>reversible so that when decoding, it will return code points or code units
>within the codespace defined in the encoded charset; as the encoded charset
>in ISO-10646 is bounded to codepoints between 0 and 0x10FFFF, an encoding
>scheme restricts the transformation only to the code space used in the
>encoded charset, and so that's what the UTF-8 encoding scheme does).

So using this terminology, I want the underlying UTF-BSS to handle all 32
bits, without speculating about how they are used. UTF-8 itself will still be
restricted as the Unicode standard specifies.


This archive was generated by hypermail 2.1.5 : Mon Jan 17 2005 - 19:09:13 CST