UTF-8 nitpicking (was: RE: any unicode conversion tools?)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu May 13 2004 - 15:27:29 CDT


    It's time to nitpick the nitpicker. ;-)

    > 1. UCS-4, which is still defined by 10646 (but never by Unicode)
    > is limited at U-7FFF FFFF


    The space in "U-7FFF FFFF" is a Swedishism, not specified in
    the standard. The "U" and the "-" are designated as both
    optional for the 8-digit hex representation, leading to the
    marvelous anachronism of 10646 specifying an allowed UCS-4
    short identifier "-7FFFFFFF" for a positive (unsigned)
    hex value of 0x7FFFFFFF.
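
    As a small illustration (not from the standard's text itself), a Python
    sketch of the short identifier format being discussed, with the "U" and
    the "-" each treated as optional as 10646 allows:

```python
def ucs4_short_id(value: int, prefix: str = "U-") -> str:
    # 10646 UCS-4 short identifier: exactly 8 hex digits.
    # The "U" and the "-" are each designated optional, so any of
    # "U-", "U", "-", or "" may precede the digits.
    return f"{prefix}{value:08X}"

# The oddity noted above: keeping only the "-" yields a
# negative-looking identifier for a positive (unsigned) value:
print(ucs4_short_id(0x7FFFFFFF, "-"))   # "-7FFFFFFF"
```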

    > (nitpick: for some reason it's "U-"
    > not "U+"; don't ask me why).

    A WG2 committee "bright idea" that it was necessary to distinguish
    the UCS-4 and UCS-2 (later UTF-16) short identifiers formally,
    promptly vitiated by making the "U" and the "-" or "+" optional.

    > U-FFFF FFFF has always been
    > out of range. Probably so that one could use "signed" 32-bit
    > ints (not all p.l. have unsigned integer types).
    > 2. That "original" definition of UTF-8 (which was never in Unicode)
    > is still the definition of UTF-8 in 10646. So UTF-8/Unicode is
    > not the same as UTF-8/10646.

    This is not quite right. If you look at FSS-UTF, defined in
    Unicode *1.1*, it is basically identical to the current
    Annex D of 10646 (UTF-8). It defined the entire algorithm,
    with the specification out to 6 byte sequences. The sample
    implementation included in the text handled 6 byte sequences,
    and values up to 0x7FFFFFFF.
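
    For the curious, a Python sketch of that full algorithm as specified
    out to 6-byte sequences (the sample implementations in Unicode 1.1 and
    10646 Annex D were in C; this is a paraphrase, not the original code):

```python
def fss_utf_encode(value: int) -> bytes:
    """Encode a value per the original FSS-UTF / 10646 Annex D algorithm,
    which allows sequences up to 6 bytes for values up to 0x7FFFFFFF."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("out of range for FSS-UTF")
    if value < 0x80:
        return bytes([value])            # 1-byte form, ASCII
    # (upper limit, lead-byte marker, sequence length)
    forms = [(0x800, 0xC0, 2), (0x10000, 0xE0, 3), (0x200000, 0xF0, 4),
             (0x4000000, 0xF8, 5), (0x80000000, 0xFC, 6)]
    for limit, mark, length in forms:
        if value < limit:
            out = bytearray()
            for _ in range(length - 1):  # trailing bytes carry 6 bits each
                out.insert(0, 0x80 | (value & 0x3F))
                value >>= 6
            out.insert(0, mark | value)  # lead byte carries the remainder
            return bytes(out)
```

    For values up to U+10FFFF this produces the same bytes as modern UTF-8;
    above that it continues into the 5- and 6-byte forms, e.g.
    0x7FFFFFFF encodes as FD BF BF BF BF BF.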

    Unicode 1.1 had a footnote in the FSS-UTF section which stated:

    "Unicode only requires values up to FFFF and so only uses
    multi-byte characters of lengths up to 3, but for completeness
    the full ranges of the format are described."

    Recall that as of Version 1.1, "Unicode" was still synonymous
    with UCS-2 -- the incorporation of UTF-16 into the definition
    of Unicode didn't happen until Unicode 2.0.

    In Unicode 2.0, "UTF-8" was then mapped against "Unicode",
    i.e. UTF-16 ("Unicode with surrogate pairs"). The mapping only
    required use of UTF-8 sequences up to 4 bytes long, of course.
    But the sample implementation *still* handled up to 6-byte
    sequences, identically to the definition in 10646. So the
    *algorithmic* definition of UTF-8 was still identical to 10646.

    It was only with Unicode 3.0 (and the correlated 10646-1:2000)
    that this was rationalized to the Unicode definition of
    UTF-8 formally consisting of only 1-4 byte sequences, while
    simultaneously the potential need for 5 and 6-byte sequences
    in 10646 was removed, because of the removal of any private
    use planes past U+10FFFF in 10646.
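
    A quick check (in Python, as an illustration) of where that
    rationalization landed: a modern decoder rejects the old 5- and
    6-byte forms outright, since 0xF8-0xFD are no longer legal lead
    bytes, while U+10FFFF still fits in four bytes:

```python
# The old 6-byte encoding of 0x7FFFFFFF from the original algorithm:
old_style = bytes([0xFD, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF])
try:
    old_style.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True      # 0xFD is not a valid start byte post-Unicode 3.0
print("6-byte form rejected:", rejected)

# The modern ceiling, U+10FFFF, needs only a 4-byte sequence:
print(len(chr(0x10FFFF).encode("utf-8")), "bytes")
```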

    > In practice it does not matter
    > very much, since there are no (and will never be) any characters
    > allocated above U+10FFFF, and the private use planes above
    > U+10FFFF (which were specified in 10646) have been removed.



    This archive was generated by hypermail 2.1.5 : Thu May 13 2004 - 15:28:15 CDT