UTF-8 nitpicking (was: RE: any unicode conversion tools?)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu May 13 2004 - 15:27:29 CDT


    It's time to nitpick the nitpicker. ;-)

    > 1. UCS-4, which is still defined by 10646 (but never by Unicode)
    > is limited at U-7FFF FFFF


    The space in "U-7FFF FFFF" is a Swedishism, not specified in
    the standard. The "U" and the "-" are designated as both
    optional for the 8-digit hex representation, leading to the
    marvelous anachronism of 10646 specifying an allowed UCS-4
    short identifier "-7FFFFFFF" for a positive (unsigned)
    hex value of 0x7FFFFFFF.
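
    As a small illustration (not from the standard's text itself), a Python
    sketch of the short identifier format being discussed, with the "U" and
    the "-" each treated as optional as 10646 allows:

```python
def ucs4_short_id(value: int, prefix: str = "U-") -> str:
    # 10646 UCS-4 short identifier: exactly 8 hex digits.
    # The "U" and the "-" are each designated optional, so any of
    # "U-", "U", "-", or "" may precede the digits.
    return f"{prefix}{value:08X}"

# The oddity noted above: keeping only the "-" yields a
# negative-looking identifier for a positive (unsigned) value:
print(ucs4_short_id(0x7FFFFFFF, "-"))   # "-7FFFFFFF"
```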

    > (nitpick: for some reason it's "U-"
    > not "U+"; don't ask me why).

    A WG2 committee "bright idea" that it was necessary to distinguish
    the UCS-4 and UCS-2 (later UTF-16) short identifiers formally,
    promptly vitiated by making the "U" and the "-" or "+" optional.

    > U-FFFF FFFF has always been
    > out of range. Probably so that one could use "signed" 32-bit
    > ints (not all p.l. have unsigned integer types).
    > 2. That "original" definition of UTF-8 (which was never in Unicode)
    > is still the definition of UTF-8 in 10646. So UTF-8/Unicode is
    > not the same as UTF-8/10646.

    This is not quite right. If you look at FSS-UTF, defined in
    Unicode *1.1*, it is basically identical to the current
    Annex D of 10646 (UTF-8). It defined the entire algorithm,
    with the specification out to 6 byte sequences. The sample
    implementation included in the text handled 6 byte sequences,
    and values up to 0x7FFFFFFF.
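
    For the curious, a Python sketch of that full algorithm as specified
    out to 6-byte sequences (the sample implementations in Unicode 1.1 and
    10646 Annex D were in C; this is a paraphrase, not the original code):

```python
def fss_utf_encode(value: int) -> bytes:
    """Encode a value per the original FSS-UTF / 10646 Annex D algorithm,
    which allows sequences up to 6 bytes for values up to 0x7FFFFFFF."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("out of range for FSS-UTF")
    if value < 0x80:
        return bytes([value])            # 1-byte form, ASCII
    # (upper limit, lead-byte marker, sequence length)
    forms = [(0x800, 0xC0, 2), (0x10000, 0xE0, 3), (0x200000, 0xF0, 4),
             (0x4000000, 0xF8, 5), (0x80000000, 0xFC, 6)]
    for limit, mark, length in forms:
        if value < limit:
            out = bytearray()
            for _ in range(length - 1):  # trailing bytes carry 6 bits each
                out.insert(0, 0x80 | (value & 0x3F))
                value >>= 6
            out.insert(0, mark | value)  # lead byte carries the remainder
            return bytes(out)
```

    For values up to U+10FFFF this produces the same bytes as modern UTF-8;
    above that it continues into the 5- and 6-byte forms, e.g.
    0x7FFFFFFF encodes as FD BF BF BF BF BF.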

    Unicode 1.1 had a footnote in the FSS-UTF section which stated:

    "Unicode only requires values up to FFFF and so only uses
    multi-byte characters of lengths up to 3, but for completeness
    the full ranges of the format are described."

    Recall that as of Version 1.1, "Unicode" was still synonymous
    with UCS-2 -- the incorporation of UTF-16 into the definition
    of Unicode didn't happen until Unicode 2.0.

    In Unicode 2.0, "UTF-8" was then mapped against "Unicode",
    i.e. UTF-16 ("Unicode with surrogate pairs"). The mapping only
    required use of UTF-8 sequences up to 4 bytes long, of course.
    But the sample implementation *still* handled up to 6-byte
    sequences, identically to the definition in 10646. So the
    *algorithmic* definition of UTF-8 was still identical to 10646.

    It was only with Unicode 3.0 (and the correlated 10646-1:2000)
    that this was rationalized to the Unicode definition of
    UTF-8 formally consisting of only 1-4 byte sequences, while
    simultaneously the potential need for 5 and 6-byte sequences
    in 10646 was removed, because of the removal of any private
    use planes past U+10FFFF in 10646.
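
    A quick check (in Python, as an illustration) of where that
    rationalization landed: a modern decoder rejects the old 5- and
    6-byte forms outright, since 0xF8-0xFD are no longer legal lead
    bytes, while U+10FFFF still fits in four bytes:

```python
# The old 6-byte encoding of 0x7FFFFFFF from the original algorithm:
old_style = bytes([0xFD, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF])
try:
    old_style.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True      # 0xFD is not a valid start byte post-Unicode 3.0
print("6-byte form rejected:", rejected)

# The modern ceiling, U+10FFFF, needs only a 4-byte sequence:
print(len(chr(0x10FFFF).encode("utf-8")), "bytes")
```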

    > In practice it does not matter
    > very much, since there are no (and will never be) any characters
    > allocated above U+10FFFF, and the private use planes above
    > U+10FFFF (which were specified in 10646) have been removed.



    This archive was generated by hypermail 2.1.5 : Thu May 13 2004 - 15:28:15 CDT