Re: 32'nd bit & UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 19 2005 - 13:49:51 CST

    Hans Aberg wrote:

    > >> You probably mean that the overloaded UTF-BSS (or whatever the correct name
    > >> is)

    O.k., can we officially retire all the discussion of the nonexistent
    name "UTF-BSS", which was an artifact of Philippe Verdy not correctly
    recalling the name of "FSS-UTF" when he originally wrote a response
    on this thread??

    > >
    > > I wonder if there's a "correct name" for it. It seems that the most correct
    > > name for this transform would be a reference to the old RFC describing it,
    > > even if the title of the informative RFC incorrectly gives "UTF-8", and even
    > > if there's a symbolic name to refer to it, but only as a local symbol pointing
    > > to the bibliographic reference at the end of the text.
    >
    > I think it is a gap in the standards that they do not give it a name.

    Lookalike extensions of the bit-shifting principles used in UTF-8,
    which stretch the scheme into a general way of converting 32-bit
    numbers into byte streams that masquerade as UTF-8, and which acquire
    "BS" monikers like UTF-8BS, or CPBTF-8, or whatever, are *NOT*
    welcome additions. They are pernicious, because they would inflict
    on information processing applications byte streams that walk and
    quack like UTF-8 ducks but are not, in fact, ducks.
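
    To make the duck problem concrete, here is a minimal sketch in Python.
    The function name lookalike_encode is hypothetical, and the 5- and
    6-byte lead-byte patterns (111110xx and 1111110x) are assumed from the
    older FSS-UTF-style layouts; nothing here is defined by the Unicode
    Standard. The output has the same lead-byte/continuation-byte shape as
    UTF-8, yet a conformant UTF-8 decoder refuses it.

```python
def lookalike_encode(value: int) -> bytes:
    """Hypothetical lookalike encoder for numbers beyond the UTF-8 range.

    Assumes the old 5-byte (111110xx) and 6-byte (1111110x) lead-byte
    patterns. This is NOT UTF-8 and is not defined by the Unicode Standard.
    """
    if not 0x200000 <= value <= 0x7FFFFFFF:
        raise ValueError("this sketch only covers 0x200000..0x7FFFFFFF")
    if value <= 0x3FFFFFF:
        lead, trailing = 0xF8, 4   # 5-byte form: 2 + 4*6 = 26 payload bits
    else:
        lead, trailing = 0xFC, 5   # 6-byte form: 1 + 5*6 = 31 payload bits
    out = [lead | (value >> (6 * trailing))]
    for i in range(trailing - 1, -1, -1):
        # Each continuation byte is 10xxxxxx, exactly as in real UTF-8.
        out.append(0x80 | ((value >> (6 * i)) & 0x3F))
    return bytes(out)


def decodes_as_utf8(data: bytes) -> bool:
    """Return True only if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

    Under these assumptions, lookalike_encode(0x7FFFFFFF) yields the bytes
    FD BF BF BF BF BF, which decodes_as_utf8 reports as not UTF-8.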

    > It makes
    > discussions such as this one difficult. Generally, standards just define what is
    > legal, and do not provide names for what is outside it.

    Read again. The Unicode Standard defines both unassigned code points
    (valid code points that have not been designated a function, either
    as an encoded character or some other function such as surrogate
    code point) *and* it defines *ill-formed* code units in the character
    encoding forms, UTF-8, UTF-16, and UTF-32.

    0xFF is an ill-formed code unit in UTF-8. Clearly defined, and clearly
    given a name by the standard.

    TUS 4.0, p. 76:

      "Any UTF-32 code unit greater than 0010FFFF₁₆ is ill-formed."
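
    As a rough illustration of those two statements, the following Python
    sketch checks them mechanically. The helper names are hypothetical, and
    the UTF-8 check simply leans on Python's built-in strict decoder rather
    than restating the standard's byte-sequence table.

```python
def utf8_is_well_formed(data: bytes) -> bool:
    """Return True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def utf32_code_unit_is_well_formed(unit: int) -> bool:
    """A UTF-32 code unit must be a Unicode scalar value.

    That excludes anything above 0x10FFFF and the surrogate range
    0xD800..0xDFFF.
    """
    return 0 <= unit <= 0x10FFFF and not (0xD800 <= unit <= 0xDFFF)
```

    With these helpers, utf8_is_well_formed(b"\xff") is False, and
    utf32_code_unit_is_well_formed(0x110000) is False while the same call
    with 0x10FFFF is True.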

    > A name like
    > CPBTF-8 ("code point to binary transformation format") seems more
    > appropriate, since it is not a transformation dealing with characters at all,
    > but only dealing with how to transform code points into bytes.

    This is an invalid distinction.

    Definition D29 in TUS, 4.0, p. 74:

    "D29 A Unicode encoding form assigns each Unicode scalar value to a
    unique code unit sequence."

    It is *not* "a transformation dealing with characters", but a mapping
    from Unicode scalar values (shorthand for, and synonymous with, the
    code points 0000..D7FF and E000..10FFFF) to code unit sequences (bytes
    in the case of UTF-8, 16-bit units [wydes] in the case of UTF-16, and
    32-bit words in the case of UTF-32).
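
    A minimal Python sketch of that mapping, assuming nothing beyond the
    standard library: is_scalar_value and code_units are hypothetical helper
    names, and the built-in codecs stand in for the three encoding forms
    (the big-endian variants are used only so the code units can be
    unpacked without a byte order mark).

```python
import struct


def is_scalar_value(cp: int) -> bool:
    """True for 0000..D7FF and E000..10FFFF, i.e. no surrogates."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def code_units(cp: int) -> dict:
    """Map one Unicode scalar value to its code unit sequence in each form."""
    if not is_scalar_value(cp):
        raise ValueError(f"{cp:#x} is not a Unicode scalar value")
    s = chr(cp)
    b16 = s.encode("utf-16-be")
    b32 = s.encode("utf-32-be")
    return {
        "UTF-8": list(s.encode("utf-8")),                          # bytes
        "UTF-16": list(struct.unpack(f">{len(b16) // 2}H", b16)),  # 16-bit units
        "UTF-32": list(struct.unpack(f">{len(b32) // 4}I", b32)),  # 32-bit words
    }
```

    For example, code_units(0x10000) gives the bytes F0 90 80 80 for UTF-8,
    the 16-bit units D800 DC00 for UTF-16, and the single 32-bit word
    00010000 for UTF-32. No character properties are consulted anywhere:
    the input is just a number in the scalar value range.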

    --Ken


