UTR#17 comments (was RE: Unicode Public Review Issues update)

From: Peter Constable (petercon@microsoft.com)
Date: Fri Nov 28 2003 - 12:30:43 EST

    > -----Original Message-----
    > From: unicore-bounce@unicode.org [mailto:unicore-bounce@unicode.org]
    > On Behalf Of Rick McGowan

    > The following public review issues are new:
    > 25 Proposed Update UTR #17 Character Encoding Model 2004.01.27

    I have submitted the following comments, copied here in case anyone
    wishes to discuss them:

    The draft text for TR17, section 5 says, "A simple character encoding
    scheme is a mapping of each code unit of a CCS into a unique serialized
    byte sequence." It goes on to define a compound CES. Although the draft
    does not say so explicitly, Unicode's CESs do not fit the definition of
    a compound CES, and so the definition for simple CES must apply.

    The problem is that this definition cannot accommodate all seven Unicode
    CESs. Since it defines a CES as a mapping from each code unit, there are
    only two possible byte-order-dependent mappings for 16- and 32-bit code
    units. In other words, the distinction between UTF-16BE and UTF-16 data
    that is big-endian cannot be a CES distinction because individual code
    units are mapped in exactly the same way in both cases.
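
    A minimal Python sketch of this point, assuming 16-bit code units and
    using a hypothetical helper unit_to_bytes (the name is illustrative,
    not from TR17): a per-code-unit mapping has only a big-endian and a
    little-endian variant, and UTF-16BE output is just the big-endian
    variant applied unit by unit.

```python
import struct


def unit_to_bytes(unit: int, big_endian: bool) -> bytes:
    """Map one 16-bit code unit to its two-byte serialization."""
    return struct.pack(">H" if big_endian else "<H", unit)


# "A" followed by U+1D11E, written as UTF-16 code units (one surrogate pair).
units = [0x0041, 0xD834, 0xDD1E]

# The only two byte-order-dependent mappings available per code unit.
per_unit_be = [unit_to_bytes(u, big_endian=True) for u in units]
per_unit_le = [unit_to_bytes(u, big_endian=False) for u in units]

# Python's "utf-16-be" codec emits exactly the per-unit big-endian bytes,
# so nothing at the code-unit level can distinguish UTF-16BE from
# big-endian UTF-16 data.
utf16be_stream = "A\U0001D11E".encode("utf-16-be")
same_as_per_unit = b"".join(per_unit_be) == utf16be_stream
```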

    A definition for simple CES must, at a minimum, refer to a mapping of
    *streams* of code units if it is to include details about a byte-order
    mark that may or may not occur at the beginning of a stream.
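
    As a rough illustration, a stream-level mapping might look like the
    following Python sketch. The function serialize_stream and its
    parameters are assumptions made for the example, not anything defined
    in TR17. The byte-order mark is a property of the stream as a whole,
    so it cannot be expressed by a function of a single code unit.

```python
import struct
from typing import Iterable

BOM = 0xFEFF  # U+FEFF, used as the byte-order mark


def serialize_stream(units: Iterable[int], big_endian: bool, with_bom: bool) -> bytes:
    """Serialize a whole stream of 16-bit code units, optionally led by a BOM."""
    fmt = ">H" if big_endian else "<H"
    out = bytearray()
    if with_bom:
        # Emitted once per stream, before any code unit.
        out += struct.pack(fmt, BOM)
    for unit in units:
        out += struct.pack(fmt, unit)
    return bytes(out)


units = [0x0041, 0x0042]  # "AB"

unmarked_be = serialize_stream(units, big_endian=True, with_bom=False)
marked_be = serialize_stream(units, big_endian=True, with_bom=True)
marked_le = serialize_stream(units, big_endian=False, with_bom=True)
```

    Here unmarked_be and marked_be differ only in the leading FE FF; every
    code unit after that is serialized identically.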

    I would suggest that, in order to accommodate the UTF-16 and UTF-32
    CESs, an appropriate definition should actually be a level of
    abstraction away from "a mapping": a CES is a specification for
    mappings. Any mapping is necessarily deterministic, giving a specific
    output for each input. A mapping itself cannot serialize "in either
    big-endian or little-endian format"; it must be one or the other,
    unambiguously. On the other hand, a specification for how to map into
    byte sequences can be ambiguous in this regard. Thus, the UTF-16 CES can
    be considered a specification for mapping into byte sequences that
    allows a little-endian mapping or a big-endian mapping.
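
    One way to picture the suggestion, as a simplified Python sketch: model
    each mapping as a deterministic function, and a CES as a collection of
    permitted mappings. The names make_mapping, UTF16_SPEC, UTF16BE_SPEC,
    and conforms are all hypothetical, and the sketch deliberately leaves
    out the case of UTF-16 data with no byte-order mark.

```python
import struct
from typing import Callable, Dict, List

# A mapping is deterministic: one byte sequence for each input stream.
Mapping = Callable[[List[int]], bytes]


def make_mapping(big_endian: bool, with_bom: bool) -> Mapping:
    """Build one concrete stream mapping with a fixed byte order."""
    fmt = ">H" if big_endian else "<H"

    def mapping(units: List[int]) -> bytes:
        prefix = struct.pack(fmt, 0xFEFF) if with_bom else b""
        return prefix + b"".join(struct.pack(fmt, u) for u in units)

    return mapping


# A CES modeled as a specification: the set of mappings it permits.
# Simplification (assumption): unmarked UTF-16 data is not modeled here.
UTF16_SPEC: Dict[str, Mapping] = {
    "big-endian with BOM": make_mapping(big_endian=True, with_bom=True),
    "little-endian with BOM": make_mapping(big_endian=False, with_bom=True),
}
UTF16BE_SPEC: Dict[str, Mapping] = {
    "big-endian, no BOM": make_mapping(big_endian=True, with_bom=False),
}


def conforms(spec: Dict[str, Mapping], units: List[int], data: bytes) -> bool:
    """True if some mapping permitted by the specification yields `data`."""
    return any(mapping(units) == data for mapping in spec.values())
```

    Under this model, conforms(UTF16_SPEC, [0x0041], ...) accepts both
    FE FF 00 41 and FF FE 41 00, while UTF16BE_SPEC permits a single
    mapping and so accepts only 00 41.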

    Peter Constable
    Globalization Infrastructure and Font Technologies
    Microsoft Windows Division
