Re: Unicode conformant character encodings and us-ascii

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 16 2003 - 15:33:45 EDT

    Philippe Verdy stated:

    > Unicode only defines codepoints, not their serialization into
    > code units and not technical aspect such as byte order (which
    > is important for UTF-16 and UTF-32, also used to encode subsets
    > or sursets of Unicode such as the old UCS2 (which is just a
    > restriction of Unicode to the BMP but does not define a specific
    > serialization).

    Doug Ewell already responded to some of the issues in this post,
    but a few more points need correcting.

    In the above paragraph, I think there is a confusion which results
    from unclear usage of the phrase "Unicode defines...".

    If understood as "the character encoding of the Unicode Standard
    only defines code points...", then that is correct. The character
    encoding per se is just the assignment of code points to abstract
    characters.

    If, however, understood as "The Unicode Standard only defines
    code points, not their serialization into code units..." then
    that is clearly incorrect on several grounds.

    First, the Unicode Standard *does* also define encoding forms
    and their code units, and also defines encoding schemes and
    the byte serializations they use.

    Second, code points are not *serialized* into code units.
    Serialization is an issue for encoding schemes: it is the
    serialization of code units into byte sequences. See
    Chapter 3 of The Unicode Standard, Version 4.0 for all
    the details.
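
    To make the distinction concrete, here is a minimal sketch in
    Python. The function names are hypothetical, chosen only for
    illustration. The first step is the UTF-16 encoding form (code
    point to 16-bit code units); the second is the encoding scheme
    (code units to bytes, where byte order finally matters).

```python
import struct

def utf16_code_units(code_point):
    """Encoding form: map one code point to a list of 16-bit code units."""
    if not (0 <= code_point <= 0x10FFFF) or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]              # BMP: a single code unit
    offset = code_point - 0x10000        # supplementary: a surrogate pair
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

def serialize(code_units, big_endian):
    """Encoding scheme: serialize 16-bit code units into a byte sequence."""
    fmt = (">" if big_endian else "<") + "H" * len(code_units)
    return struct.pack(fmt, *code_units)

# U+10400 gives the same two code units in either scheme;
# only the byte serialization differs.
units = utf16_code_units(0x10400)
utf16be = serialize(units, big_endian=True)
utf16le = serialize(units, big_endian=False)
```

    Byte order never enters the first function. It only becomes
    a question once code units are turned into bytes.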

    >
    > One could argue that all *precisely defined* legacy character
    > encodings (this includes the new GB2312 encoding)

    As Doug pointed out, Philippe probably means GB 18030 here.
    GB 2312 is an *old* character encoding standard. It was published
    in 1981.

    > that work on subsets of Unicode are Unicode conformant,

    This is a misapplication of the term "Unicode-conformant".
    Legacy character encoding standards defined outside the context
    of the Unicode Standard (and, indeed, often published before
    there even was a Unicode Standard) cannot be conformant
    to the Unicode Standard.

    What I think Philippe is trying to indicate here is that
    other character encodings which have repertoires that are
    strict subsets of the Unicode Standard can *interoperate*
    with implementations of the Unicode Standard.
     
    > as they are encoding forms for their equivalent Unicode
    > strings. However they must be considered as distinct
    > encodings and character sets, because they cannot represent
    > exactly all Unicode strings (including its non normalized forms).

    There should be no question that other character encodings
    are distinct character encodings. ;-)

    The point seems to be that other legacy character encodings
    have only a subset of the character repertoire of the
    Unicode Standard, and thus cannot represent all Unicode
    characters.
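
    A quick way to see the repertoire limitation, sketched in Python
    with its bundled codecs. The helper name can_represent is
    hypothetical, and the codec names are Python's labels for these
    encodings, not anything defined by the Unicode Standard.

```python
def can_represent(text, codec):
    """Return True if every character in text has an encoding in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

han = "\u4e2d"          # a common Han ideograph, in the GB 2312 repertoire
deseret = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP

han_in_gb2312 = can_represent(han, "gb2312")
deseret_in_gb2312 = can_represent(deseret, "gb2312")
han_in_latin1 = can_represent(han, "iso-8859-1")
deseret_in_utf8 = can_represent(deseret, "utf-8")
```

    The legacy codecs succeed only for characters in their own
    repertoires, while a Unicode encoding form handles both strings.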

    >
    > However ISO2022 is conforming with Unicode,

    This is *not* the case.

    > and can be seen as an alternative for general purpose Unicode
    > encoding forms,

    This is also *not* the case.

    > because of its ability to switch to many
    > encoding forms including UTF* encoding forms.

    I think what Philippe is trying to claim here is that by
    use of ISO 2022 (and multiple, individual character sets
    registered for use with ISO 2022, including ISO/IEC 10646,
    of course), one can represent a large number of characters.
    That is certainly true. And since ISO/IEC 10646, including
    UTF-8 or UTF-16, can be used in the ISO 2022 framework, it
    is trivially true that one can represent all Unicode
    characters in an ISO 2022 framework. One can simply announce
    UTF-8, e.g. with:

       ESC %/I (0x1B 0x25 0x2F 0x49)
       
    and then merrily continue with a UTF-8 data stream for
    as long as one likes.
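
    As a rough sketch of what that looks like on the wire (Python,
    with hypothetical function names): the announcement is just those
    four bytes in front of ordinary UTF-8 data. This is nowhere near
    a real ISO 2022 parser. It recognizes this one escape sequence
    and nothing else.

```python
# ESC % / I, the four bytes quoted above.
ANNOUNCE_UTF8 = bytes([0x1B, 0x25, 0x2F, 0x49])

def announce_and_encode(text):
    """Prefix a UTF-8 byte stream with the announcing escape sequence."""
    return ANNOUNCE_UTF8 + text.encode("utf-8")

def decode_after_announcement(stream):
    """Check for the announcement, then decode the remainder as UTF-8."""
    if not stream.startswith(ANNOUNCE_UTF8):
        raise ValueError("stream does not start with ESC % / I")
    return stream[len(ANNOUNCE_UTF8):].decode("utf-8")

sample = "A\u00e9\u4e2d\U00010400"
stream = announce_and_encode(sample)
round_tripped = decode_after_announcement(stream)
```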

    > The difference
    > is that its full implementation is extremely complex as it is
    > based on a repertoire of encodings not defined by Unicode, and
    > requires a lot of specific parsers for each supported subsets
    > and subencoding.

    That certainly seems true to me. Nobody is going to dispute
    that ISO 2022 implementations have complex character
    handling requirements.

    --Ken
