Re: Unicode conformant character encodings and us-ascii

From: Doug Ewell (
Date: Sat May 17 2003 - 17:09:15 EDT

    Philippe Verdy <verdy_p at wanadoo dot fr> wrote:

    >> Are not BE and LE regarded as different encoding forms, making five
    >> encoding forms (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE & UTF-32LE)?
    > This was true up to Unicode 3.x, but not since Unicode 4.0.0 which
    > makes a distinction between encoding forms (that are serialization of
    > code points to an ordered list of fixed-width code units), and
    > encoding schemes (that are serialization of these encoding forms to
    > ordered streams of bytes, taking into account the byte ordering).

    This distinction is not new with Unicode 4.0. It has been in place
    since UTR #17, "Character Encoding Model," was first published in 1999.
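
    A rough sketch of that distinction in Python, for anyone who wants to
    see it concretely. The helper names (utf16_code_units, serialize) are
    made up for this illustration and are not from UTR #17; the first
    function stands in for the encoding form (code points to 16-bit code
    units), the second for the encoding scheme (code units to bytes in a
    stated byte order).

```python
import struct

def utf16_code_units(text):
    """Encoding form step: map code points to 16-bit code units (integers)."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            # Supplementary code point: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
    return units

def serialize(units, big_endian):
    """Encoding scheme step: write code units as bytes in a chosen byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)

text = "A\u20ac\U00010400"
units = utf16_code_units(text)          # one list of code units
be = serialize(units, big_endian=True)  # UTF-16BE byte stream
le = serialize(units, big_endian=False) # UTF-16LE byte stream

print([hex(u) for u in units])
print(be.hex(), le.hex())
```

    The same list of code units yields two different byte streams, which
    is why byte order belongs to the scheme and not to the form.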

    > There are also other non-standardized encoding forms in Unicode
    > Technical reports (BOCU and SCSU which use 8-bit code units like
    > UTF-8, and use a trivial serialization to bytes), plus an additional
    > encoding scheme (CESU, which generates streams of bytes from the
    > UTF-16 encoding form).

    SCSU and BOCU-1 are Transfer Encoding Syntaxes, not Character Encoding
    Schemes and certainly not Character Encoding Forms. The distinction
    really is important, even if it doesn't appear so at first.

    BTW, another terminology point:

    "BOCU" is the general name for the compression technique that involves
    (a) encoding the difference of each code point from the previous, (b)
    adjustment of the "previous" value to improve efficiency, and (c)
    encoding the resulting difference in such a way as to preserve binary
    ordering (and possibly also achieve other TES-like goals). "BOCU-1,"
    described in Technical Note #6, is a specific implementation of the BOCU
    technique.
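
    To make steps (a) and (b) concrete, here is a toy sketch in Python. It
    is NOT BOCU-1: the function names are hypothetical, the rule used to
    adjust the "previous" value (middle of the current 128-code-point
    block) and the starting value 0x40 are assumptions chosen for
    illustration, and step (c), the order-preserving byte encoding of each
    difference, is left out entirely.

```python
def adjust_prev(cp):
    # Step (b), assumed rule: use the middle of the 128-code-point block
    # containing cp, so nearby characters give small differences.
    return (cp & ~0x7F) + 0x40

def toy_deltas(text, initial_prev=0x40):
    """Step (a): difference of each code point from the adjusted previous value."""
    prev = initial_prev
    deltas = []
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - prev)
        prev = adjust_prev(cp)
    return deltas

def toy_undo(deltas, initial_prev=0x40):
    """Reverse the process: rebuild the text from the differences."""
    prev = initial_prev
    chars = []
    for d in deltas:
        cp = prev + d
        chars.append(chr(cp))
        prev = adjust_prev(cp)
    return "".join(chars)

sample = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a short Cyrillic word
deltas = toy_deltas(sample)
print(deltas)
```

    After the first character, every difference is small because the text
    stays within one script block; that is what a real implementation
    would then pack into few bytes.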

    > UTF-7 is also described in a Technical Report

    No it isn't. It's described in RFC 2152.

    > and is an encoding form (to 7-bit code units) with trivial encoding
    > scheme to bytes (not so trivial, because parity bits can be set for
    > the high bit for streamed transmission purpose, that can safely be
    > changed/ignored, so there are equivalent bytes).

    Neither UTF-7 nor SCSU nor BOCU-1 is an encoding form. Please, please
    read UTR #17 or the equivalent Unicode 4.0 text and make sure you
    understand the distinctions drawn therein.

    > ISO2022, GB18030 and the new JIS standard are mostly viewed as
    > encoding forms with a trivial identity encoding scheme because the
    > code units are 8 bits.

    ISO 2022 lives in a *completely different world* from Unicode/10646. It
    is not a Unicode CEF, CES, or TES. It is best not to think of ISO 2022
    and Unicode as being related in any way, even if there are ISO 2022
    escape sequences to switch into and out of UTF-8.
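
    For a feel of how different that world is, the sketch below uses
    Python's iso2022_jp codec, picked only as a convenient ISO 2022-based
    encoding (it is not the UTF-8 switching mechanism mentioned above).
    The byte stream carries escape sequences that change which character
    set the following bytes are interpreted in.

```python
text = "abc\u65e5\u672c\u8a9e"  # ASCII letters followed by three kanji

encoded = text.encode("iso2022_jp")
print(encoded)

# The stream is stateful: an escape sequence designates a new character
# set before the kanji, and another one returns to ASCII at the end.
has_switch_in = b"\x1b$B" in encoded
has_switch_out = encoded.endswith(b"\x1b(B")
seven_bit_only = all(b < 0x80 for b in encoded)
```

    Nothing like this designation state exists in a Unicode CEF or CES,
    where a given code unit sequence always means the same code point.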

    What new JIS standard? Did I miss something else?

    -Doug Ewell
     Fullerton, California
