Re: Unicode conformant character encodings and us-ascii

From: Doug Ewell (dewell@adelphia.net)
Date: Fri May 16 2003 - 02:23:02 EDT

    Philippe Verdy <verdy_p at wanadoo dot fr> wrote:

    > Don't forget other Unicode encoding forms: UTF-7, BOCU and SCSU also
    > assign code units in the ASCII range. This is still the same Unicode
    > encoding (for codepoints), but definitely not the same code units.

    SCSU does use the ASCII bytes for the ASCII range, except for the
    lesser-used C0 control characters. Specifically, NUL (0x00), HT (0x09),
    LF (0x0A), and CR (0x0D) are represented as themselves in SCSU, while
    other C0 controls are preceded by the SQ0 tag, 0x01.

    Unfortunately, that "other" list includes FF (0x0C), which does appear
    in a good deal of ASCII text. So it's not strictly true that SCSU
    encodes ASCII as ASCII, but it's very close.
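
    The rule above can be sketched in a few lines of Python. This is only
    an illustration for pure-ASCII input in SCSU's initial single-byte
    mode; the function and constant names are made up for this example,
    and it is not a general SCSU encoder.

```python
# C0 controls that SCSU passes through unchanged in single-byte mode.
SCSU_PASSTHROUGH_C0 = {0x00, 0x09, 0x0A, 0x0D}  # NUL, HT, LF, CR
SQ0 = 0x01  # "quote from window 0" tag


def ascii_to_scsu(data: bytes) -> bytes:
    """Encode pure-ASCII bytes as SCSU (hypothetical helper, ASCII only)."""
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("input must be pure ASCII")
        # Any other C0 control would be read as an SCSU tag byte,
        # so it has to be quoted with SQ0 first.
        if b < 0x20 and b not in SCSU_PASSTHROUGH_C0:
            out.append(SQ0)
        out.append(b)
    return bytes(out)
```

    Text containing only printable ASCII, NUL, HT, LF, and CR comes out
    byte-for-byte identical; a form feed grows to the two bytes 0x01 0x0C.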

    By contrast, BOCU-1 encodes *only* the C0 controls and SPACE as
    themselves; all other ASCII characters are not represented as themselves
    (and some are represented with different bytes in the ASCII range). You
    can convert pure-ASCII text to BOCU-1 by leaving the spaces and controls
    alone and adding 0x50 to everything else. (This is a gross
    oversimplification, of course.)
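
    For what it's worth, here is that oversimplification as a Python
    sketch. It assumes the input is pure ASCII and that the encoder is in
    its initial state; real BOCU-1 is stateful and encodes differences
    from the previous character, so this does not generalize. The
    function name is invented for the example.

```python
def ascii_to_bocu1(data: bytes) -> bytes:
    """Convert pure-ASCII bytes to BOCU-1 (hypothetical helper, ASCII only)."""
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("input must be pure ASCII")
        # C0 controls (0x00-0x1F) and SPACE (0x20) stay as they are;
        # every other ASCII character is shifted up by 0x50.
        out.append(b if b <= 0x20 else b + 0x50)
    return bytes(out)
```

    Note that "!" (0x21) becomes 0x71, which is "q" in ASCII: a different
    byte that is still in the ASCII range.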

    UTF-7 is somewhere in the middle. Most ASCII characters are represented
    with the same bytes, but a few printable ones are not: "+" introduces
    a base64 run and so must itself be written as "+-", and "~" and "\"
    must always be base64-encoded.
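
    One way to see this is with the "utf-7" codec that ships with Python.
    The helper below is a made-up name for this example. Exactly which
    optional characters an encoder leaves alone is its own choice, so
    treat the output as one encoder's behavior rather than as the
    definition of UTF-7.

```python
def utf7_changed(text: str) -> list:
    """Return the ASCII characters in text that UTF-7 does not leave as-is."""
    return [ch for ch in text if ch.encode("utf-7") != ch.encode("ascii")]


# Plain words, spaces, and common punctuation pass through untouched.
print(utf7_changed("Hello, world."))

# The plus sign is the shift character, so it cannot stand for itself.
print("a+b".encode("utf-7"))
```

    Decoding with the same codec restores the original text, so the
    escapes are lossless; they just aren't ASCII-transparent.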

    BTW, in a couple of other messages you referred to "CESU" when I'm
    pretty sure you meant SCSU. Don't drag CESU-8 into this discussion.
    CESU-8 is a hack which applies the UTF-8 conversion to UTF-16 code units
    instead of Unicode scalar values. It's not a compression format.
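
    To make the difference concrete, here is a small Python sketch of the
    CESU-8 idea: split the text into UTF-16 code units, then apply the
    UTF-8 bit layout to each unit separately. The function name is
    invented, and the "surrogatepass" error handler is used only to let
    Python emit the surrogate halves.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper for illustration)."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # Each 16-bit code unit, surrogates included, gets its own
        # one- to three-byte UTF-8-style sequence.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)
```

    For BMP text the result is identical to UTF-8. A supplementary
    character such as U+10400 is four bytes in UTF-8 (F0 90 90 80) but
    six bytes in CESU-8 (ED A0 81 ED B0 80), so nothing is being
    compressed.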

    > One could argue that all *precisely defined* legacy character
    > encodings (this includes the new GB2312 encoding) that work on subsets
    > of Unicode are Unicode conformant, as they are encoding forms for
    > their equivalent Unicode strings.

    Again, be careful: the new Chinese standard you are thinking of is GB
    18030. It is backward compatible with the older GB 2312.
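
    The compatibility is easy to check with the "gb2312" and "gb18030"
    codecs bundled with Python, assuming those codecs follow the
    standards closely enough for the purpose. The helper name is made up
    for this example.

```python
def same_in_both(text: str) -> bool:
    """True if text has the same bytes in GB 2312 and GB 18030."""
    return text.encode("gb2312") == text.encode("gb18030")


# Two common hanzi that are in GB 2312.
print(same_in_both("\u4e2d\u6587"))

# A supplementary-plane character: GB 18030 covers it with a four-byte
# sequence, while the GB 2312 codec has no mapping for it at all.
print(len("\U0001F600".encode("gb18030")))
```

    Text that fits in GB 2312 keeps the same bytes under GB 18030; the
    newer standard adds coverage rather than reassigning the old codes.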

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Fri May 16 2003 - 03:20:36 EDT