From: Doug Ewell (email@example.com)
Date: Fri May 16 2003 - 02:23:02 EDT
Philippe Verdy <verdy_p at wanadoo dot fr> wrote:
> Don't forget other Unicode encoding forms: UTF-7, BOCU and SCSU also
> assign code units in the ASCII range. This is still the same Unicode
> encoding (for codepoints), but definitely not the same code units.
SCSU does use the ASCII bytes for the ASCII range, except for the
lesser-used C0 control characters. Specifically, NUL (0x00), HT (0x09),
LF (0x0A), and CR (0x0D) are represented as themselves in SCSU, while
other C0 controls are preceded by the SQ0 tag, 0x01.
Unfortunately, that "other" list includes FF (0x0C), which does appear
in a good deal of ASCII text. So it's not strictly true that SCSU
encodes ASCII as ASCII, but it's very close.
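
To make the SCSU rule concrete, here is a minimal Python sketch. It assumes pure-ASCII input and an encoder that stays in SCSU's initial single-byte mode; the function name scsu_encode_ascii is made up for illustration, and this is not a general SCSU encoder.

```python
# Sketch only: pure-ASCII input, SCSU initial state (single-byte mode).
# The function name is hypothetical, not taken from any library.

SQ0 = 0x01                                 # "quote from static window 0" tag
PASS_THROUGH_C0 = {0x00, 0x09, 0x0A, 0x0D}  # NUL, HT, LF, CR


def scsu_encode_ascii(text: str) -> bytes:
    out = bytearray()
    for b in text.encode("ascii"):         # raises on non-ASCII input
        if b >= 0x20 or b in PASS_THROUGH_C0:
            out.append(b)                  # represented as itself
        else:
            out.append(SQ0)                # other C0 controls are tag bytes,
            out.append(b)                  # so they must be quoted with SQ0
    return bytes(out)
```

With this sketch, ordinary text with tabs and line breaks comes out unchanged, while a form feed grows from one byte to two (0x01 0x0C).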
By contrast, BOCU-1 encodes *only* the C0 controls and SPACE as
themselves; all other ASCII characters are not represented as themselves
(and some are represented with different bytes in the ASCII range). You
can convert pure-ASCII text to BOCU-1 by leaving the spaces and controls
alone and adding 0x50 to everything else. (This is a gross
oversimplification, of course.)
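
The rule of thumb above can be sketched in Python as follows. This is only the simplified rule as stated, valid for pure-ASCII input; it is not the full BOCU-1 algorithm (no difference encoding of larger code points, no state handling beyond ASCII, no signature), and the function name bocu1_encode_ascii is hypothetical.

```python
# Sketch only: the "add 0x50" shortcut for pure-ASCII input.
# Not a real BOCU-1 encoder; the function name is hypothetical.

def bocu1_encode_ascii(text: str) -> bytes:
    out = bytearray()
    for b in text.encode("ascii"):   # raises on non-ASCII input
        if b <= 0x20:
            out.append(b)            # C0 controls and SPACE stay as they are
        else:
            out.append(b + 0x50)     # 0x21..0x7F becomes 0x71..0xCF
    return bytes(out)
```

Under this rule "A" (0x41) becomes 0x91, while "!" (0x21) becomes 0x71, which is the ASCII byte for "q". That is the sense in which some characters end up as different bytes that are still in the ASCII range.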
UTF-7 is somewhere in the middle. Most ASCII characters are represented
with the same bytes, but some printable ASCII characters require an

BTW, in a couple of other messages you referred to "CESU" when I'm
pretty sure you meant SCSU. Don't drag CESU-8 into this discussion.
CESU-8 is a hack which applies the UTF-8 conversion to UTF-16 code units
instead of Unicode scalar values. It's not a compression format.
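
For illustration, a minimal Python sketch of that description: convert to UTF-16 first, then apply the UTF-8 bit layout to each 16-bit code unit separately. The function name cesu8_encode is hypothetical, and the sketch assumes the input string contains no lone surrogates.

```python
import struct

# Sketch only: CESU-8 as "UTF-8 bit patterns applied to UTF-16 code units".
# The function name is hypothetical, not taken from any library.

def cesu8_encode(text: str) -> bytes:
    raw = text.encode("utf-16-be")                    # no BOM, big-endian
    units = struct.unpack(f">{len(raw) // 2}H", raw)  # 16-bit code units
    out = bytearray()
    for u in units:
        if u < 0x80:
            out.append(u)                             # one byte
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6),            # two bytes
                          0x80 | (u & 0x3F)])
        else:
            out += bytes([0xE0 | (u >> 12),           # three bytes, used for
                          0x80 | ((u >> 6) & 0x3F),   # each surrogate too
                          0x80 | (u & 0x3F)])
    return bytes(out)
```

For BMP characters the result is identical to UTF-8. For a supplementary character such as U+10400, UTF-8 gives the four bytes F0 90 90 80, while this sketch gives six bytes, ED A0 81 ED B0 80, one three-byte sequence per surrogate.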
> One could argue that all *precisely defined* legacy character
> encodings (this includes the new GB2312 encoding) that work on subsets
> of Unicode are Unicode conformant, as they are encoding forms for
> their equivalent Unicode strings.
Again, be careful: the new Chinese standard you are thinking of is GB
18030. It is backward compatible with the older GB 2312.