Re: Unicode conformant character encodings and us-ascii

From: Doug Ewell (dewell@adelphia.net)
Date: Fri May 16 2003 - 02:23:02 EDT

Next message: Doug Ewell: "Re: 8-bit encodings and ASCII (was: Unicode conformant character encodings and us-ascii)"

Previous message: Mark Davis: "Proposed Update of UTR #18: Unicode Regular Expressions"
In reply to: Philippe Verdy: "Re: Unicode conformant character encodings and us-ascii"
Next in thread: Peter_Constable@sil.org: "Re: Unicode conformant character encodings and us-ascii"

Philippe Verdy <verdy_p at wanadoo dot fr> wrote:

> Don't forget other Unicode encoding forms: UTF-7, BOCU and SCSU also
> assign code units in the ASCII range. This is still the same Unicode
> encoding (for codepoints), but definitely not the same code units.

SCSU does use the ASCII bytes for the ASCII range, except for the
lesser-used C0 control characters. Specifically, NUL (0x00), HT (0x09),
LF (0x0A), and CR (0x0D) are represented as themselves in SCSU, while
other C0 controls are preceded by the SQ0 tag, 0x01.

Unfortunately, that "other" list includes FF (0x0C), which does appear
in a good deal of ASCII text. So it's not strictly true that SCSU
encodes ASCII as ASCII, but it's very close.
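The rule described above can be sketched in a few lines of Python. This is only an illustration: the function name is made up here, and it assumes the input is pure ASCII and that the encoder never leaves SCSU's initial single-byte mode. It is not a general SCSU encoder.

```python
# C0 controls that SCSU passes through unchanged in single-byte mode.
SCSU_PASSTHROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}
SQ0 = 0x01  # tag that quotes the next byte from static window 0


def ascii_to_scsu(data: bytes) -> bytes:
    """Convert pure-ASCII bytes to SCSU (hypothetical helper, ASCII only)."""
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("pure ASCII input expected")
        # Any other C0 control has to be quoted with SQ0.
        if b < 0x20 and b not in SCSU_PASSTHROUGH_CONTROLS:
            out.append(SQ0)
        out.append(b)
    return bytes(out)
```

Ordinary text with CR and LF comes out unchanged, while a form feed grows from one byte to two.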

By contrast, BOCU-1 encodes *only* the C0 controls and SPACE as
themselves; all other ASCII characters are not represented as themselves
(and some are represented with different bytes in the ASCII range). You
can convert pure-ASCII text to BOCU-1 by leaving the spaces and controls
alone and adding 0x50 to everything else. (This is a gross
oversimplification, of course.)
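For what it's worth, the shortcut above looks like this in Python. The function name is invented for the example, and the sketch is valid only for pure-ASCII input; real BOCU-1 tracks a "previous character" state and uses multi-byte differences, none of which is modelled here.

```python
def ascii_to_bocu1(data: bytes) -> bytes:
    """Convert pure-ASCII bytes to BOCU-1 (hypothetical helper, ASCII only)."""
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("pure ASCII input expected")
        # C0 controls and SPACE are kept; everything else is shifted by 0x50.
        out.append(b if b <= 0x20 else b + 0x50)
    return bytes(out)
```

Note that "!" (0x21) becomes 0x71, which is the ASCII byte for "q": a different byte that is still in the ASCII range.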

UTF-7 is somewhere in the middle. Most ASCII characters are represented
with the same bytes, but some printable ASCII characters require an
escape sequence. The plus sign, for example, is the shift character, so
a literal "+" has to be written as "+-".
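One way to see which printable ASCII characters are affected is to ask an existing UTF-7 encoder. The sketch below uses Python's built-in "utf-7" codec; the helper name is made up, and the exact list depends on the encoder, since UTF-7 leaves some characters optional, so treat the result as one implementation's choice rather than a definition.

```python
def utf7_escaped_ascii():
    """Printable ASCII characters that Python's UTF-7 codec does not
    emit as the same single byte (hypothetical helper)."""
    escaped = []
    for code in range(0x20, 0x7F):
        ch = chr(code)
        if ch.encode("utf-7") != ch.encode("ascii"):
            escaped.append(ch)
    return escaped


# "+" is the shift character, so it is always among the escaped ones.
print(utf7_escaped_ascii())
print("a+b".encode("utf-7"))
```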

BTW, in a couple of other messages you referred to "CESU" when I'm
pretty sure you meant SCSU. Don't drag CESU-8 into this discussion.
CESU-8 is a hack which applies the UTF-8 conversion to UTF-16 code units
instead of Unicode scalar values. It's not a compression format.
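To make the difference concrete, here is a minimal Python sketch of that conversion. Python has no CESU-8 codec that I know of, so the function below is a hypothetical helper: it splits each supplementary character into its UTF-16 surrogate pair and runs the UTF-8 bit layout over each surrogate, using the "surrogatepass" error handler to let the surrogates through.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, for illustration)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, as CESU-8 does.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


# U+10000 takes four bytes in UTF-8 but six bytes in CESU-8.
print("\U00010000".encode("utf-8").hex())
print(cesu8_encode("\U00010000").hex())
```

For text confined to the BMP the two outputs are identical, which is also why it does no compression at all.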

> One could argue that all *precisely defined* legacy character
> encodings (this includes the new GB2312 encoding) that work on subsets
> of Unicode are Unicode conformant, as they are encoding forms for
> their equivalent Unicode strings.

Again, be careful: the new Chinese standard you are thinking of is GB
18030. It is backward compatible with the older GB 2312.
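The backward compatibility is easy to spot-check with Python's built-in "gb2312" and "gb18030" codecs. This is only a sanity check on a sample string of my choosing, not a proof, and the helper name is invented for the example.

```python
def same_bytes_in_both(text: str) -> bool:
    """True if text encodes to identical bytes in GB 2312 and GB 18030
    (hypothetical helper)."""
    return text.encode("gb2312") == text.encode("gb18030")


sample = "\u4e2d\u6587"  # two characters that are in GB 2312
print(same_bytes_in_both(sample))

# A supplementary-plane character is outside GB 2312 but fits in GB 18030.
emoji = "\U0001F600"
try:
    emoji.encode("gb2312")
except UnicodeEncodeError:
    print("not encodable in GB 2312")
print(len(emoji.encode("gb18030")), "bytes in GB 18030")
```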

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Fri May 16 2003 - 03:20:36 EDT