Re: Unicode conformant character encodings and us-ascii

From: Doug Ewell (dewell@adelphia.net)
Date: Sat May 17 2003 - 17:09:15 EDT

Next message: Jim Allan: "Re: 8-bit encodings and ASCII (was: Unicode conformant character encodings and us-ascii)"

Previous message: Philippe Verdy: "Re: Decimal separator with more than one character?"
In reply to: Philippe Verdy: "Re: Unicode conformant character encodings and us-ascii"
Next in thread: Addison Phillips [wM]: "RE: Unicode conformant character encodings and us-ascii"

Philippe Verdy <verdy_p at wanadoo dot fr> wrote:

>> Are not BE and LE regarded as different encoding forms, making five
>> encoding forms (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE & UTF-32LE)?
>
> This was true up to Unicode 3.x, but not since Unicode 4.0.0 which
> makes a distinction between encoding forms (that are serialization of
> code points to an ordered list of fixed-width code units), and
> encoding schemes (that are serialization of these encoding forms to
> ordered streams of bytes, taking into account the byte ordering).

This distinction is not new with Unicode 4.0. It has been in place
since UTR #17, "Character Encoding Model," was first published in 1999.
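To make the form/scheme split concrete, here is a minimal Python sketch. The helper names utf16_code_units and serialize are my own, made up for illustration. The first step yields 16-bit code units with no byte order at all (the encoding form); only the second step picks a byte order and produces bytes (the encoding scheme).

```python
import struct

def utf16_code_units(text):
    """Encoding form: map code points to 16-bit code units (plain integers)."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            # Supplementary code point: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units

def serialize(units, big_endian):
    """Encoding scheme: serialize code units to bytes in a chosen byte order."""
    fmt = (">" if big_endian else "<") + "%dH" % len(units)
    return struct.pack(fmt, *units)

text = "A\U00010400"
units = utf16_code_units(text)            # one list of code units
utf16be = serialize(units, big_endian=True)
utf16le = serialize(units, big_endian=False)
print([hex(u) for u in units], utf16be.hex(), utf16le.hex())
```

The same list of code units gives two different byte streams, which is why UTF-16 is one encoding form but UTF-16BE and UTF-16LE are separate encoding schemes.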

> There are also other non-standardized encoding forms in Unicode
> Technical reports (BOCU and SCSU which use 8-bit code units like
> UTF-8, and use a trivial serialization to bytes), plus an additional
> encoding scheme (CESU, which generates streams of bytes from the
> UTF-16 encoding form).

SCSU and BOCU-1 are Transfer Encoding Syntaxes, not Character Encoding
Schemes and certainly not Character Encoding Forms. The distinction
really is important, even if it doesn't appear so at first.

BTW, another terminology point:

"BOCU" is the general name for the compression technique that involves
(a) encoding the difference of each code point from the previous, (b)
adjustment of the "previous" value to improve efficiency, and (c)
encoding the resulting difference in such a way as to preserve binary
ordering (and possibly also achieve other TES-like goals). "BOCU-1,"
described in Technical Note #6, is a specific implementation of the BOCU
technique.
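Steps (a) and (b) can be sketched in a few lines of Python. This is a toy illustration of the general technique only, not BOCU-1: the function names, the starting value 0x40, and the "middle of the 128-code-point block" adjustment are assumptions chosen for the sketch, and step (c), the order-preserving byte encoding of each difference, is left out entirely.

```python
def toy_bocu_deltas(text, initial_prev=0x40):
    """Steps (a) and (b): difference from an adjusted 'previous' value."""
    prev = initial_prev
    deltas = []
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - prev)          # (a) difference from previous
        prev = (cp & ~0x7F) + 0x40        # (b) assumed adjustment: block middle
    return deltas

def toy_bocu_undo(deltas, initial_prev=0x40):
    """Invert toy_bocu_deltas by replaying the same 'previous' adjustment."""
    prev = initial_prev
    chars = []
    for d in deltas:
        cp = prev + d
        chars.append(chr(cp))
        prev = (cp & ~0x7F) + 0x40
    return "".join(chars)

sample = "\u041f\u0440\u0438\u0432\u0435\u0442"   # a short Cyrillic word
deltas = toy_bocu_deltas(sample)
print(deltas)
```

After the first character, every difference for text within one small script stays close to zero, which is what makes a compact encoding in step (c) possible.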

> UTF-7 is also described in a Technical Report

No, it isn't. It's described in RFC 2152.
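For anyone who wants to look at the RFC 2152 format directly, Python ships a "utf-7" codec. The short sketch below decodes the sample string given in the RFC and round-trips it; the variable names are mine, and the exact bytes an encoder emits can vary between implementations, so only the decoded text is checked.

```python
# Sample from RFC 2152: "Hi Mom -<WHITE SMILING FACE>-!"
encoded = b"Hi Mom -+Jjo--!"

decoded = encoded.decode("utf-7")         # "+Jjo-" is a shifted base64 run
reencoded = decoded.encode("utf-7")       # implementation-chosen, ASCII-only
round_trip = reencoded.decode("utf-7")

print(repr(decoded), reencoded)
```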

> and is an encoding form (to 7-bit code units) with trivial encoding
> scheme to bytes (not so trivial, because parity bits can be set for
> the high bit for streamed transmission purpose, that can safely be
> changed/ignored, so there are equivalent bytes).

Neither UTF-7 nor SCSU nor BOCU-1 is an encoding form. Please, please
read UTR #17 or the equivalent Unicode 4.0 text and make sure you
understand the distinctions drawn therein.

> ISO2022, GB18030 and the new JIS standard are mostly viewed as
> encoding forms with a trivial identity encoding scheme because the
> code units are 8 bits.

ISO 2022 lives in a *completely different world* from Unicode/10646. It
is not a Unicode CEF, CES, or TES. It is best not to think of ISO 2022
and Unicode as being related in any way, even if there are ISO 2022
escape sequences to switch into and out of UTF-8.

What new JIS standard? Did I miss something else?

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Sat May 17 2003 - 18:03:19 EDT