Re: ISO 2022

From: Glenn Adams (glenn@cam.spyglass.com)
Date: Thu Oct 23 1997 - 12:44:37 EDT


>Character sets
>have one of three sizes: single-byte character sets with 94 characters
>(e.g. ASCII), single-byte character sets with 96 characters (e.g. the top
>halves of ISO Latin-1 to Latin-5), or double-byte character sets with
>94 x 94 characters (e.g. JIS 0208X-1983).

multi-byte character sets are not limited to two bytes per character;
the column of the final character of the designating sequence determines the
number of bytes as follows:

col   # bytes
3     2 or more (private use, non-standard final)
4     2 (standard final)
5     2 (standard final)
6     3 (standard final)
7     4 or more (standard final)
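
As a rough illustration of the column table, the lookup can be sketched in Python. The function name `multibyte_width` and its return shape are assumptions made for this example; they are not part of ISO 2022.

```python
from typing import Tuple


def multibyte_width(final: int) -> Tuple[int, bool]:
    """Return (byte_count, is_exact) for the final byte of a multi-byte designation.

    is_exact is False for the rows that read "or more".
    """
    if not 0x30 <= final <= 0x7E:
        raise ValueError(f"not a final byte: {final:#04x}")
    column = final >> 4      # the high nibble is the column number
    if column == 3:
        return (2, False)    # private use: 2 or more bytes
    if column in (4, 5):
        return (2, True)
    if column == 6:
        return (3, True)
    return (4, False)        # column 7: 4 or more bytes
```

For example, a final byte of 4/2 (0x42) falls in column 4 and therefore denotes a two-byte set.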

also, 96x96 character sets are possible and may be designated to
G1-G3 via:

ESC 2/4 2/13 2/0 (Im) F    multi-byte 96-character G1 DRCS
ESC 2/4 2/13 (Im In) F     multi-byte 96-character G1
ESC 2/4 2/14 2/0 (Im) F    multi-byte 96-character G2 DRCS
ESC 2/4 2/14 (Im In) F     multi-byte 96-character G2
ESC 2/4 2/15 2/0 (Im) F    multi-byte 96-character G3 DRCS
ESC 2/4 2/15 (Im In) F     multi-byte 96-character G3

DRCS = dynamically redefinable character set (when used, Im and F
are not standardized, but are private)

>Each registered character set has
>a standard designating byte in the range 48 to 125; the bytes are
>unique within character set sizes, but may be reused across sizes.

What you refer to as the 'designating byte' is the 'final byte',
which may be from 3/0 (48) to 7/14 (126); final bytes in the range
3/0 to 3/15 are for private use and are referred to with the symbol
"Fp"; final bytes in the range 4/0 to 7/14 are for standardized use
and are referred to with the symbol "Ft". In addition, one or more
preceding intermediate bytes in the range 2/0 to 2/15 may be specified
for standard use to extend the number of identified character sets,
and are referred to with the symbol "In". Note that with DRCS
designations, the suffix bytes (In F) are always interpreted as private
use.
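
The byte ranges just described can be checked mechanically. The following Python is a minimal sketch; the helper names (`classify_final`, `is_intermediate`, `split_escape`) are hypothetical, and it treats every sequence simply as ESC, zero or more intermediate bytes, then one final byte.

```python
ESC = 0x1B


def classify_final(byte: int) -> str:
    """Return "Fp" for private-use finals (3/0-3/15), "Ft" for standardized ones (4/0-7/14)."""
    if 0x30 <= byte <= 0x3F:
        return "Fp"
    if 0x40 <= byte <= 0x7E:
        return "Ft"
    raise ValueError(f"not a final byte: {byte:#04x}")


def is_intermediate(byte: int) -> bool:
    """Intermediate bytes occupy 2/0 through 2/15."""
    return 0x20 <= byte <= 0x2F


def split_escape(seq: bytes):
    """Split an escape sequence into (intermediates, final, final_class)."""
    if len(seq) < 2 or seq[0] != ESC:
        raise ValueError("sequence must start with ESC and contain a final byte")
    *intermediates, final = seq[1:]
    if not all(is_intermediate(b) for b in intermediates):
        raise ValueError("only bytes 2/0-2/15 may precede the final byte")
    return bytes(intermediates), final, classify_final(final)
```

Note that the sketch does not model the DRCS rule, under which the suffix bytes are private use regardless of their range.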

>Initially, G0 is the 94-character set ASCII, and G1 is the 96-character
>set ISO Latin-1 (top half).

This is incorrect; ISO-2022 does not specify a default designation for
either G0 or G1. They have to be explicitly designated, implied
through a standardized announcer, or implied by a prior agreement between
two communicating parties (i.e., a higher level protocol). It is
true that many implementations assume what you say is true, but they
are assuming too much, technically speaking.

>The other character sets are unassigned.

The correct term is "undesignated" rather than "unassigned".

>ESC ( <D> Set G0 to the 94-character set <D>

You introduce a non-standard notation here. The standard notation
is:

ESC 2/8 (In Im) F designate 94-character G0 set

It is best to use the column/row notation (alternatively a hex
octet value) to emphasize that octets in escape sequences are
not interpreted as characters (i.e., 2/8 != '(').
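
For readers unused to column/row notation, the conversion to and from octet values is just a split into high and low nibbles. A small Python sketch follows; the function names `to_colrow` and `from_colrow` are made up for this example.

```python
def to_colrow(byte: int) -> str:
    """Format an octet in column/row notation, e.g. 0x28 -> "2/8"."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("octet out of range")
    return f"{byte >> 4}/{byte & 0x0F}"


def from_colrow(text: str) -> int:
    """Parse column/row notation; a leading zero as in "08/14" is accepted."""
    col_text, row_text = text.split("/")
    col, row = int(col_text), int(row_text)
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must be 0-15: {text!r}")
    return (col << 4) | row
```

So `from_colrow("2/8")` yields the octet 0x28, which merely happens to coincide with the ASCII code for '('.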

>ESC $ <D> Set G0 to the 94 x 94 character set <D>

This older sequence is now non-standard (as you point out). The
only standardized sequences using this form are:

ESC 2/4 4/0    JIS X 0208 1978 to 94x94 G0
ESC 2/4 4/1    GB 2312 1980 to 94x94 G0
ESC 2/4 4/2    JIS X 0208 1983 to 94x94 G0

These sequences were standardized in ISO 2022:1973, in which multi-byte
sets could be designated to G0 only.
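
Because only three such short sequences exist, a decoder could recognize them with a plain lookup table. The Python below is a sketch under that assumption; the names `LEGACY_G0_MULTIBYTE` and `legacy_designation` are hypothetical.

```python
# The three short-form designations listed above, keyed by their octets.
LEGACY_G0_MULTIBYTE = {
    b"\x1b\x24\x40": "JIS X 0208 1978",   # ESC 2/4 4/0
    b"\x1b\x24\x41": "GB 2312 1980",      # ESC 2/4 4/1
    b"\x1b\x24\x42": "JIS X 0208 1983",   # ESC 2/4 4/2
}


def legacy_designation(data: bytes, pos: int = 0):
    """Return the set designated to G0 by a short-form sequence at pos, or None."""
    return LEGACY_G0_MULTIBYTE.get(data[pos:pos + 3])
```

Any other multi-byte designation would need the longer form with an explicit intermediate byte, so this function returns None for it.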

>ISO 2022 decoding affects input bytes in the ranges 33 to 126 and
>160 to 255, known as "the left half" and "the right half" respectively.
>All other bytes, unless they belong to a control sequence shown in
>this document, remain unchanged.

This isn't true. In particular, other sets of control characters may
be designated and invoked to both C0 (0/0-1/15) and C1 (08/0-09/15) ranges.
Indeed, this is accomplished by the following:

ESC 2/1 (In Im) F designate and invoke to C0
ESC 2/2 (In Im) F designate and invoke to C1

The only restriction is that the following must hold for any sets of
control characters designated to C0:

1/11 remains ESCAPE (ESC)
1/14 remains SHIFT-OUT (SO)
1/15 remains SHIFT-IN (SI)

Also, since one can't designate a 96 character set to G0, 2/0 and
7/15 always remain SPACE and DEL, respectively.
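
The restriction on C0 sets amounts to three fixed positions, which is easy to verify. In this Python sketch a control set is modelled as a dict from octet to control name; that representation and the name `c0_set_is_acceptable` are assumptions for illustration only.

```python
# Positions that every C0 set must leave unchanged.
REQUIRED_C0 = {
    0x1B: "ESC",   # 1/11
    0x0E: "SO",    # 1/14
    0x0F: "SI",    # 1/15
}


def c0_set_is_acceptable(c0: dict) -> bool:
    """True if the candidate C0 set keeps ESC, SO and SI at their fixed positions."""
    return all(c0.get(code) == name for code, name in REQUIRED_C0.items())
```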

>SI (byte 15) Interpret the left half as G1 characters
>SO (byte 14) Interpret the left half as G0 characters

You have these reversed. SHIFT-OUT invokes the G1 set into GL (2/1-7/14);
SHIFT-IN invokes the G0 set into GL. Here "out" means "out from G0 to G1"
and "in" means "back in from G1 to G0".
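
The locking-shift behaviour can be sketched as a tiny state machine in Python. The function name `tag_gl_bytes` is hypothetical, and the sketch assumes G0 is invoked into GL at the start, which (as noted earlier) ISO 2022 itself does not guarantee.

```python
SO = 0x0E   # SHIFT-OUT: invoke G1 into GL
SI = 0x0F   # SHIFT-IN:  invoke G0 back into GL


def tag_gl_bytes(data: bytes):
    """Return (set_name, byte) pairs for each GL byte, tracking SO/SI."""
    gl = "G0"   # assumed initial state, agreed out of band
    tagged = []
    for b in data:
        if b == SO:
            gl = "G1"
        elif b == SI:
            gl = "G0"
        elif 0x21 <= b <= 0x7E:
            tagged.append((gl, b))
        # bytes outside GL are ignored in this sketch
    return tagged
```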

>SS2 (byte 142) Interpret next character only as G2
>ESC N Interpret next character only as G2
>SS3 (byte 143) Interpret next character only as G3
>ESC O Interpret next character only as G3

Note that ESC 4/14 (N) is simply the way of expressing C1
controls, here 08/14 (SS2), in a 7-bit environment; strictly speaking,
an 8-bit environment should prefer use of 08/14. The same applies
for ESC 4/15 (O) and 08/15 (SS3).

In addition, you say "interpret the next character as ..."; what
should be said is "interpret the next one or more bytes (octets)
as one character designated to G2 or G3".
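
Both points can be shown in a short Python sketch: the 7-bit form ESC 4/14 or ESC 4/15 maps to the 8-bit control by adding 0x40 to the second octet, and the shift then covers as many octets as one character of the target set occupies. The function `read_single_shift` and its `widths` parameter (bytes per character for G2 and G3) are assumptions made for this example.

```python
ESC, SS2, SS3 = 0x1B, 0x8E, 0x8F


def read_single_shift(data: bytes, pos: int, widths: dict):
    """If a single shift starts at pos, return (set_name, char_bytes, next_pos); else None."""
    b = data[pos]
    if b == ESC and pos + 1 < len(data) and data[pos + 1] in (0x4E, 0x4F):
        control = data[pos + 1] + 0x40   # ESC 4/14 -> 08/14, ESC 4/15 -> 08/15
        start = pos + 2
    elif b in (SS2, SS3):
        control = b
        start = pos + 1
    else:
        return None
    name = "G2" if control == SS2 else "G3"
    end = start + widths[name]           # one character may span several octets
    if end > len(data):
        raise ValueError("input ends inside a single-shifted character")
    return name, data[start:end], end
```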

>In ISO-2022-JP, the Japanese flavor of ISO 2022, only the bytes 33-126
>and the G0 character set is used, and escape sequences are used to
>switch between ASCII, ISO-646-JP (the Japanese national variant of ASCII),
>and JIS 0208X-1983.

More accurately speaking, ISO-2022-JP is an application of ISO-2022 for
Japanese locales. It is not a Japanese (i.e., JIS) flavor of ISO-2022.
Also, I think you mean to say that "only the bytes 33-126" are used in
the denotation of graphic characters (other than SPACE). Further, ISO-2022-JP
provides means for designating both 1978 and 1983 versions of JIS X 0208
independently.
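
For a concrete look at these designations, Python's standard `iso2022_jp` codec can be used. This is only a sketch of one implementation's behaviour; the sample text and variable names are chosen for the example.

```python
text = "\u65e5\u672c\u8a9e"                 # three kanji, written as escapes

# Encoding designates JIS X 0208 1983 to G0 (ESC 2/4 4/2), emits two
# 7-bit bytes per character, then designates ASCII again (ESC 2/8 4/2).
encoded = text.encode("iso2022_jp")

# The 1978 version has its own designation (ESC 2/4 4/0), which the
# decoder accepts independently of the 1983 one.
legacy = b"\x1b$@F|\x1b(B".decode("iso2022_jp")

# Every encoded byte stays in the 7-bit range.
all_seven_bit = all(b < 0x80 for b in encoded)
```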

>In other versions, the G1 character set has 94 x 94 size, and so any
>byte in the range 160-255 is automatically the first byte of a double-byte
>character.

This is not true in general. What is true is that for certain communities
of communicating entities, it is assumed that a double byte 94x94 character
set is designated to G1 and invoked to GR (right half). Other communities
may easily assume other initial conditions (e.g., a single byte 96 character
set is designated to G1 and invoked to GR).

The following additional documents should be reviewed by people interested
in learning about current practices, particularly in East Asia:

RFC1468 "Japanese Character Encoding for Internet Messages", by J. Murai,
        M. Crispin, and E. van der Poel

RFC1554 "ISO-2022-JP-2: Multilingual Extension of ISO-2022-JP", by
        M. Ohta and K. Handa

RFC1557 "Korean Character Encoding for Internet Messages", by U. Choi,
        K. Chon, and H. Park

RFC1922 "Chinese Character Encoding for Internet Messages", by H. F. Zhu,
        D. Y. Hu, Z. G. Wang, T. C. Kao, W. Ch. Chang, and M. Crispin

Regards
Glenn Adams


