Revised ISO 2022 explanation

From: John Cowan (cowan@drv.cbc.com)
Date: Thu Nov 06 1997 - 17:55:32 EST


WARNING: Delete this now if you care nothing about ISO 2022.
If you are interested in it, please send commentary to me
at <cowan@ccil.org>. Thanks to Martin J. Dürst and Glenn Adams
for their generous reviews of my earlier version.

===start here===

This document is a loose, but still fairly technical, explanation of the
international character set standard ISO 2022:1994, "Character Code Structure
and Extension Techniques". The standard itself, under its alternative
name of ECMA-35, is freely available on the Web in PDF and Microsoft Word
formats at http://www.ecma.ch/stand/ecma-035.HTM .

However, it is not easy to read unless you are already very familiar with
the world of ISO character set standards. I have therefore attempted to
make a clearer, if less precise, document that will help people who
are new to ISO 2022 to understand it. Of course, this document is
not a substitute for the standard.

Furthermore, ISO 2022 is an extremely rich and complex standard, with a very
large number of options, many of which have never been used in practice,
and only a few of which are typically exercised at one time.
Therefore, certain elements which are rarely or never used in practice
are mentioned only in passing or are omitted outright.

In ISO 2022, text is specified using a mixture of registered character sets.
An international registry is maintained by ECMA on behalf of ISO; the
names of the registered sets are available on the Web in Microsoft Word format
at ftp://diku.dk/JTC1/SC2/WG3/rag01.doc.

Character sets come in one of three sizes: single-byte character sets
with 94 characters (e.g. ASCII), single-byte character sets with 96
characters (e.g. the top halves of ISO Latin-1 to Latin-5), or
double-byte character sets with 94 x 94 characters (e.g. JIS X 0208-1983).

(In principle, 96 x 96 character sets could also exist, as well as
94 x 94 x 94, 96 x 96 x 96, and even larger possibilities, but no one has
ever needed them.)

Each registered character set is specified by a byte,
known as the "final byte", in the range 64 to 125. The bytes are unique within
character set sizes, but are reused across sizes. For example, the byte 66
represents the 94-character set ASCII, the 96-character set ISO Latin-2 (top
half), and the 94 x 94 Japanese character set JIS X 0208-1983. In this
document, the final byte assigned to a character set will be represented by F.

The F value of 126 represents a character set with no characters.
F values in the range 48 to 63 represent private-use character sets,
which are not registered and must be defined by private agreements.
If the range of F values ever runs out, the standard defines a method
for using multiple bytes instead. Any multiple-byte sequence beginning
with a 32 byte is also for private use.
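
As a rough illustration of the ranges just described, the following Python
sketch classifies a single-byte F value. The function name and the returned
labels are hypothetical, chosen only for this example; the multiple-byte
extension mechanism is not handled.

```python
def classify_final_byte(f):
    """Classify a one-byte F value using the ranges given in the text."""
    if 48 <= f <= 63:
        return "private-use"   # defined by private agreement, not registered
    if 64 <= f <= 125:
        return "registered"    # assigned in the international registry
    if f == 126:
        return "empty"         # the character set with no characters
    return "invalid"           # not usable as a one-byte F value
```

For instance, classify_final_byte(66) reports "registered", which matches
the use of byte 66 for ASCII above.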

Character sets (and other things) are specified using escape sequences.
An escape sequence is a series of bytes beginning with the byte 27
(ESC), followed by any number of bytes in the range 32-47, followed by
a single byte in the range 48-126. No other bytes are allowed within
an escape sequence.
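
The structure of an escape sequence can be sketched in Python as follows.
This is a minimal illustration, not part of the standard; the function name
and its return shape are assumptions made for this example.

```python
ESC = 27

def read_escape_sequence(data, start=0):
    """Return (intermediates, final, next_index) for the sequence at start."""
    if start >= len(data) or data[start] != ESC:
        raise ValueError("no ESC byte at this position")
    i = start + 1
    # Zero or more intermediate bytes in the range 32-47.
    while i < len(data) and 32 <= data[i] <= 47:
        i += 1
    # Exactly one terminating byte in the range 48-126.
    if i >= len(data) or not 48 <= data[i] <= 126:
        raise ValueError("escape sequence lacks a terminating byte in 48-126")
    return bytes(data[start + 1:i]), data[i], i + 1
```

Given the bytes 27 36 40 68, the function returns the intermediates 36 40,
the terminating byte 68, and the index 4 at which ordinary text resumes.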

At any one time, up to four character sets are available for use.
The four slots that hold character sets are labeled G0, G1, G2, and G3.
The following escape sequences (where ESC = the byte 27) designate
which character sets occupy which slots:

        27 40 F       Set G0 to the 94-character set F
        27 41 F       Set G1 to the 94-character set F
        27 42 F       Set G2 to the 94-character set F
        27 43 F       Set G3 to the 94-character set F
        27 45 F       Set G1 to the 96-character set F
        27 46 F       Set G2 to the 96-character set F
        27 47 F       Set G3 to the 96-character set F
        27 36 F       Set G0 to the 94 x 94 character set F,
                      provided F is 64, 65, or 66
        27 36 40 F    Set G0 to the 94 x 94 character set F,
                      provided F is not 64, 65, or 66
        27 36 41 F    Set G1 to the 94 x 94 character set F
        27 36 42 F    Set G2 to the 94 x 94 character set F
        27 36 43 F    Set G3 to the 94 x 94 character set F

Note that G0 can never be a 96-character set.
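
The designation sequences can be recognized with a small lookup, sketched
here in Python. The names SINGLE_BYTE, DOUBLE_BYTE, and parse_designation
are hypothetical. The sketch follows the ISO 2022 assignments, in which the
intermediate bytes 40 to 43 and 45 to 47 select G0 to G3 and G1 to G3
respectively. As a leniency assumed for this example only, it accepts the
form 27 36 40 F for any F.

```python
# Intermediate byte -> (slot, set size) for single-byte sets.
SINGLE_BYTE = {
    40: ("G0", "94"), 41: ("G1", "94"), 42: ("G2", "94"), 43: ("G3", "94"),
    45: ("G1", "96"), 46: ("G2", "96"), 47: ("G3", "96"),
}
# Second intermediate byte (after 36) -> slot for 94 x 94 sets.
DOUBLE_BYTE = {40: "G0", 41: "G1", 42: "G2", 43: "G3"}

def parse_designation(seq):
    """Return (slot, size, F) for a designation sequence, else None."""
    seq = bytes(seq)
    if len(seq) == 3 and seq[0] == 27:
        if seq[1] in SINGLE_BYTE and 48 <= seq[2] <= 126:
            slot, size = SINGLE_BYTE[seq[1]]
            return slot, size, seq[2]
        # Short form: 27 36 F designates G0, only for F = 64, 65, 66.
        if seq[1] == 36 and seq[2] in (64, 65, 66):
            return "G0", "94x94", seq[2]
    if len(seq) == 4 and seq[0] == 27 and seq[1] == 36:
        # Assumption: no check that F differs from 64-66 in the G0 long form.
        if seq[2] in DOUBLE_BYTE and 48 <= seq[3] <= 126:
            return DOUBLE_BYTE[seq[2]], "94x94", seq[3]
    return None
```

For example, the bytes 27 40 66 are reported as G0, a 94-character set, with
F = 66 (ASCII), and 27 36 66 as G0, a 94 x 94 set, with F = 66.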

ISO 2022 decoding specifies the interpretation of bytes in the ranges
33 to 126 and 160 to 255, known as the "GL area" and the "GR area"
respectively. The character set in slot G0 can be placed into the GL area,
and the character sets in slots G1, G2, and G3 can be placed into
either the GL area or the GR area, by the use of the following control
characters and escape sequences:

        15        Interpret bytes in the GL area as G0 characters
        14        Interpret bytes in the GL area as G1 characters
        27 110    Interpret bytes in the GL area as G2 characters
        27 111    Interpret bytes in the GL area as G3 characters
        27 126    Interpret bytes in the GR area as G1 characters
        27 125    Interpret bytes in the GR area as G2 characters
        27 124    Interpret bytes in the GR area as G3 characters
        142       Interpret the next byte only as a G2 character; if G2
                  is a 94 x 94 character set, interpret
                  the next two bytes as a G2 character
        27 78     Same as 142
        143       Interpret the next byte only as a G3 character; if G3
                  is a 94 x 94 character set, interpret
                  the next two bytes as a G3 character
        27 79     Same as 143
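
To show how these shifts interact, here is a simplified Python sketch that
reports which slot interprets each GL or GR byte. It follows ISO 2022, in
which byte 15 (SI) selects G0 and byte 14 (SO) selects G1 for the GL area.
Several things are assumptions of this example and not requirements of the
standard: the names LOCKING, SINGLE, and slots_for_bytes; the starting state
(G0 in GL, G1 in GR); that every set is single-byte; and that the input
contains no escape sequences other than the ones listed.

```python
# Shifts that stay in effect until changed: sequence -> (area, slot).
LOCKING = {
    (15,): ("GL", "G0"), (14,): ("GL", "G1"),
    (27, 110): ("GL", "G2"), (27, 111): ("GL", "G3"),
    (27, 126): ("GR", "G1"), (27, 125): ("GR", "G2"),
    (27, 124): ("GR", "G3"),
}
# Shifts that affect only the next character: sequence -> slot.
SINGLE = {(142,): "G2", (27, 78): "G2", (143,): "G3", (27, 79): "G3"}

def slots_for_bytes(data, gl="G0", gr="G1"):
    """Return a list of (byte, slot) pairs for the graphic bytes in data."""
    data = bytes(data)
    out, single, i = [], None, 0
    while i < len(data):
        for n in (2, 1):                     # try two-byte sequences first
            key = tuple(data[i:i + n])
            if key in LOCKING:
                area, slot = LOCKING[key]
                if area == "GL":
                    gl = slot
                else:
                    gr = slot
                i += len(key)
                break
            if key in SINGLE:
                single = SINGLE[key]
                i += len(key)
                break
        else:                                # no shift matched at position i
            b = data[i]
            if 33 <= b <= 126 or 160 <= b <= 255:
                out.append((b, single or (gl if b < 128 else gr)))
                single = None                # a single shift is used up
            i += 1
    return out
```

With the input 65 14 65 15 65, the three 65 bytes are attributed to G0, G1,
and G0 in turn.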

The bytes 32 and 127 are always interpreted as SPACE and DELETE respectively,
unless a 96-character set has been placed in the GL area, in which case
they are interpreted as the appropriate characters of that set.

Bytes 0 to 31, called the "CL area", and bytes 128 to 159, called the
"CR area", are always interpreted as control characters. Alternative sets
of control characters have been registered, and escape sequences are
defined by the standard to specify which control character sets are in
effect, but these are rarely used. For convenience in a 7-bit environment,
the escape sequences 27 64 through 27 95 are exactly equivalent to
the control characters currently in effect in the CR area.
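
The 7-bit equivalence amounts to adding 64 to the second byte: 27 64
corresponds to byte 128, and 27 95 to byte 159. A minimal Python sketch,
with hypothetical function names:

```python
def c1_from_escape(final):
    """Map the second byte of a 27 64 .. 27 95 sequence to its CR-area byte."""
    if not 64 <= final <= 95:
        raise ValueError("second byte must be in the range 64-95")
    return final + 64

def escape_from_c1(control):
    """Map a CR-area byte (128-159) to its two-byte 7-bit escape sequence."""
    if not 128 <= control <= 159:
        raise ValueError("control byte must be in the range 128-159")
    return bytes([27, control - 64])
```

This agrees with the table above, where 27 78 is the same as 142 and 27 79
is the same as 143.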

This rich schema may be used in various ways. In ISO-2022-JP, the Japanese
flavor of ISO 2022, only the bytes 33-126 and the G0 character set are used,
and escape sequences are used to switch between ASCII, ISO-646-JP (the
Japanese national variant of ASCII), and JIS X 0208-1983. In other versions,
the G1 character set has 94 x 94 size, and any byte in the range 160-255
is automatically the first byte of a double-byte character.
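
The ISO-2022-JP behavior can be observed with the "iso2022_jp" codec that
ships with Python. The sample string below is an arbitrary choice made for
this illustration.

```python
# Three ASCII letters, a space, and three kanji (written as escapes so the
# source file itself stays pure ASCII).
text = "ABC \u65e5\u672c\u8a9e"

encoded = text.encode("iso2022_jp")

# Every byte stays in the 7-bit range; the kanji are reached by designating
# a 94 x 94 set to G0 (27 36 66) and ASCII is restored with 27 40 66.
uses_only_7_bits = all(b < 128 for b in encoded)
switches_to_kanji = b"\x1b$B" in encoded
returns_to_ascii = encoded.endswith(b"\x1b(B")

decoded = encoded.decode("iso2022_jp")
```

The ASCII prefix is emitted without any escape sequence, and decoding the
bytes yields the original string again.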

===end here===

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
			e'osai ko sarji la lojban


