UTF-8 as character set (was: Java and UTF)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 02 1997 - 16:19:53 EDT


Pierre commented as an aside:

> Still a bit strange to find UTF-8 (a transform, ie. an algorithm)
> besides MacThai (an encoding, ie. a table). But, semantic subtleties
> aside, it's there.

This speaks to a subtle distinction that is not always
made. UTF-8 as defined in the standard (10646, Amendment 2)
is a transformation format. In the Internet terminology promulgated
by the Internet Architecture Board it is a "CES" (Character Encoding
Scheme) as opposed to a "CCS" (Coded Character Set).

However, in actual usage, people are treating UTF-8 as a character set,
parallel to MacThai (or ISO 8859-1, or ...)
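
As a rough illustration of the distinction (a sketch only, assuming Python's
built-in codec names "utf-8", "utf-16-be" and "utf-32-be" as stand-ins for the
encoding schemes, with "utf-32-be" only approximating UCS-4): the coded
character set assigns one number to a character, and the encoding scheme
decides which bytes carry that number.

```python
# One character has one number in the coded character set (CCS) ...
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)

# ... but a different byte sequence under each character encoding scheme (CES).
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    print(f"U+{code_point:04X} as {name}: {data.hex()}")
```

The number (0xE9) never changes; only the byte serialization does.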

The following diagram may give a more complete picture of the situation.

Standard(s)/PAS          Repertoire    CCS      CES      Short Tag

ISO 8859-1:1987          Latin-1       8859-1   8bit-I   iso8859-1
ISO 10646-1:1993         all           10646    UCS-4    ucs4
ISO 10646-1:1993, USV2   all           10646    UTF-8    utf8
ISO 10646-1:1993, USV2   all           10646    UTF-16   utf16
ISO 10646-1:1993         BMP           10646    UCS-2    ucs2
CDRA (CCSID 00850)       CS01106       CP850    8bit     cp850
CDRA (CCSID 00037)       CS00697       CP037    8bit-E   cp037
CDRA (CCSID 00938)       CS00103+      CP904+   DBCS-M   cp938
                         CS00935       CP927
CDRA (CCSID 00937)       CS01175+      CP037+   SISO     cp937
                         CS00935       CP835
Mac OS Cyrillic          Mac Cyrillic  MacCyr   8bit     mac-cyr

Some definitions of terms:

PAS: publicly available specification
CDRA: Character Data Representation Architecture (IBM's standard)
CCSID: coded character set identifier
CS: character set (IBM's term for a repertoire)
CP: code page (IBM's term for a coded character set)
USV2: The Unicode Standard, Version 2
BMP: Basic Multilingual Plane

8bit-I: 8-bit encoding, following the ISO architecture (no C0/C1 graphic chars)
8bit: Full 8-bit encoding
8bit-E: 8-bit EBCDIC encoding
DBCS-M: Shift-DBCS encoding
SISO: IBM host SI/SO run-encoded DBCS

Character set "conversions" actually refer to entire rows of the
diagram above. For each row, the information I need to know is:

  A. What is the relevant standard I can refer to for definitive information?
  B. What is the exact repertoire of characters covered?
  C. What numbers are assigned to those characters?
  D. How is the stream of bytes (octets) in encoded data associated
        with the numbers representing those characters?
  E. What unambiguous term can I use to refer to the association of B, C & D?
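
One way to picture items A through E is as the fields of a record, one record
per row of the diagram. The following is only a sketch: the class name
CharsetRow, its field names, and the BY_TAG lookup are hypothetical names
chosen for illustration, and only a few rows are copied in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharsetRow:
    standard: str    # A. the standard or PAS to consult
    repertoire: str  # B. which characters are covered
    ccs: str         # C. which numbers are assigned to them
    ces: str         # D. how those numbers become bytes
    tag: str         # E. unambiguous name for the association of B, C and D


ROWS = [
    CharsetRow("ISO 8859-1:1987", "Latin-1", "8859-1", "8bit-I", "iso8859-1"),
    CharsetRow("ISO 10646-1:1993, USV2", "all", "10646", "UTF-8", "utf8"),
    CharsetRow("ISO 10646-1:1993, USV2", "all", "10646", "UTF-16", "utf16"),
    CharsetRow("CDRA (CCSID 00037)", "CS00697", "CP037", "8bit-E", "cp037"),
]

# Look up a whole row by its short tag.
BY_TAG = {row.tag: row for row in ROWS}

print(BY_TAG["utf8"])
```

Note that "utf8" and "utf16" share fields B and C and differ only in D, which
is why the tag has to name the whole row rather than any single column.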

That all of this information is required can be determined by a little
research into the CDRA or some experience working with actual
"character sets". The same IBM "code page" can be associated with
different repertoires. The same IBM repertoire can be encoded differently
(Revised Code Page 37 and Code Page 500 have the same repertoires, which also
match Latin-1). And for Unicode, the same repertoire and coded character
set can be (and often are) represented with different encoding schemes:
UTF-16 and UTF-8. And there is the well-known problem of different
"Shift-JIS" character sets, which contain different repertoires.

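The Code Page 37 / Code Page 500 case can be checked with a short sketch. It
assumes that the "cp037" and "cp500" codecs shipped with Python reflect the
code pages discussed here, which is an assumption about those tables rather
than something taken from the CDRA itself.

```python
all_bytes = bytes(range(256))

# Repertoire: the set of characters each code page can represent.
repertoire_037 = set(all_bytes.decode("cp037"))
repertoire_500 = set(all_bytes.decode("cp500"))

# Byte values on which the two code pages place different characters.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp037") != bytes([b]).decode("cp500")
]

print("same repertoire:", repertoire_037 == repertoire_500)
print("byte for '[':", "[".encode("cp037").hex(), "vs", "[".encode("cp500").hex())
print("byte values that differ:", [hex(b) for b in differing])
```

The repertoire comparison and the per-byte comparison answer questions B and
C/D separately, which is the point of keeping those columns apart.
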
So if we return to the original concern and look at the diagram above,
"UTF-8" is a character encoding scheme, but "utf8" as the tag value for
a row in the diagram can be taken as a valid value for a conversion
object, parallel to "iso8859-1", "cp037", or "mac-cyr".
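
A conversion keyed on such tags might look like the sketch below. The
TAG_TO_CODEC table and the convert function are hypothetical; the mapping from
the diagram's short tags to Python codec names is my own assumption, including
the choice of big-endian UTF-16 without a byte order mark for "utf16".

```python
# Hypothetical mapping from the diagram's short tags to Python codec names.
TAG_TO_CODEC = {
    "iso8859-1": "latin-1",
    "utf8": "utf-8",
    "utf16": "utf-16-be",  # assumption: big-endian, no byte order mark
    "cp850": "cp850",
    "cp037": "cp037",
    "mac-cyr": "mac_cyrillic",
}


def convert(data: bytes, src_tag: str, dst_tag: str) -> bytes:
    """Convert bytes from one row of the diagram to another, by short tag."""
    text = data.decode(TAG_TO_CODEC[src_tag])   # bytes -> characters
    return text.encode(TAG_TO_CODEC[dst_tag])   # characters -> bytes


print(convert(b"\xe9", "iso8859-1", "utf8").hex())
```

Here "utf8" is used exactly like "cp037" or "mac-cyr": as the name of a whole
row, not of the encoding scheme alone.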

IBM has the most detailed and correctly applied terminology for repertoires
and coded character sets (code pages), and also has unique identifiers (the
CCSIDs) for each association (i.e. row). Ironically, it is 10646, with
its introduction of the "forms of encoding" and the "transformation
formats", that has really highlighted the CES issue. That issue has been
lurking all along in character set technology, but was often
ignored since the 7bit and 8bit encoding schemes are trivial and
since each standard usually allowed only a single encoding scheme.

--Ken Whistler


