|Authors||Ken Whistler (firstname.lastname@example.org), Mark Davis (email@example.com)|
This document clarifies a number of the terms used to describe character encodings, and where the different forms of Unicode fit in. It elaborates the Internet Architecture Board (IAB) three-layer "text stream" definitions into a five-layer structure.
This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Technical Report. It is a stable document and may be used as reference material or cited as a normative reference from another document.
A Unicode Technical Report (UTR) may contain either informative material or normative specifications, or both. Each UTR may specify a base version of the Unicode Standard. In that case, conformance to the UTR requires conformance to that version or higher.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/.
For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/
Please mail corrigenda and other comments to the author(s).
There are a number of inconsistencies and misunderstandings about just what Unicode is in the context of character encodings of all types. These have been highlighted by discussions about the process of registering "UTF-16BE" and "UTF-16LE" as IANA charsets for the Internet, as well as editorial problems resulting from the attempt to treat UTF-16 and UTF-8 uniformly in the revision of the text for the Unicode Standard, Version 3.0.
To clarify these matters, this document describes a model for the structure of character encodings. The character encoding model described here draws on the character architecture promoted by the IAB for use on the Internet. It also draws in part on the Character Data Representation Architecture (CDRA) used by IBM for organizing and cataloging its own vendor-specific array of character encodings. (For a list of common acronyms used in this text, see §9 Definitions and Acronyms.) The focus here is on clarifying how these models should be extended to cover the needs of the Unicode Standard and ISO/IEC 10646.
The IAB model, as defined in RFC 2130, makes three distinctions with respect to level: Coded Character Set (CCS), Character Encoding Scheme (CES), and Transfer Encoding Syntax (TES). However, to adequately cover the distinctions required for the character encoding model, five levels need to be defined. Two of these go beyond the three IAB levels: the repertoire, which is implicit in the IAB model, and an additional level between the CCS and the CES.
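The five levels can be summarized as: the abstract character repertoire, the set of abstract characters to be encoded; the Coded Character Set (CCS), a mapping from a repertoire to a set of non-negative integers; the Character Encoding Form (CEF), a mapping from the integers of a CCS to sequences of code units of a specified width; the Character Encoding Scheme (CES), a mapping of code units into serialized byte sequences; and the Transfer Encoding Syntax (TES), a reversible transform of encoded data.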
In addition, a Character Map (CM) is defined to be a mapping from an abstract character repertoire to a serialized sequence of bytes.
Note: If you have questions about any of the issues raised in this document, please consult the Unicode Frequently Asked Questions pages. These pages also contain an online definition of UTF-8.
A repertoire is defined as the set of abstract characters to be encoded, normally a familiar alphabet or symbol set. The word abstract just means that these objects are defined by convention, such as the 26 letters of the English alphabet, uppercase and lowercase forms.
Repertoires are unordered sets that come in two types: fixed and open. For most character encodings, the repertoire is fixed (and often small). Once the repertoire is decided upon, it is never changed. Addition of a new abstract character to a given repertoire is conceived of as creating a new repertoire, which then will be given its own catalogue number, constituting a new object.
For the Unicode Standard, on the other hand, the repertoire is inherently open. Because Unicode is intended to be the universal encoding, any abstract character that ever could be encoded is potentially a member of the actual set to be encoded, whether we currently know of that character or not.
Microsoft, for its Windows character sets, also makes use of a limited notion of open repertoires. The repertoires for particular character sets are periodically extended by adding a handful of characters to an existing repertoire. This recently occurred when the EURO SIGN was added to the repertoire for a number of Windows character sets, for example.
The Unicode Standard versions its repertoire by publication of major and minor editions of the standard: 1.0, 1.1, 2.0, 2.1, 3.0,... The repertoire for each version is defined by the enumeration of abstract characters included in that version. There was a major glitch between versions 1.0 and 1.1, occasioned by the merger with ISO/IEC 10646, but starting with version 2.0 and continuing forward indefinitely into future versions, no character once included is ever removed from the repertoire. (There are also update versions of the Unicode character database, such as 2.1.5. These update versions do not differ in character repertoire, but may amend character properties and behavior. For more information, see Versions of the Unicode Standard.)
ISO/IEC 10646 has a different mechanism for extending its repertoire. The 10646 repertoire is extended by a formal amendment process. As each individual amendment is balloted, approved, and published, it may constitute an extension to the 10646 repertoire, depending on the content of the amendment. The tricky part about keeping the repertoires of the Unicode Standard and of ISO/IEC 10646 in alignment is coordinating the publication of major versions of the Unicode Standard with publication of a well-defined list of amendments for 10646 (or a major revision and republication of 10646).
Repertoires are the things that in the IBM CDRA architecture get CS ("character set") values.
The distinction between characters and glyphs is important. A glyph is a particular image which represents a character or part of a character. It may have very different shapes: below are just some of the possibilities for the letter a. (In the examples below, a selection of alternatives is presented in different cells in the table.)
Glyphs do not correspond one-for-one with characters. For example, a sequence of f followed by i may be represented with a single glyph, called an fi ligature. Notice that the shapes are merged together, and the dot is missing from the i.
|Character Sequence||Sample Glyph|
On the other hand, the same image as the fi ligature could be achieved by a sequence of two glyphs with the right shapes. The choice of whether to use a single glyph or a sequence of two is up to the font containing the glyphs and the rendering software.
|Character Sequence||Possible Glyph Sequence|
Similarly, an accented character could be represented by a single glyph, or by different component glyphs positioned appropriately. In addition, the separate accents can also be considered characters in their own right, in which case a sequence of characters can also correspond to different possible glyph representations:
|Character Sequence||Possible Glyph Sequences|
In non-Latin languages, the connection between glyphs and characters may be even less direct. Glyphs may be required to change their shape and widths depending on the surrounding glyphs. These glyphs are called contextual forms. For example, see the Arabic glyphs below.
|Character||Possible Glyphs, depending on context|
Glyphs may also need to be widened for justification instead of simply adding width to the spaces. Ideally this would involve changing the shape of the glyph depending on the desired width. On some systems, this widening may be achieved by inserting extra connecting glyphs called kashidas. In such a case, a single character may conceivably correspond to a whole sequence of kashidas + glyphs + kashidas.
|Character||Sequence of glyphs|
In other cases a single character must correspond to two glyphs, because those two glyphs are positioned around other letters. See the Tamil characters below. If one of those glyphs forms a ligature with other characters, then we have a situation where a conceptual part of a character corresponds to a visual part of a glyph. If a character (or any part of it) corresponds to a glyph (or any part of it), then we say that the character contributes to the glyph.
The upshot is that the correspondence between glyphs and characters is not one-to-one, and cannot in general be predicted from the text. The ordering of glyphs will also not in general correspond to the ordering of the characters, because of right-to-left scripts like Arabic and Hebrew. Whether a particular string of characters is rendered by a particular sequence of glyphs will depend on the sophistication of the host operating system and the font.
It is important to note that for historical reasons, abstract character repertoires may include many entities that normally would not be considered appropriate members of an abstract character repertoire. These may include ligature glyphs, contextual form glyphs, glyphs that vary by width, sequences of characters, and adorned glyphs (such as circled numbers). Below are some examples where these are encoded as single characters in Unicode. As with glyphs, there are not necessarily one-to-one relationships between characters and code points.
What an end-user thinks of as a single character (aka a grapheme) may in fact be represented by multiple code points; conversely, a single code point may correspond to multiple characters. Here are some examples:
|Arabic contextual form glyphs|
|A single code point representing a sequence of three characters.|
|The Devanagari syllable ksha represented by three code points.|
|G-ring represented by two code points.|
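The cases listed above can be checked with a short Python sketch. The specific characters are assumptions chosen for illustration: U+2474 PARENTHESIZED DIGIT ONE stands in for "a single code point representing a sequence of three characters", and G-ring is spelled as g followed by U+030A COMBINING RING ABOVE. Python's `len` on a `str` counts code points, which is what makes the comparison possible.

```python
import unicodedata

# G-ring: one user-perceived character, two code points
# (base letter g + U+030A COMBINING RING ABOVE).
g_ring = "g\u030A"

# Devanagari syllable ksha: KA + VIRAMA + SSA, three code points.
ksha = "\u0915\u094D\u0937"

# Illustrative choice: U+2474 PARENTHESIZED DIGIT ONE is one code point whose
# compatibility decomposition is the three-character sequence "(1)".
parenthesized_one = "\u2474"
decomposed = unicodedata.normalize("NFKD", parenthesized_one)

print(len(g_ring), len(ksha), len(parenthesized_one), repr(decomposed))
```

Counting user-perceived characters (graphemes) rather than code points needs segmentation rules that this sketch does not attempt.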
Unlike most character repertoires, Unicode/10646 is deliberately intended to be universal in coverage. What this implies in practice, given the complexity of many writing systems, is that nearly all implementations will implement some subset of the total repertoire, rather than all the characters.
Formal subset mechanisms are occasionally seen in implementations of some Asian character sets, where for example, the distinction between "Level 1 JIS" and "Level 2 JIS" support refers to particular parts of the repertoire of the JIS X 0208 kanji characters to be included in the implementation.
Subsetting is a major formal aspect of ISO/IEC 10646-1. The standard includes a set of internal catalog numbers for named subsets, and further makes a distinction between subsets that are fixed collections and those that are open collections, defined by a range of code positions. (See Technical Corrigendum No. 2 to ISO/IEC 10646-1:1993(E) for details.) The collections that are defined by a range of code positions are themselves open subsets of the repertoire, since they could be extended at any time by an addition to the repertoire which happens to get encoded in a code position between the range limits which define such a collection.
The current TC304 project to define multilingual European subsets (MES-1, MES-2, MES-3A, and MES-3B) of ISO/IEC 10646-1 is a CEN effort to define three more subsets (each a fixed collection) that will, no doubt, at some point be added as named subsets in 10646.
For the Unicode Standard, subsets are nowhere formally defined. It is considered up to the implementation to define and support the subset of the universal repertoire that it wishes to interpret.
A coded character set is defined to be a mapping from a set of abstract characters to the set of non-negative integers. This range of integers need not be contiguous.
An abstract character is defined to be in a coded character set if the coded character set maps from it to an integer. That integer is said to be the code point for the abstract character. That abstract character is then an encoded character.
Effectively, coded character sets are the basic object that both ISO and vendor character encoding committees produce. They relate a defined repertoire to nonnegative integers, which then can be used unambiguously to refer to particular abstract characters from the repertoire.
The Unicode concept of the Unicode scalar value (cf. D28, in chapter 3 of the Unicode Standard) is explicitly this code point, used for mapping of the Unicode repertoire.
A coded character set may also be known as a character encoding, a coded character repertoire, a character set definition, or a code page.
The IBM CDRA architecture CP ("code page") values refer to coded character sets. (Note that this use of the term code page is quite precise and limited. It should not be, but generally is, confused with the generic use of code page to refer to character encoding schemes. See below.)
In the JTC1/SC2 context, coded character sets also require the assignment of unique character names to each abstract character in the repertoire to be encoded. This practice is not generally followed in vendor coded character sets or the encodings produced by standards committees outside SC2, where the names provided for characters, if any, are often variable and annotative, rather than normative parts of the character encoding.
The main rationale for the SC2 practice of character naming was to provide a mechanism to unambiguously identify abstract characters across different repertoires given different mappings to integers in different coded character sets. Thus LATIN SMALL LETTER A WITH GRAVE would be seen as the same abstract character, even when it occurred in different repertoires and was assigned different integers, depending on the particular coded character set.
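As a rough illustration of one abstract character receiving different integers in different coded character sets, the sketch below looks up LATIN SMALL LETTER A WITH GRAVE in three single-byte sets through Python's standard codecs. The choice of `latin-1`, `cp437`, and `mac_roman` is an assumption made for the example. Because each of these codecs emits exactly one byte per character, that byte value is taken here as the integer assigned by the underlying coded character set.

```python
import unicodedata

ch = "\u00E0"
name = unicodedata.name(ch)  # the character name, shared across the sets

# Single-byte codecs: the one byte produced is the integer the set assigns.
code_points = {
    codec: ch.encode(codec)[0]
    for codec in ("latin-1", "cp437", "mac_roman")
}
code_points["unicode"] = ord(ch)

print(name)
for ccs, value in code_points.items():
    print(f"{ccs:10} 0x{value:02X}")
```

The name stays constant while the integer varies, which is the identification problem that character names, GCGIDs, and later Unicode itself were used to solve.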
This functionality of ensuring character identity across different coded character sets (or "code pages") is handled in the IBM CDRA model instead by assigning a catalogue number, known as a GCGID (graphic character glyphic identifier), to every abstract character used in any of the repertoires accounted for by the CDRA. Abstract characters that have the same GCGID in two different coded character sets are by definition the same character. Other vendors have made use of similar internal identifier systems for abstract characters.
The advent of Unicode/10646 has largely rendered such schemes obsolete. The identity of abstract characters in all other coded character sets is increasingly being defined by reference to Unicode/10646 itself. Part of the pressure to include every "character" from every existing coded character set into the Unicode Standard results from the desire by many to get rid of subsidiary mechanisms for tracking bits and pieces, odds and ends that aren’t part of Unicode, and instead just make use of the Unicode Standard as the universal catalog of characters.
The range of nonnegative integers used for the mapping of abstract characters defines a related concept of code space. Traditional boundaries for types of code spaces are closely tied to the encoding forms (see below), since the mappings of abstract characters to nonnegative integers are not done arbitrarily, but with particular encoding forms in mind. Examples of significant code spaces are 0..7F, 0..FF, 0..FFFF, 0..10FFFF, 0..7FFFFFFF, 0..FFFFFFFF.
Code spaces can also have fairly elaborate structures, depending on whether the range of integers is conceived of as continuous, or whether particular ranges of values are disallowed. Most complications again result from considerations of encoding form; when an encoding form specifies that the integers used in encoding are to be realized as sequences of octets, there are often constraints placed on the particular values that those octets may have, mostly to avoid control code points. Expressed back in terms of code space, this results in multiple ranges of integers that are disallowed for mapping a character repertoire. (See [Lunde] for two-dimensional diagrams of typical code spaces for Asian coded character sets.)
A character encoding form is a mapping from the set of integers used in a CCS to the set of sequences of code units. A code unit is an integer occupying a specified binary width in a computer architecture, such as an 8-bit byte. The encoding form enables character representation as actual data in a computer. The sequences of code units do not necessarily have the same length.
A character encoding form for a coded character set is defined to be a character encoding form that maps all of the encoded characters for that coded character set.
Note: In many cases, there is only one character encoding form for a given coded character set. In some such cases only the character encoding form has been specified. This leaves the coded character set implicitly defined, based on an implicit relation between the code unit sequences and integers.
When interpreting a sequence of code units, there are three possibilities:
The encoding form for a CCS may result in either fixed-width or variable-width collections of code units associated with abstract characters. The encoding form may involve an arbitrary reversible mapping of the integers of the CCS to a set of code unit sequences.
Encoding forms come in various types. Some of them are exclusive to Unicode/10646, whereas others represent general patterns that are repeated over and over for hundreds of coded character sets. Here are some of the more important examples of encoding forms.
Examples of fixed-width encoding forms:
Examples of variable-width encoding forms:
The encoding form defines one of the fundamental relations that internationalized software cares about: how many code units are there for each character. This used to be expressed in terms of how many bytes each character was represented by. With the introduction of UCS-2, UTF-16, UCS-4, and UTF-32 with wider code units for Unicode and 10646, this is generalized to two pieces of information: a specification of the width of the code unit, and the number of code units used to represent each character.
UTF-8 provides a good example:
0x00..0x7F ==> 1 byte
0x80..0x7FF ==> 2 bytes
0x800..0xD7FF, 0xE000..0xFFFF ==> 3 bytes
0x10000..0x10FFFF ==> 4 bytes
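The UTF-8 ranges above, and the more general "code unit width plus code unit count" description, can be sketched in Python as follows. The function names `utf8_length` and `code_unit_count` are hypothetical helpers introduced for this example. The second function relies on Python's built-in codecs and uses the explicit little-endian variants only so that no byte order mark is counted.

```python
def utf8_length(code_point):
    """Number of bytes (8-bit code units) UTF-8 uses for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def code_unit_count(text, form):
    """Count code units for text in 'utf-8', 'utf-16', or 'utf-32'."""
    unit_bytes = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[form]
    # The explicit "-le" codecs are used so that no byte order mark is added.
    codec = form if form == "utf-8" else form + "-le"
    return len(text.encode(codec)) // unit_bytes


for ch in ("A", "\u00E9", "\u20AC", "\U00010400"):
    counts = [code_unit_count(ch, f) for f in ("utf-8", "utf-16", "utf-32")]
    print(f"U+{ord(ch):04X}", counts)
```

For U+10400 this reports four UTF-8 code units, two UTF-16 code units (a surrogate pair), and one UTF-32 code unit, which shows why both the width and the count are needed to describe an encoding form.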
Examples of encoding forms as applied to particular coded character sets:
A character encoding scheme is a mapping of code units into serialized byte sequences. Character encoding schemes are relevant to the issue of cross-platform persistent data involving code units wider than a byte, where byte-swapping may be required to put data into the byte polarity canonical for a particular platform. In particular:
It is important not to confuse a CEF and a CES.
The mapping from an abstract character repertoire to a serialized sequence of bytes is called a Character Map (CM). A simple character map thus implicitly includes a CCS, a CEF, and a CES, mapping from abstract characters to code units to bytes. A compound character map includes a compound CES, and thus includes more than one CCS and CEF. In that case, the abstract character repertoire for the character map is the union of the repertoires covered by the coded character sets involved.
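To make the composition concrete, here is a minimal sketch of a simple character map built from its three parts, using UTF-16 as the encoding form. The function names are hypothetical. `ord` stands in for the coded character set, `utf16_code_units` for the encoding form, and `serialize` for the encoding scheme. Input validation (for example, rejecting unpaired surrogates in the input string) is omitted.

```python
import struct


def utf16_code_units(code_point):
    """CEF step: map one Unicode scalar value to one or two 16-bit code units."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def serialize(code_units, big_endian=True):
    """CES step: write each 16-bit code unit as two bytes in the chosen order."""
    fmt = (">" if big_endian else "<") + "H" * len(code_units)
    return struct.pack(fmt, *code_units)


def character_map(text, big_endian=True):
    """CM: abstract characters -> integers -> code units -> bytes."""
    units = []
    for ch in text:
        units.extend(utf16_code_units(ord(ch)))  # ord() plays the role of the CCS
    return serialize(units, big_endian)


sample = "A\U00010400"
print(character_map(sample, big_endian=True).hex())
print(character_map(sample, big_endian=False).hex())
```

The code units are identical in both calls. Only the byte order chosen in the final step differs, which is the distinction between a CEF and a CES.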
Character Maps are the things that in the IAB architecture get IANA charset identifiers. The important thing, from the IANA charset point of view, is that a sequence of encoded characters must be unambiguously mapped onto a sequence of bytes by the charset. The charset must be specified in all instances, as in Internet protocols, where textual content is treated as an ordered sequence of bytes, and where the textual content must be reconstructible from that sequence of bytes.
Character Maps are also the things that in the IBM CDRA architecture get CCSID (coded character set identifier) values. A character map may also be known as a charset, a character set, a code page (broadly construed), or a CHARMAP.
In many cases, the same name is used for both a character map and for a character encoding scheme, such as UTF-16BE. Typically this is done for simple character maps when such usage is clear from context.
A transfer encoding syntax is a reversible transform of encoded data which may (or may not) include textual data represented in one or more character encoding schemes.
Typically TES’s are engineered either to:
The Internet Content-Transfer-Encoding tags "7bit" and "8bit" are special cases. These are data width specifications relevant basically to mail protocols and which appear to predate true TES’s like quoted-printable. Encountering a "7bit" tag doesn’t imply any actual transform of data; it merely is an indication that the charset of the data can be represented in 7 bits, and will pass 7-bit channels. It is really an indication of the encoding form. In contrast, quoted-printable actually does a conversion of various characters (including some ASCII) to forms like "=2D", "=20", etc., and should be reversed on receipt to regenerate legible text in the designated character encoding scheme.
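The reversible nature of a TES can be illustrated with Python's standard `quopri` and `base64` modules. This is a sketch only: the sample text and the choice of UTF-8 as the underlying character encoding scheme are assumptions made for the example. The transform operates on bytes and carries no knowledge of which charset produced them.

```python
import base64
import quopri

text = "caf\u00E9 = 3\u20AC"
payload = text.encode("utf-8")  # bytes in the designated encoding scheme

# Quoted-printable: non-ASCII bytes and "=" become "=XX" escapes.
qp = quopri.encodestring(payload)

# Base64: every byte is re-expressed using a 64-character ASCII alphabet.
b64 = base64.b64encode(payload)

# Both transforms must be reversed on receipt to recover the original bytes.
restored_from_qp = quopri.decodestring(qp)
restored_from_b64 = base64.b64decode(b64)

print(qp)
print(b64)
print(restored_from_qp.decode("utf-8") == text)
```

Decoding the TES yields bytes, not text. The charset label is still needed to interpret those bytes as characters.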
Most APIs are specified in terms of either code units or serialized bytes. Examples of the first are the Java String and char APIs, which use UTF-16 code units. Another example is the C and C++ wchar_t interfaces used for DBCS processing codes. For code units, the byte order of the platform is generally not relevant in the API; the same API can be compiled on platforms with any byte polarity, and will simply expect character data (as for any integral-based data) to be passed to the API in the byte polarity for that platform.
C and C++ char* APIs use serialized bytes, which could represent a variety of different character maps, including ISO Latin 1, UTF-8, Windows 1252, as well as compound character maps such as Shift-JIS or 2022-JP. A byte API could also handle UTF-16BE or UTF-16LE, which are serialized forms of Unicode. However, these APIs must allow for the existence of any byte value, and typically use memcpy plus length instead of strcpy for manipulating strings.
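The reason a byte API handling UTF-16BE or UTF-16LE cannot rely on zero-terminated string functions can be shown with a small Python sketch. The helper `c_string_length` is hypothetical. It only mimics what strlen would report by stopping at the first zero byte, and is not a binding to any C library.

```python
def c_string_length(buf):
    """Mimic strlen: count bytes up to (not including) the first zero byte."""
    index = buf.find(b"\x00")
    return len(buf) if index < 0 else index


utf8_bytes = "Hi".encode("utf-8")         # no zero bytes for this text
utf16be_bytes = "Hi".encode("utf-16-be")  # every ASCII letter carries a zero byte

print(len(utf8_bytes), c_string_length(utf8_bytes))
print(len(utf16be_bytes), c_string_length(utf16be_bytes))
```

For the UTF-16BE data the zero-terminated view sees an empty string, while the true length is four bytes. Carrying an explicit length alongside the buffer avoids the problem.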
The main body of this document consists of an attempt at detailed definition of several terms related to character encoding. This section merely clarifies acronyms and a few other subsidiary terms used in various contexts.
[CDRA] Character Data Representation Architecture Reference and Registry, IBM Corporation, Second Edition, December 1995. IBM document SC09-2190-00
[Lunde] Lunde, Ken, CJKV Information Processing, O'Reilly, 1999
[RFC2130] The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996. C. Weider, et al., April 1997
[RFC2277] IETF Policy on Character Sets and Languages, H. Alvestrand, January 1998
[W3CCharMod] Character Model for the World Wide Web, http://www.w3.org/TR/WD-charmod
Copyright © 1999-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.