Character Mapping Formats

1999-11-02 MED

The Unicode website provides character mappings to and from Unicode for a number of different code pages for use in character code conversion (also called code page transcoding). Some of these mappings are supplied by the Unicode Consortium; others are supplied directly by the vendors. This document describes the format for those files. Most of the files on the Unicode site are supplied by vendors, and may not yet have been updated to this format.

Note: The Unicode Technical Committee is discussing a proposal in UTR #22: Character Mapping Tables, which will provide a more complete solution to the reliable exchange of data for character mappings. That report also defines terms that are used here.

File Format

In all cases, comments start with '#' and continue to the end of the line. For historical reasons, some comment lines carry significant data; these are described below. Line ends may be CR, LF, or CRLF. Whitespace between items consists of one or more spaces or tabs in arbitrary combination. Whitespace between '#' and the following comment is optional.
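The lexical rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names split_line and read_lines are hypothetical and are not part of any Unicode tool.

```python
import re

def split_line(line):
    """Split one mapping-file line into (items, comment).

    Hypothetical helper: the name and return shape are assumptions.
    """
    # Everything from the first '#' to the end of the line is a comment.
    data, sep, comment = line.partition("#")
    # Items are separated by one or more spaces or tabs, in any combination.
    items = [item for item in re.split(r"[ \t]+", data) if item]
    # Whitespace after '#' is optional, so it is stripped from the comment.
    return items, (comment.strip() if sep else None)

def read_lines(text):
    """Split text on CR, LF, or CRLF and lex each line."""
    return [split_line(line) for line in re.split(r"\r\n|\r|\n", text)]
```

Because some comments carry significant data, the comment text is returned rather than discarded; a caller can decide later whether it is a name comment or a marker.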

Header

A mapping file begins with a comment header with the following information.

#    Name:            <name of the encoding>
#    Description:     <should be complete enough to distinguish from others>
#    Ordering:        <visual or logical for bidirectional scripts>
#    Aliases:         <common alternative names for the encoding>
#    Unicode version: <Unicode version>
#    Table version:   <current table version>
#    Date:            <date: ideally YYYY-MM-DD>
#    Contact:         <email for maintainers>
#    Change History:
#       <comments on changes from previous to current>
#       ...
#       <comments on changes from first to second>
#    General notes:
#       <optional material>
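A reader of these files might collect the header fields as follows. This is a minimal sketch, assuming that the field names are spelled exactly as in the template and that the header ends at the first line that is not a comment; parse_header and HEADER_FIELDS are hypothetical names.

```python
# Field names taken from the header template; exact spelling is assumed.
HEADER_FIELDS = (
    "Name", "Description", "Ordering", "Aliases", "Unicode version",
    "Table version", "Date", "Contact", "Change History", "General notes",
)

def parse_header(lines):
    """Return a dict mapping each header field to a list of its text lines."""
    header = {}
    current = None
    for line in lines:
        if not line.startswith("#"):
            break  # assumption: the header ends at the first non-comment line
        text = line[1:].strip()
        key, sep, value = text.partition(":")
        if sep and key.strip() in HEADER_FIELDS:
            # A recognised "Field: value" line starts a new field.
            current = key.strip()
            header[current] = [value.strip()] if value.strip() else []
        elif current is not None and text:
            # Any other non-empty comment line continues the current field.
            header[current].append(text)
    return header
```

Multi-line fields such as Change History are kept as lists so that each change comment stays a separate entry.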

The header is then followed by the data, which is in one of the following forms.

Basic Data Line

A basic data line has the following form:

<encoding values> <Unicode values> # <name comment>

Examples:

0x6D         0x006D                 # LATIN SMALL LETTER M
0x8591       0xF860,0x0030,0x002E   # DIGIT ZERO FULL STOP
0x85,0x61    0x00C0                 # LATIN CAPITAL LETTER A WITH GRAVE
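The examples above could be parsed roughly as follows. This is a sketch, not a reference parser: parse_basic_line is a hypothetical name, and each comma-separated item is simply read as one hexadecimal integer without interpreting how it maps onto bytes.

```python
def parse_values(field):
    """Parse a comma-separated list such as '0xF860,0x0030' into integers."""
    # int(..., 16) accepts the '0x' prefix used in the mapping files.
    return [int(item, 16) for item in field.split(",")]

def parse_basic_line(line):
    """Return (encoding values, Unicode values, name comment) for one line."""
    data, _, comment = line.partition("#")
    fields = data.split()
    if len(fields) != 2:
        raise ValueError("expected <encoding values> <Unicode values>: %r" % line)
    return parse_values(fields[0]), parse_values(fields[1]), comment.strip()
```

Keeping both sides as lists lets one data line describe a mapping to or from a sequence of values, as in the second and third examples.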

Range Data Line

Data can also be given as ranges, by separating the first and last values with "-". This is most useful for mapping user-defined characters to the private use zones. Range data lines have the following format:

<encoding range> <Unicode range> # <comment>

The number of values on each side must match.

Example:

0xF040-0xF07E    0xE000-0xE03E # first user defined range
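A range line can be expanded into individual mappings as sketched below. The sketch assumes that the values within a range are consecutive integers on both sides, which holds for the example above but is not stated for ranges in general; expand_range_line is a hypothetical name.

```python
def parse_range(field):
    """Parse '0xF040-0xF07E' (or a single value) into a range of integers."""
    first, _, last = field.partition("-")
    low = int(first, 16)
    high = int(last, 16) if last else low
    return range(low, high + 1)

def expand_range_line(line):
    """Return a dict from encoding value to Unicode value for one range line."""
    data = line.partition("#")[0]
    enc_field, uni_field = data.split()
    enc, uni = parse_range(enc_field), parse_range(uni_field)
    # The number of values on each side must match.
    if len(enc) != len(uni):
        raise ValueError("range lengths differ: %d vs %d" % (len(enc), len(uni)))
    return dict(zip(enc, uni))
```

For the example line, both ranges contain 63 values, so 0xF040 maps to 0xE000 and 0xF07E maps to 0xE03E.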

Illegal/Incomplete Sequences

Bytes that are only valid in combination with certain subsequent bytes are marked as:

0x81 #DBCS LEAD BYTE

The possible bytes that are valid following lead bytes are marked as:

0x40 #DBCS TRAIL BYTE

Range notation can also be used:

0x81-0x9F   #DBCS LEAD BYTE
0xE0-0xFC   #DBCS LEAD BYTE
0x40-0x7E   #DBCS TRAIL BYTE
0x80-0xFC   #DBCS TRAIL BYTE

Note that if 0xFC were not marked as a lead byte, then the sequence 0xFC,0x41 would be treated as two values: a single unassigned value, followed by the character 'A'.
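The effect of the lead-byte and trail-byte markings on how a byte stream is divided can be sketched as follows. The sets come from the range lines above; the function name segment, and the choice to emit an unpaired lead byte as a single value, are assumptions made for illustration.

```python
# Byte sets taken from the range notation example above.
LEAD = set(range(0x81, 0xA0)) | set(range(0xE0, 0xFD))
TRAIL = set(range(0x40, 0x7F)) | set(range(0x80, 0xFD))

def segment(data, lead=LEAD, trail=TRAIL):
    """Split a bytes object into single-byte and double-byte values."""
    units = []
    i = 0
    while i < len(data):
        if data[i] in lead and i + 1 < len(data) and data[i + 1] in trail:
            # A lead byte followed by a valid trail byte forms one value.
            units.append(data[i:i + 2])
            i += 2
        else:
            # Anything else, including an unpaired lead byte, stands alone.
            units.append(data[i:i + 1])
            i += 1
    return units
```

With these sets, segment(b"\xfc\x41") yields one two-byte value. Calling it with lead=LEAD - {0xFC} yields two one-byte values instead, the second being the byte for 'A'.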

Illegal values are marked as follows:

0xFF        #ILLEGAL

Unassigned characters do not need to be marked explicitly. For clarity, however, they may be marked with the following comment.

0xFF        #UNDEFINED
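Because these markers live in comments, a parser has to inspect the comment text instead of discarding it. The following sketch recognises the four markers shown in this section; parse_marker_line is a hypothetical name, and matching the comment text exactly is an assumption.

```python
# Marker comments described in this section; exact spelling is assumed.
MARKERS = ("DBCS LEAD BYTE", "DBCS TRAIL BYTE", "ILLEGAL", "UNDEFINED")

def parse_marker_line(line):
    """Return (marker, range of byte values), or None for any other line."""
    data, sep, comment = line.partition("#")
    fields = data.split()
    label = comment.strip()
    # A marker line has exactly one value (or range) and a known comment.
    if not sep or len(fields) != 1 or label not in MARKERS:
        return None
    first, _, last = fields[0].partition("-")
    low = int(first, 16)
    high = int(last, 16) if last else low
    return label, range(low, high + 1)
```

Ordinary data lines return None here, so the same input can be passed on to a data-line parser.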