L2/99-326

Re: Proposal for Unicode Character Mapping Format Description

From: Mark Davis

Date: 1999-10-18

I had started writing this to document the format used for the mapping tables on the website. These formats are somewhat ad-hoc, but I tried to follow what existed on the website where possible. I did introduce a few new conventions that would allow the data to be fleshed out in a standard way in the future.


Unicode Character Mapping Formats

Draft, 10-15 MED

The Unicode website provides character mappings to and from Unicode for a number of different code pages for use in character code conversion (also called charset or code page transcoding). Some of these mappings are supplied by the Unicode Consortium; others are supplied directly by the vendors. This document describes the format for those files.

In all cases, comments start with '#', and continue to the end of the line. For historical reasons, a number of comment lines may actually have significant data; these are described below. Line ends may be CR, LF, or CRLF. Whitespace between items consists of one or more spaces or tabs in arbitrary combination. Whitespace between '#' and the following comment is optional.
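
As an illustration of the lexical rules above, here is a minimal Python sketch of a line reader. The function names read_lines and split_line are hypothetical and are not part of any format definition; the sketch only assumes what the paragraph states: CR, LF, or CRLF line ends, '#' comments, and items separated by spaces or tabs.

```python
import re

def read_lines(text):
    """Return the lines of text, accepting CR, LF, or CRLF line ends."""
    return re.split(r"\r\n|\r|\n", text)

def split_line(line):
    """Split one line into (items, comment).

    items is the list of fields before '#', separated by any mix of
    spaces and tabs. comment is the text after '#', with surrounding
    whitespace removed, or None when the line has no '#'.
    """
    data, sep, comment = line.partition("#")
    items = data.split()
    return items, (comment.strip() if sep else None)
```

The comment is returned rather than discarded because, as noted above, some comment lines carry significant data.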

This file describes the recommended format for the mapping files, and gives directions for creating new files. Most of the files on the Unicode site are supplied by vendors, and may not yet have been updated to this format.

Before discussing the format, it is important to review some necessary concepts.

Illegal and Unassigned Codes

Client software may need to distinguish the different types of mismatches that can occur when transcoding data to and from Unicode. These fall into the following categories:

  1. The sequence is unassigned (aka undefined).
    For example,
  2. The sequence is incomplete.
    For example,
  3. The sequence is illegal.
    For example,

Unassigned characters are treated as a single code point: for example, 0xA3BF is treated as a single code point when mapping into Unicode from CP950. The actual conversion routines will typically handle an unassigned value in a variety of ways (depending on the parameters passed in), such as:

Note that there is an important difference between a sequence that represents a real REPLACEMENT CHARACTER in a legacy set and a sequence that is merely unassigned, and is sometimes mapped to REPLACEMENT CHARACTER for that reason.

Illegal values represent some corruption of the data stream. Conversion routines may be directed to handle this in a different way than by replacement characters. For example, a routine might map unassigned characters to a substitution character, but throw an exception on illegal values.
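
The policy in the last example can be sketched as follows for a single-byte encoding. This is only one possible design, not a prescribed API: the names to_unicode and IllegalSequenceError, the table and illegal parameters, and the choice of U+FFFD as the default substitution character are all assumptions made for the illustration.

```python
class IllegalSequenceError(ValueError):
    """Raised when the input contains a value marked as illegal."""

def to_unicode(data, table, illegal, substitute="\uFFFD"):
    """Convert single-byte data to a Unicode string.

    table maps assigned byte values to one-character strings.
    illegal is the set of byte values that indicate corrupt data.
    Any other byte is unassigned and is replaced by substitute.
    """
    out = []
    for offset, byte in enumerate(data):
        if byte in illegal:
            raise IllegalSequenceError(
                "illegal byte 0x%02X at offset %d" % (byte, offset))
        out.append(table.get(byte, substitute))
    return "".join(out)
```

With a table containing only 'A' and 'm', the bytes 0x41,0x6D,0x80 would convert to "Am" followed by the substitution character, while any input containing 0xFF would raise the exception if 0xFF is in the illegal set.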

Completeness

It is important that a mapping file be a complete description. From the data in the file, it should be possible to tell for any sequence of bytes whether that sequence is assigned, unassigned, incomplete, or illegal.

Header

A mapping file begins with a comment header with the following information.

#    Name:     <name of the encoding>
#    Description:  <should be complete enough to distinguish from others>
#    Ordering: <visual or logical for bidirectional scripts>
#    Aliases:  <common alternative names for the encoding>
#    Unicode version: <Unicode version>
#    Table version: <current table version>
#    Date:         <date: ideally YYYY-MM-DD>
#    Contact:       <email for maintainers>
#    Change History:
#	<comments on changes from previous to current>
#...
#	<comments on changes from first to second>
#    General notes:
#       <optional material>
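
A reader for this header might look like the following sketch. The function name parse_header is hypothetical, and the sketch assumes that the header ends at the first line that does not start with '#', and that a comment line without one of the field names above continues the previous field.

```python
import re

FIELDS = ("Name", "Description", "Ordering", "Aliases", "Unicode version",
          "Table version", "Date", "Contact", "Change History",
          "General notes")
_FIELD_RE = re.compile(
    r"#\s*(%s):\s*(.*)" % "|".join(re.escape(f) for f in FIELDS))

def parse_header(lines):
    """Collect header fields from the leading comment lines.

    lines are the file's lines without line ends. Returns a dict from
    field name to text; multi-line fields are joined with newlines.
    """
    header = {}
    current = None
    for line in lines:
        if not line.startswith("#"):
            break                      # assumed end of the header
        match = _FIELD_RE.match(line)
        if match:
            current = match.group(1)
            value = match.group(2).strip()
            header[current] = [value] if value else []
        elif current is not None:
            text = line[1:].strip()
            if text:
                header[current].append(text)   # continuation line
    return {name: "\n".join(parts) for name, parts in header.items()}
```

Matching only the known field names keeps a continuation line that happens to contain a colon from being mistaken for a new field.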

Notes:

Data

The header is then followed by the data, which is in one of the following forms.

Basic Data Line

A basic data line has the following form:

<encoding values> <Unicode values> # <name comment>
Notes:

Examples:

0x6D         0x006D                 # LATIN SMALL LETTER M
0x8591       0xF860,0x0030,0x002E   # DIGIT ZERO FULL STOP
0x85,0x61    0x00C0                 # LATIN CAPITAL LETTER A WITH GRAVE
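
The three example lines can be read with a sketch like the one below. The name parse_data_line is hypothetical. The sketch assumes that a multi-byte encoding value may be written either as one hex number (0x8591) or as comma-separated bytes (0x85,0x61), that both denote a byte sequence, and that each hex token has an even number of digits.

```python
def _hex_bytes(token):
    """Turn a token such as '0x8591' into the bytes 0x85, 0x91."""
    if not token.lower().startswith("0x"):
        raise ValueError("expected a 0x-prefixed value: %r" % token)
    return bytes.fromhex(token[2:])

def parse_data_line(line):
    """Parse a basic data line into (encoding bytes, Unicode string, comment)."""
    data, _, comment = line.partition("#")
    fields = data.split()
    if len(fields) != 2:
        raise ValueError("expected two fields before the comment: %r" % line)
    enc_field, uni_field = fields
    encoding = b"".join(_hex_bytes(tok) for tok in enc_field.split(","))
    text = "".join(chr(int(tok, 16)) for tok in uni_field.split(","))
    return encoding, text, comment.strip()
```

Under these assumptions the second example yields the two bytes 0x85, 0x91 mapped to a three-character Unicode string.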

Range Data Line

Data can also be given as ranges, by separating the first and last values with "-". This is most useful for mapping user-defined characters to the private use zones. Range data lines have the following format:

<encoding range> <Unicode range> # <comment>

The number of values on each side must match.

Examples:

0xF040-0xF07E    0xE000-0xE03E # first user defined range
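
A range line can be expanded into individual mappings as in the sketch below. The name expand_range is hypothetical, and the sketch assumes that values inside a range are paired by simple integer increment on both sides, which is adequate for a range such as the one above where only the last byte varies.

```python
def expand_range(line):
    """Expand a range data line into a list of (encoding, Unicode) pairs."""
    data = line.partition("#")[0]
    enc_field, uni_field = data.split()
    enc_lo, enc_hi = (int(x, 16) for x in enc_field.split("-"))
    uni_lo, uni_hi = (int(x, 16) for x in uni_field.split("-"))
    if enc_hi - enc_lo != uni_hi - uni_lo:
        raise ValueError("range sizes differ: %r" % line)
    count = enc_hi - enc_lo + 1
    return [(enc_lo + i, uni_lo + i) for i in range(count)]
```

For the example line this gives 63 pairs, from (0xF040, 0xE000) to (0xF07E, 0xE03E).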

Illegal/Incomplete Sequences

Bytes that are only valid in combination with certain subsequent bytes are marked as:

0x81 #DBCS LEAD BYTE

The possible bytes that are valid following lead bytes are marked as:

0x40 #DBCS TRAIL BYTE

Range notation can also be used:

0x81-0x9F   #DBCS LEAD BYTE
0xE0-0xFC   #DBCS LEAD BYTE
0x40-0x7E   #DBCS TRAIL BYTE
0x80-0xFC   #DBCS TRAIL BYTE

Note that if 0xFC were not marked as a lead byte, then the sequence 0xFC,0x41 would be treated as two values: a single unassigned value, followed by the character 'A'.

Illegal values are marked as follows:

0xFF        #ILLEGAL

Unassigned characters usually do not need to be marked explicitly. The one case where this may be required is in a Delta description (see Delta Mappings), where a value that is assigned in the imported mapping has to be marked as unassigned:

0xFF        #UNDEFINED
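
Taken together, these markings allow a sequence to be classified as assigned, unassigned, incomplete, or illegal. The sketch below uses the lead and trail byte ranges from the example above. The names byte_set and classify are hypothetical, and two points are assumptions rather than statements of the format: a lead byte at the end of the input is reported as incomplete, and a lead byte followed by a byte that is not a trail byte is reported as illegal. The input is assumed to be non-empty.

```python
def byte_set(*ranges):
    """Build a set of byte values from inclusive (low, high) ranges."""
    return {b for low, high in ranges for b in range(low, high + 1)}

LEAD = byte_set((0x81, 0x9F), (0xE0, 0xFC))      # DBCS LEAD BYTE lines
TRAIL = byte_set((0x40, 0x7E), (0x80, 0xFC))     # DBCS TRAIL BYTE lines
ILLEGAL = {0xFF}                                 # ILLEGAL lines

def classify(seq, assigned):
    """Classify the value at the start of the non-empty bytes object seq.

    assigned is the set of byte sequences that have a data line.
    """
    first = seq[0]
    if first in ILLEGAL:
        return "illegal"
    if first in LEAD:
        if len(seq) < 2:
            return "incomplete"        # assumption: input ended early
        if seq[1] not in TRAIL:
            return "illegal"           # assumption: bad trail byte
        value = seq[:2]
    else:
        value = seq[:1]
    return "assigned" if value in assigned else "unassigned"
```

With these sets, 0xFC,0x41 is classified as one unassigned double-byte value rather than as an unassigned byte followed by 'A'.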

Fallback Characters

In some cases, users have the option of using fallback characters, where a character that is not represented in the target code page is given a "best fit" mapping. For example, an encoding might not have curly quotes; the generic quotes can be used as a fallback. Fallback mappings should be provided before the main mappings, because any later data line overrides any earlier one. For readability, the fallback section should be marked with comments, as below.

For example, the following indicates that 0x22 should be mapped to U+0022. When mapping back from Unicode, U+0022 is mapped to 0x22; in addition, if the fallback option is on, then U+201D and U+201C are also mapped to 0x22.

#FALLBACKS

0x22    0x201C     # LEFT DOUBLE QUOTATION MARK
0x22    0x201D     # RIGHT DOUBLE QUOTATION MARK

#MAIN MAPPINGS

...
0x22    0x0022     # QUOTATION MARK
...
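
One way to read the override rule is sketched below: data lines are applied in file order, so the last line for an encoding value decides the mapping into Unicode, while every line contributes a mapping from Unicode when the fallback option is on. The name build_tables and the three returned tables are assumptions made for this illustration.

```python
def build_tables(pairs):
    """Build conversion tables from (encoding, Unicode) pairs in file order.

    Returns (to_unicode, from_unicode_strict, from_unicode_fallback).
    """
    to_unicode = {}
    from_unicode_fallback = {}
    for enc, uni in pairs:
        to_unicode[enc] = uni              # later lines override earlier ones
        from_unicode_fallback[uni] = enc   # every line maps back
    # Without fallbacks, only the surviving (main) mappings map back.
    from_unicode_strict = {uni: enc for enc, uni in to_unicode.items()}
    return to_unicode, from_unicode_strict, from_unicode_fallback
```

Applied to the three lines above, 0x22 maps to U+0022, the strict reverse table contains only U+0022, and the fallback reverse table also maps U+201C and U+201D to 0x22.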

Delta Mappings

In some cases, a mapping is only a minor variation of another mapping. When this is the case, the variation can be described without copying the entire contents of the file, by supplying just the source mapping and the changed lines. As it turns out, the overriding can be described very simply using an IMPORT style mechanism and the same overriding used for Fallbacks, using the following format.

#IMPORT    ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/CP852.TXT

0x1A 0x001C
0x1C 0x007F
0x7F 0x001A
0xAA #UNDEFINED
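
Applying such a delta to an already loaded base table can be sketched as follows. The name apply_delta is hypothetical, fetching the file named on the #IMPORT line is left out, and the base table is assumed to be a dict from encoding value to Unicode value for single-value lines only.

```python
def apply_delta(base, delta_lines):
    """Return a copy of base with the delta's data lines applied.

    A line with two fields overrides the base mapping; a line with one
    field and the comment UNDEFINED removes the mapping. Other comment
    lines, including #IMPORT, are skipped here.
    """
    table = dict(base)
    for line in delta_lines:
        data, _, comment = line.partition("#")
        fields = data.split()
        if not fields:
            continue                       # blank or comment-only line
        code = int(fields[0], 16)
        if len(fields) == 1:
            if comment.strip() == "UNDEFINED":
                table.pop(code, None)
        else:
            table[code] = int(fields[1], 16)
    return table
```

For the delta above, the mappings for 0x1A, 0x1C, and 0x7F are replaced and 0xAA becomes unassigned, while every other entry of the base table is kept unchanged.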