L2/99-326

Re: Proposal for Unicode Character Mapping Format Description

From: Mark Davis

Date: 1999-10-18

I had started writing this to document the format used for the mapping tables on the website. These formats are somewhat ad-hoc, but I tried to follow what existed on the website where possible. I did introduce a few new conventions that would allow the data to be fleshed out in a standard way in the future.

Unicode Character Mapping Formats

Draft, 10-15 MED

The Unicode website provides character mappings to and from Unicode for a number of different code pages for use in character code conversion (also called charset or code page transcoding). Some of these mappings are supplied by the Unicode Consortium; others are supplied directly by the vendors. This document describes the format for those files.

In all cases, comments start with '#', and continue to the end of the line. For historical reasons, a number of comment lines may actually have significant data; these are described below. Line ends may be CR, LF, or CRLF. Whitespace between items consists of one or more spaces or tabs in arbitrary combination. Whitespace between '#' and the following comment is optional.

This file describes the recommended format for the mapping files, and gives directions for creating new files. Most of the files on the Unicode site are supplied by vendors, and may not yet have been updated to this format.

Before discussing the format, it is important to review some necessary concepts.

Illegal and Unassigned Codes

Client software may need to distinguish the different types of mismatches that can occur when transcoding data to and from Unicode. These fall into the following categories:

The sequence is unassigned (aka undefined).
For example,
- 0xA3BF is unassigned in CP950
- 0x0EDD is unassigned in Unicode, V3.0
The sequence is incomplete.
For example,
- 0xA3 is incomplete in CP950.
  - Unless followed by another byte of the right form, it is illegal.
- 0xD800 is incomplete in Unicode.
  - Unless followed by another value of the right form, it is illegal.
- 0xDC00 is incomplete in Unicode.
  - Unless preceded by another value of the right form, it is illegal.
The sequence is illegal.
For example,
- 0xFF is illegal in CP950

Unassigned characters are treated as a single code point: for example, 0xA3BF is treated as a single code point when mapping into Unicode from CP950. The actual conversion routines will typically handle an unassigned value in a variety of ways (depending on the parameters passed in), such as:

stop or throw an exception
- in particular, this is commonly used by higher level character encodings, such as ISO 2022 conversions, to know when to stop converting into one set and pick another to convert to.
map it to a substitution character
- such as the Unicode U+FFFD REPLACEMENT CHARACTER
represent it by a hex escape sequence
- for example, when mapping from U+1234 to other code pages, it can be represented by "%12%34" in URLs, "&#x1234;" in XML or HTML, "\u1234" in Java or C++, or "\x{1234}" in Perl.
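
The three options above can be sketched as a small policy function. This is only an illustration: the names `on_unassigned`, `escape_code_point`, and the policy strings are hypothetical, and the URL form assumes a 16-bit value split into two bytes, as in the "%12%34" example.

```python
REPLACEMENT_CHARACTER = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def escape_code_point(cp: int, style: str) -> str:
    """Render a code point as one of the hex escapes listed above."""
    if style == "url":
        # Assumes a 16-bit value; each of its two bytes is escaped.
        return "".join(f"%{b:02X}" for b in cp.to_bytes(2, "big"))
    if style == "xml":
        return f"&#x{cp:04X};"
    if style == "java":
        return f"\\u{cp:04X}"
    if style == "perl":
        return f"\\x{{{cp:04X}}}"
    raise ValueError(f"unknown escape style: {style}")


def on_unassigned(cp: int, policy: str) -> str:
    """Apply a caller-chosen policy to a code point that has no mapping."""
    if policy == "stop":
        raise LookupError(f"U+{cp:04X} has no mapping")
    if policy == "substitute":
        return REPLACEMENT_CHARACTER
    return escape_code_point(cp, policy)
```

A caller such as an ISO 2022 converter would use the "stop" policy and catch the exception to switch to another character set.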

Note that there is an important difference between the case where a sequence represents a real REPLACEMENT CHARACTER in a legacy set, as opposed to just being unassigned, and thereby sometimes being mapped to REPLACEMENT CHARACTER for that reason.

Illegal values represent some corruption of the data stream. Conversion routines may be directed to handle this in a different way than by replacement characters. For example, a routine might map unassigned characters to a substitution character, but throw an exception on illegal values.

Completeness

It is important that a mapping file be a complete description. From the data in the file, it should be possible to tell for any sequence of bytes whether that sequence is assigned, unassigned, incomplete, or illegal.

Unless otherwise indicated in the data file, any sequences of bytes that are not mentioned are assumed to be unassigned.
All control values (C0, C1) should be explicitly mapped.
All private use (e.g. user defined) characters should be explicitly mapped, either to the private use zone in Unicode, or to the correct characters outside of that zone.
Only a real replacement character should be mapped to REPLACEMENT CHAR; unassigned characters should not be mapped to it. Similarly, when mapping back from Unicode, only the REPLACEMENT CHAR should map to SUB or other legacy equivalent.
Incomplete and illegal sequences should be indicated.

Header

A mapping file begins with a comment header with the following information.

#    Name:     <name of the encoding>
#    Description:  <should be complete enough to distinguish from others>
#    Ordering: <visual or logical for bidirectional scripts>
#    Aliases:  <common alternative names for the encoding>
#    Unicode version: <Unicode version>
#    Table version: <current table version>
#    Date:         <date: ideally YYYY-MM-DD>
#    Contact:       <email for maintainers>
#    Change History:
#	<comments on changes from previous to current>
#...
#	<comments on changes from first to second>
#    General notes:
#       <optional material>

Notes:

Unfortunately, code page names are not unique: different sources differ on the precise mappings for Shift-JIS, for example. The description should describe the code page in enough detail to distinguish it from other code pages. E.g. "Macintosh variant of Japanese Shift-JIS".
When multiple values are given, such as for maintainers, they are comma-separated (with optional whitespace).
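
As a rough sketch of how a client might read this header, the following collects the "Field: value" comment lines into a dictionary. The function names and the regular expression are assumptions made for illustration; free-form lines such as the change history entries are simply skipped.

```python
import re

# "#", optional whitespace, a field name made of letters and spaces, ":", value.
HEADER_FIELD = re.compile(r"^#\s*([A-Za-z][A-Za-z ]*?):\s*(.*)$")


def parse_header(lines):
    """Collect header fields from the leading comment lines of a mapping file."""
    fields = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        match = HEADER_FIELD.match(line.rstrip())
        if match and match.group(2):
            fields[match.group(1)] = match.group(2)
    return fields


def split_values(value):
    """Split a comma-separated field, ignoring optional whitespace."""
    return [item.strip() for item in value.split(",")]
```

For instance, `split_values(parse_header(lines)["Contact"])` would give the list of maintainer addresses.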

Data

The header is then followed by the data, which is in one of the following forms.

Basic Data Line

A basic data line has the following form:

<encoding values> <Unicode values> # <name comment>

Notes:

The values are hexadecimal.
- They should be preceded by "0x" (some existing files omit it).
Multiple values are separated by commas.
- Some files may use "+" instead of commas.
- A series of two bytes may omit the comma. That is, 0x8591 = 0x85,0x91
The values can optionally be followed by a semicolon.
The name comment is optional, but strongly recommended for readability. (This does not apply to blocks in Unicode that have algorithmic names, such as CJK Unified Ideographs.)
The data lines can occur in any order (but see Fallbacks, below).

Examples:

0x6D         0x006D                 # LATIN SMALL LETTER M
0x8591       0xF860,0x0030,0x002E   # DIGIT ZERO FULL STOP
0x85,0x61    0x00C0                 # LATIN CAPITAL LETTER A WITH GRAVE
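
A minimal parser for lines of this shape might look as follows. It is a sketch under the rules in the notes above (optional "0x", "," or "+" separators, optional trailing semicolon, two bytes written without a comma); the function names are hypothetical.

```python
def _items(field):
    """Split a value field on ',' or '+', dropping an optional trailing ';'."""
    return field.rstrip(";").replace("+", ",").split(",")


def _hex_digits(item):
    """Strip an optional '0x' prefix."""
    return item[2:] if item.lower().startswith("0x") else item


def parse_encoding_values(field):
    """Return the encoding side as bytes; 0x8591 and 0x85,0x91 are equivalent."""
    return b"".join(bytes.fromhex(_hex_digits(item)) for item in _items(field))


def parse_unicode_values(field):
    """Return the Unicode side as a string of one or more characters."""
    return "".join(chr(int(_hex_digits(item), 16)) for item in _items(field))


def parse_data_line(line):
    """Return (encoding bytes, Unicode string, name comment), or None."""
    body, _, comment = line.partition("#")
    fields = body.split()
    if len(fields) != 2:
        return None  # blank line, comment-only line, or another line type
    return (
        parse_encoding_values(fields[0]),
        parse_unicode_values(fields[1]),
        comment.strip(),
    )
```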

Range Data Line

Data can also be given as ranges, by separating values with "-". This is most useful for mapping user-defined characters to the private use zones. They have the following format:

<encoding range> <Unicode range> # <comment>

The number of values on each side must match.

Examples:

0xF040-0xF07E    0xE000-0xE03E # first user defined range
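
A range line can be expanded into individual pairs along the following lines. The sketch treats each side as a plain run of consecutive integers, which is an assumption: it holds for the example above, but a range that crosses a trail-byte boundary in a double-byte encoding would need extra handling. The name `expand_range` is hypothetical.

```python
def expand_range(encoding_range, unicode_range):
    """Yield (encoding value, Unicode value) pairs for one range data line."""
    enc_lo, enc_hi = (int(v, 16) for v in encoding_range.split("-"))
    uni_lo, uni_hi = (int(v, 16) for v in unicode_range.split("-"))
    if enc_hi - enc_lo != uni_hi - uni_lo:
        raise ValueError("the number of values on each side must match")
    for offset in range(enc_hi - enc_lo + 1):
        yield enc_lo + offset, uni_lo + offset


pairs = list(expand_range("0xF040-0xF07E", "0xE000-0xE03E"))
```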

Illegal/Incomplete Sequences

Bytes that are only valid in combination with certain subsequent bytes are marked as:

0x81 #DBCS LEAD BYTE

The possible bytes that are valid following lead bytes are marked as:

0x40 #DBCS TRAIL BYTE

Range notation can also be used:

0x81-0x9F   #DBCS LEAD BYTE
0xE0-0xFC   #DBCS LEAD BYTE
0x40-0x7E   #DBCS TRAIL BYTE
0x80-0xFC   #DBCS TRAIL BYTE

Note that if 0xFC were not marked as a lead byte, then the sequence 0xFC,0x41 would be treated as two values: a single unassigned value, followed by the character 'A'.

Illegal values are marked as follows:

0xFF        #ILLEGAL

Unassigned characters usually do not need to be marked explicitly. The one case where this may be required is in Delta descriptions (see Delta Mappings, below), where a value that is assigned in the imported mapping must be marked as unassigned.

0xFF        #UNDEFINED
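
To illustrate how these markers could drive a converter, the sketch below splits a byte stream into units using the lead, trail, and illegal byte sets from the examples above. The function name and labels are hypothetical, and the sketch only delimits sequences; deciding whether a delimited sequence is assigned or unassigned still requires a lookup in the mapping data.

```python
from typing import Iterator, Tuple

# Byte sets taken from the example marker lines above.
LEAD = set(range(0x81, 0xA0)) | set(range(0xE0, 0xFD))
TRAIL = set(range(0x40, 0x7F)) | set(range(0x80, 0xFD))
ILLEGAL = {0xFF}


def segment(data: bytes, lead=LEAD, trail=TRAIL, illegal=ILLEGAL) -> Iterator[Tuple[str, bytes]]:
    """Yield (label, bytes) units: 'sequence', 'incomplete', or 'illegal'."""
    i = 0
    while i < len(data):
        byte = data[i]
        if byte in illegal:
            yield "illegal", data[i:i + 1]
            i += 1
        elif byte in lead:
            if i + 1 >= len(data):
                # A lead byte at the end of the input may still be completed.
                yield "incomplete", data[i:i + 1]
                i += 1
            elif data[i + 1] in trail:
                yield "sequence", data[i:i + 2]
                i += 2
            else:
                # A lead byte not followed by a byte of the right form.
                yield "illegal", data[i:i + 1]
                i += 1
        else:
            yield "sequence", data[i:i + 1]
            i += 1
```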

Fallback Characters

In some cases, users have the option of using fallback characters, where a character that is not represented in the target code page is given a "best fit" mapping. For example, an encoding might not have curly quotes; the generic quotes can be used as a fallback. Any fallback mappings should be provided before the main mapping, because any later data line overrides any earlier one. For readability, the fallback section should be marked with comments, as below.

For example, the following indicates that 0x22 should be mapped to U+0022. When mapping back from Unicode, U+0022 is mapped to 0x22; in addition, if the fallback option is on, then U+201D and U+201C are also mapped to 0x22.

#FALLBACKS

0x22    0x201C     # LEFT DOUBLE QUOTATION MARK
0x22    0x201D     # RIGHT DOUBLE QUOTATION MARK

#MAIN MAPPINGS

...
0x22    0x0022     # QUOTATION MARK
...
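
One way to realize this override rule is sketched below, assuming the data lines have already been parsed into (encoding value, Unicode value) pairs in file order. The last line seen for an encoding value wins when mapping to Unicode; any Unicode value that no longer round-trips is kept as a fallback. The names `build_tables` and `encode` are hypothetical.

```python
def build_tables(pairs):
    """Return (to_unicode, round_trip, fallback) tables from ordered pairs."""
    to_unicode = {}
    from_unicode = {}
    for enc, uni in pairs:
        to_unicode[enc] = uni    # a later line overrides an earlier one
        from_unicode[uni] = enc
    # Main mappings are the ones that survive in to_unicode.
    round_trip = {uni: enc for enc, uni in to_unicode.items()}
    # Everything else is only used when the fallback option is on.
    fallback = {u: e for u, e in from_unicode.items() if u not in round_trip}
    return to_unicode, round_trip, fallback


def encode(code_point, round_trip, fallback, use_fallback=False):
    """Map a Unicode value to the encoding, or return None if unmapped."""
    if code_point in round_trip:
        return round_trip[code_point]
    if use_fallback:
        return fallback.get(code_point)
    return None


# The three data lines from the example above, in file order.
to_unicode, round_trip, fallback = build_tables(
    [(0x22, 0x201C), (0x22, 0x201D), (0x22, 0x0022)]
)
```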

Delta Mappings

In some cases, a mapping is only a minor variation of another mapping. When this is the case, it can be described without copying the entire contents of the file, by supplying just the source mapping and the changed lines. As it turns out, the overriding can be described very simply using an IMPORT style mechanism and the same overriding used for Fallbacks, using the following format.

#IMPORT    ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/CP852.TXT

0x1A 0x001C
0x1C 0x007F
0x7F 0x001A
0xAA #UNDEFINED
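
Applying such a delta could be sketched as follows. The sketch assumes the imported file has already been fetched and parsed into a dictionary; the base values shown are stand-ins for illustration, not the real contents of the imported file. `apply_delta` is a hypothetical name, and `None` is used here to represent #UNDEFINED.

```python
def apply_delta(base, delta):
    """Return a new table: the base mapping with the delta lines applied."""
    result = dict(base)
    for enc, uni in delta:
        if uni is None:
            result.pop(enc, None)  # #UNDEFINED removes an imported assignment
        else:
            result[enc] = uni      # a delta line overrides the imported line
    return result


# Stand-in base table (illustrative values only).
base = {0x1A: 0x001A, 0x1C: 0x001C, 0x7F: 0x007F, 0xAA: 0x00AC}

# The four delta lines from the example above.
delta = [(0x1A, 0x001C), (0x1C, 0x007F), (0x7F, 0x001A), (0xAA, None)]

table = apply_delta(base, delta)
```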