Re: Proposal for Unicode Character Mapping Format Description
From: Mark Davis
I had started writing this to document the format used for the mapping tables on the website. These formats are somewhat ad-hoc, but I tried to follow what existed on the website where possible. I did introduce a few new conventions that would allow the data to be fleshed out in a standard way in the future.
Draft, 10-15 MED
The Unicode website provides character mappings to and from Unicode for a number of different code pages for use in character code conversion (also called charset or code page transcoding). Some of these mappings are supplied by the Unicode Consortium; others are supplied directly by the vendors. This document describes the format for those files.
In all cases, comments start with '#', and continue to the end of the line. For historical reasons, a number of comment lines may actually have significant data; these are described below. Line ends may be CR, LF, or CRLF. Whitespace between items consists of one or more spaces or tabs in arbitrary combination. Whitespace between '#' and the following comment is optional.
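As a rough illustration of these lexical rules, the following Python sketch splits text into lines and separates each line into items and comment. The function names `read_lines` and `split_line` are hypothetical, and the sketch ignores the historical case where a comment carries significant data.

```python
import re

def read_lines(text: str) -> list[str]:
    """Split text into lines; a line end may be CRLF, CR, or LF."""
    # CRLF is tried first so that it is not counted as two line ends.
    return re.split(r"\r\n|\r|\n", text)

def split_line(line: str) -> tuple[list[str], str]:
    """Return (items, comment) for one line of a mapping file."""
    # A comment starts at the first '#' and runs to the end of the line.
    data, _, comment = line.partition("#")
    # Items are separated by one or more spaces or tabs in any combination.
    items = re.findall(r"[^ \t]+", data)
    # Whitespace between '#' and the comment text is optional, so strip it.
    return items, comment.strip()
```

For example, `split_line("0x6D \t 0x006D\t#LATIN SMALL LETTER M")` yields the two items `0x6D` and `0x006D` plus the comment text.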
This file describes the recommended format for the mapping files, and gives directions for creating new files. Most of the files on the Unicode site are supplied by vendors, and may not yet have been updated to this format.
Before discussing the format, it is important to review some necessary concepts.
Client software may need to distinguish the different types of mismatches that can occur when transcoding data to and from Unicode. These fall into the following categories:
Unassigned characters are treated as a single code point: for example, 0xA3BF is treated as a single code point when mapping into Unicode from CP950. The actual conversion routines will typically handle an unassigned value in a variety of ways (depending on the parameters passed in), such as:
Illegal values represent some corruption of the data stream. Conversion routines may be directed to handle this in a different way than by replacement characters. For example, a routine might map unassigned characters to a substitution character, but throw an exception on illegal values.
It is important that a mapping file be a complete description. From the data in the file, it should be possible to tell for any sequence of bytes whether that sequence is assigned, unassigned, incomplete, or illegal.
A mapping file begins with a comment header with the following information.
# Name: <name of the encoding>
# Description: <should be complete enough to distinguish from others>
# Ordering: <visual or logical for bidirectional scripts>
# Aliases: <common alternative names for the encoding>
# Unicode version: <Unicode version>
# Table version: <current table version>
# Date: <date: ideally YYYY-MM-DD>
# Contact: <email for maintainers>
# Change History:
# <comments on changes from previous to current>
#...
# <comments on changes from first to second>
# General notes:
# <optional material>
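A reader of such a header might collect the fields into a dictionary. The Python sketch below is one possible approach, not part of the format. It assumes that the header ends at the first line that does not start with '#', and that unlabeled comment lines continue the preceding field. The names `HEADER_FIELDS` and `parse_header` are hypothetical.

```python
# Field labels taken from the header template above.
HEADER_FIELDS = (
    "Name", "Description", "Ordering", "Aliases", "Unicode version",
    "Table version", "Date", "Contact", "Change History", "General notes",
)

def parse_header(lines: list[str]) -> dict[str, str]:
    """Collect '# Field: value' comment lines into a dictionary."""
    header: dict[str, str] = {}
    current = None
    for line in lines:
        if not line.startswith("#"):
            break  # assumption: the header ends at the first data line
        text = line[1:].strip()
        name, sep, value = text.partition(":")
        if sep and name.strip() in HEADER_FIELDS:
            current = name.strip()
            header[current] = value.strip()
        elif current is not None and text:
            # Unlabeled comment lines continue the most recent field.
            header[current] = (header[current] + "\n" + text).strip()
    return header
```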
The header is then followed by the data, which is in one of the following forms.
A basic data line has the following form:
<encoding values> <Unicode values> # <name comment>
Notes:
0x6D 0x006D # LATIN SMALL LETTER M
0x8591 0xF860,0x0030,0x002E # DIGIT ZERO FULL STOP
0x85,0x61 0x00C0 # LATIN CAPITAL LETTER A WITH GRAVE
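A basic data line can be parsed with a few lines of Python. This is only a sketch: `parse_values` and `parse_data_line` are hypothetical names, and the code keeps each comma-separated value as a plain integer without interpreting how a value such as 0x8591 relates to individual bytes.

```python
def parse_values(field: str) -> list[int]:
    """Parse a field such as '0x85,0x61' into a list of integers."""
    # int(..., 16) accepts the '0x' prefix used in the mapping files.
    return [int(part, 16) for part in field.split(",")]

def parse_data_line(line: str) -> tuple[list[int], list[int], str]:
    """Return (encoding values, Unicode values, name comment)."""
    data, _, comment = line.partition("#")
    fields = data.split()
    if len(fields) != 2:
        raise ValueError(f"expected two fields, got {len(fields)}: {line!r}")
    return parse_values(fields[0]), parse_values(fields[1]), comment.strip()
```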
Data can also be given as ranges, by separating values with "-". This is most useful for mapping user-defined characters to the private use zones. Range lines have the following format:
<encoding range> <Unicode range> # <comment>
The number of values on each side must match.
0xF040-0xF07E 0xE000-0xE03E # first user defined range
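The range form can be expanded into individual pairs as in the following Python sketch. The names `parse_range` and `expand_range_line` are hypothetical. The sketch assumes that a range simply counts upward as an integer, which is adequate when only the final byte varies, as in the example above.

```python
def parse_range(field: str) -> range:
    """Parse '0xF040-0xF07E' (or a single value) into an inclusive range."""
    start, sep, end = field.partition("-")
    lo = int(start, 16)
    hi = int(end, 16) if sep else lo
    if hi < lo:
        raise ValueError(f"descending range: {field!r}")
    return range(lo, hi + 1)

def expand_range_line(line: str) -> list[tuple[int, int]]:
    """Expand a range line into (encoding value, Unicode value) pairs."""
    data = line.partition("#")[0]
    enc_field, uni_field = data.split()
    enc, uni = parse_range(enc_field), parse_range(uni_field)
    # The number of values on each side must match.
    if len(enc) != len(uni):
        raise ValueError(f"range sizes differ: {len(enc)} vs {len(uni)}")
    return list(zip(enc, uni))
```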
Bytes that are only valid in combination with certain subsequent bytes are marked as:
0x81 #DBCS LEAD BYTE
The possible bytes that are valid following lead bytes are marked as:
0x40 #DBCS TRAIL BYTE
Range notation can also be used:
0x81-0x9F #DBCS LEAD BYTE
0xE0-0xFC #DBCS LEAD BYTE
0x40-0x7E #DBCS TRAIL BYTE
0x80-0xFC #DBCS TRAIL BYTE
Note that if 0xFC were not marked as a lead byte, then the sequence 0xFC,0x41 would be treated as two values: a single unassigned value, followed by the character 'A'.
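A minimal Python sketch of how the lead and trail byte markings could drive the segmentation of a byte stream. The names `byte_set`, `segment`, `LEAD`, and `TRAIL` are illustrative. A lead byte with no valid trail byte is simply emitted on its own here; classifying it as incomplete or illegal is deliberately left out.

```python
def byte_set(*ranges: tuple[int, int]) -> frozenset[int]:
    """Build a set of byte values from inclusive (low, high) ranges."""
    values: set[int] = set()
    for lo, hi in ranges:
        values.update(range(lo, hi + 1))
    return frozenset(values)

# The ranges from the example above.
LEAD = byte_set((0x81, 0x9F), (0xE0, 0xFC))
TRAIL = byte_set((0x40, 0x7E), (0x80, 0xFC))

def segment(data: bytes, lead=LEAD, trail=TRAIL) -> list[bytes]:
    """Split a byte string into one-byte and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        if data[i] in lead and i + 1 < len(data) and data[i + 1] in trail:
            units.append(data[i:i + 2])  # lead byte plus trail byte
            i += 2
        else:
            units.append(data[i:i + 1])  # single byte
            i += 1
    return units
```

With the sets above, the bytes 0xFC,0x41 form one two-byte unit. If 0xFC is dropped from the lead set in this sketch, the same bytes split into two single-byte units.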
Illegal values are marked as follows:
Unassigned characters usually do not need to be marked explicitly. The one case where this may be required is when using Delta descriptions, where a value that is assigned in the source mapping needs to be made unassigned.
In some cases, users have the option of using fallback characters, where a character that is not represented in the target code page is given a "best fit" mapping. For example, an encoding might not have curly quotes; the generic quotes can be used as a fallback. Any fallback mappings should be provided before the main mapping, because any later data line overrides any earlier one. For readability, the fallback section should be marked with comments, as below.
For example, the following indicates that 0x22 should be mapped to U+0022. When mapping back from Unicode, U+0022 is mapped to 0x22; in addition, if the fallback option is on, then U+201D and U+201C are also mapped to 0x22.
#FALLBACKS
0x22 0x201C # LEFT DOUBLE QUOTATION MARK
0x22 0x201D # RIGHT DOUBLE QUOTATION MARK
#MAIN MAPPINGS
...
0x22 0x0022 # QUOTATION MARK
...
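One way to read the override rule is sketched below in Python. Later lines replace earlier ones in both directions, and a Unicode value counts as a fallback when mapping it to the code page and back does not return the same value. This round-trip test and the names `build_tables` and `encode` are assumptions of the sketch, not something the format prescribes.

```python
def build_tables(pairs: list[tuple[int, int]]):
    """Build both direction tables; a later pair overrides an earlier one."""
    to_unicode: dict[int, int] = {}
    from_unicode: dict[int, int] = {}
    for enc, uni in pairs:
        to_unicode[enc] = uni
        from_unicode[uni] = enc
    return to_unicode, from_unicode

def encode(cp: int, to_unicode, from_unicode, use_fallbacks: bool):
    """Map a Unicode value to the code page, or return None if unmapped."""
    enc = from_unicode.get(cp)
    if enc is None:
        return None
    # Assumption: a mapping that does not round-trip is a fallback.
    if not use_fallbacks and to_unicode.get(enc) != cp:
        return None
    return enc

# The three data lines from the example, in file order.
pairs = [(0x22, 0x201C), (0x22, 0x201D), (0x22, 0x0022)]
to_u, from_u = build_tables(pairs)
```

Here `to_u[0x22]` is 0x0022 because the main mapping comes last, while U+201C and U+201D reach 0x22 only when `use_fallbacks` is true.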
In some cases, a mapping is only a minor variation of another mapping. When this is the case, the variation can be described without copying the entire contents of the file, by supplying just the source mapping and the changed lines. The overriding can be described very simply using an IMPORT-style mechanism and the same overriding rule used for fallbacks, in the following format.
#IMPORT ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/CP852.TXT
0x1A 0x001C
0x1C 0x007F
0x7F 0x001A
0xAA #UNDEFINED
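The delta mechanism can be sketched in Python as follows. The sketch does not fetch the imported file; the caller passes in the base table. The name `apply_delta` is hypothetical, and treating "#UNDEFINED" as "remove this value" is my reading of the example.

```python
def apply_delta(base: dict[int, int], delta_lines: list[str]) -> dict[int, int]:
    """Return a copy of the base table with the delta lines applied in order."""
    table = dict(base)
    for line in delta_lines:
        data, _, comment = line.partition("#")
        fields = data.split()
        if not fields:
            continue  # pure comment line, such as the #IMPORT line itself
        enc = int(fields[0], 16)
        if len(fields) == 1 and comment.strip().upper() == "UNDEFINED":
            table.pop(enc, None)              # make the value unassigned
        elif len(fields) == 2:
            table[enc] = int(fields[1], 16)   # a later line overrides the base
        else:
            raise ValueError(f"unrecognized delta line: {line!r}")
    return table

# Small stand-in for the imported table; illustrative values, not the real file.
base = {0x1A: 0x001A, 0x1C: 0x001C, 0x7F: 0x007F, 0xAA: 0x00AC, 0x41: 0x0041}
delta = [
    "#IMPORT ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/CP852.TXT",
    "0x1A 0x001C",
    "0x1C 0x007F",
    "0x7F 0x001A",
    "0xAA #UNDEFINED",
]
result = apply_delta(base, delta)
```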