Proposed Draft Unicode Technical Report #22

Character Mapping Tables

Revision	1.0
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	1999-11-23
This Version	http://www.unicode.org/unicode/reports/tr22/tr22-1
Previous Version	n/a
Latest Version	http://www.unicode.org/unicode/reports/tr22/

Summary

This document specifies an XML format for the interchange of data for character encodings. It provides a complete description for such encodings in terms of a defined mapping to and from Unicode.

Status

This document contains material which has been considered and approved by the Unicode Technical Committee for publication as a Proposed Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative material or as normative specification. Please mail corrigenda and other comments to the author.

The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions for more information.

1 Introduction
2 XML Format
3 Samples
- 3.1 Full Sample
- 3.2 UTF-8 Sample
  - 3.2.1 Partial Validity Checks
  - 3.2.2 Full Validity Checks

1 Introduction

The ability to seamlessly handle multiple character encodings is crucial in today's world, where a server may need to handle many different client character encodings covering many different markets. No matter how characters are represented, servers need to be able to process them appropriately. Unicode provides a common model and representation of characters for all the languages of the world. Because of this, Unicode is being adopted by more and more systems as the internal storage processing code. Rather than trying to maintain data in literally hundreds of different encodings, a program can translate the source data into Unicode on entry, process it as required, and translate it into a target character set on request.

Even where Unicode is not used as a process code, it is often used as a pivot encoding. Rather than requiring ten thousand tables to map each of a hundred character encodings to one another, data can be transcoded first to Unicode and then into the eventual target encoding. This requires only a hundred tables, rather than ten thousand.

Whether or not Unicode is used, it is ever more vital to maintain the consistency of data across conversions between different character encodings. Because of the fluidity of data in a networked world, it is easy for it to be converted from, say, CP930 on a Windows platform, sent to a UNIX server as UTF-8, processed, and converted back to CP930 for representation on another client machine. This requires implementations to have identical mappings for different character encodings, no matter what platform they are working on. It also requires them to use the same name for the same encoding, and different names for different encodings. This is difficult to do unless there is a standard specification for the mappings so that it can be precisely determined what the encoding maps to.

This technical report provides such a standard specification for the interchange of mapping data that defines a character encoding. By using this specification, implementations can be assured of providing precisely the same mappings as other implementations on different platforms.

This report is in the initial stages of development; feedback is welcome.

1.1 Illegal and Unassigned Codes

Client software needs to distinguish the different types of mismatches that can occur when transcoding data between different character encodings. These fall into the following categories:

The sequence is illegal. There are two variants of this.
First is where the sequence is incomplete. For example,
- 0xA3 is incomplete in CP950.
  - Unless followed by another byte of the right form, it is illegal.
- 0xD800 is incomplete in Unicode.
  - Unless followed by another value of the right form, it is illegal.
- 0xDC00 is incomplete in Unicode.
  - Unless preceded by another value of the right form, it is illegal.
The second variant is where the sequence is complete, but explicitly illegal. For example,
- 0xFFFF is illegal in Unicode. This value can never occur in valid Unicode text, and will never be assigned.
The source sequence represents a valid code point, but is unassigned (aka undefined). This sequence may be given an assignment in some future (evolved) version of the character encoding.
For example,
- 0xA3 0xBF is unassigned in CP950, as of 1999.
- 0x0EDE is unassigned in Unicode, V3.0
The source sequence is assigned, but unmappable: there is no corresponding code point in the target encoding to accurately represent the source sequence.
For example,
- The long dash is assigned in Unicode, but cannot be mapped to ISO-8859-1.

In the case of illegal source sequences, a conversion routine will typically provide the following options:

stop (or throw an exception)
- in particular, stopping is commonly used by higher level character encoding schemes, such as ISO 2022 conversions, to know when to stop converting into one encoding and pick another to convert to.
- in either case, the information as to length of the bad sequence should be available and the conversion should be resumable (after the caller handles the bad sequence).
skip the source sequence
- while this is commonly an option, it can also hide corruption problems in the source text.
map to a substitution character
- such as the Unicode U+FFFD REPLACEMENT CHARACTER.

Note: There is an important difference between the case where a sequence represents a real REPLACEMENT CHARACTER in a legacy encoding, as opposed to just being unassigned, and thereby mapped to REPLACEMENT CHARACTER (using an API substitution option).
Note: An API may choose to signal an illegal sequence in a legacy character set by mapping it to one of the explicit NOT A CHARACTER code points in Unicode (any of the form xxFFFE or xxFFFF). However, this mechanism runs the risk of these values being transmitted in Unicode text (which is non-conformant), and should be used with caution.
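
As an illustration only, the three options for illegal sequences correspond closely to the standard error handlers of Python's built-in UTF-8 decoder. The policy names "stop", "skip" and "substitute" and the helper decode_with_policy are assumptions made for this sketch; they are not defined by this report.

```python
# 0xFF can never appear in well-formed UTF-8, so this input is illegal.
data = b"ab\xffcd"

def decode_with_policy(raw: bytes, policy: str) -> str:
    """Decode UTF-8 using one of three illegal-sequence policies.

    The policy names are hypothetical labels chosen for this sketch.
    """
    handlers = {"stop": "strict", "skip": "ignore", "substitute": "replace"}
    return raw.decode("utf-8", errors=handlers[policy])

try:
    decode_with_policy(data, "stop")
except UnicodeDecodeError as err:
    # err.start and err.end delimit the bad sequence, so the caller can
    # handle it and then resume the conversion at err.end.
    bad_span = (err.start, err.end)
    resumed = data[:err.start].decode("utf-8") + data[err.end:].decode("utf-8")

skipped = decode_with_policy(data, "skip")            # bad byte silently dropped
substituted = decode_with_policy(data, "substitute")  # bad byte becomes U+FFFD
```

Note that the "skip" result is indistinguishable from clean input, which is exactly how skipping can hide corruption in the source text.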

Unassigned sequences can be handled with any of the above options, plus some additional ones. They should always be treated as a single code point: for example, 0xA3BF is treated as a single code point when mapping into Unicode from CP950. Especially because an unassigned character may actually come from a more recent version of the character encoding, it is often important to preserve round-trip mappings if possible. This can be done with additional options:

map to private use space
- Unicode (and some other character encodings) provides a large area of Private Use characters. These can be used to provide round-trip mappings for private use characters from other character encodings, as well as provisional mappings for characters that have not yet been encoded in Unicode.
represent by a hex escape sequence
- for example, when mapping from U+1234 to other code pages, it can be represented by "%12%34" in URLs, "&#x1234;" in XML or HTML, "\u1234" in Java, C99 or C++, or "\x{1234}" in Perl.
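
A minimal sketch of the escape option, using Python's standard encoder error handlers. The helper hex_char_ref is a hypothetical name introduced here; note that Python's own "xmlcharrefreplace" handler emits decimal rather than hexadecimal references.

```python
text = "x\u1234y"  # U+1234 has no mapping in ASCII

# Numeric character reference as used in XML or HTML (decimal form).
xml_decimal = text.encode("ascii", errors="xmlcharrefreplace")

# Backslash escape in the style of Java, C99 or C++.
backslash = text.encode("ascii", errors="backslashreplace")

def hex_char_ref(s: str) -> str:
    """Replace every non-ASCII character by a hexadecimal reference."""
    return "".join(c if ord(c) < 0x80 else f"&#x{ord(c):X};" for c in s)

xml_hex = hex_char_ref(text)
```

Because each escape carries the full code point, the receiving side can restore the original character, which a substitution character cannot offer.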

For unmappable sequences, all of the above options and one additional option may be available:

map to a fallback character sequence
- In this case, an unmappable sequence is given a "best fit" mapping. For example, an encoding might not have curly quotes; the generic quotes can be used as a fallback; or if em dash is unmappable, a sequence of three HYPHEN-MINUS characters can be used as a fallback.

It is very important that systems be able to distinguish between the fallback mappings and regular mappings. Systems like XML require the use of hex escape sequences to preserve round-trip integrity; use of fallback characters in that case corrupts the data.
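
One way to keep the two kinds of mapping apart is to store them in separate tables and let the caller opt in to fallbacks. The following is a sketch under that assumption; the table contents and the names ROUND_TRIP, FALLBACK, encode_char and encode_for_xml are hypothetical and serve only as illustration.

```python
from typing import Dict, Optional

# Hypothetical tables for a tiny ASCII-like target encoding.
ROUND_TRIP: Dict[str, bytes] = {"A": b"\x41", '"': b"\x22", "-": b"\x2d"}
FALLBACK: Dict[str, bytes] = {
    "\u201c": b"\x22",          # LEFT DOUBLE QUOTATION MARK -> generic quote
    "\u201d": b"\x22",          # RIGHT DOUBLE QUOTATION MARK -> generic quote
    "\u2014": b"\x2d\x2d\x2d",  # EM DASH -> three HYPHEN-MINUS characters
}

def encode_char(ch: str, allow_fallback: bool) -> Optional[bytes]:
    """Return the target bytes, or None if ch is unmappable under the policy."""
    if ch in ROUND_TRIP:
        return ROUND_TRIP[ch]
    if allow_fallback:
        return FALLBACK.get(ch)
    return None

def encode_for_xml(text: str) -> bytes:
    """Round-trip-safe encoding: unmappable characters become hex escapes."""
    out = bytearray()
    for ch in text:
        mapped = encode_char(ch, allow_fallback=False)
        out += mapped if mapped is not None else f"&#x{ord(ch):X};".encode("ascii")
    return bytes(out)
```

With this split, a caller that needs round-trip integrity never sees a fallback by accident, while a "best fit" caller can still request one.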

Because illegal values represent some corruption of the data stream, conversion routines may be directed to handle them in a different way than by replacement characters. For example, a routine might map unassigned characters to a substitution character, but throw an exception on illegal values.

1.2 Completeness

It is important that a mapping file be a complete description. From the data in the file, it should be possible to tell for any sequence of bytes whether that sequence is assigned, unassigned, or illegal. It should also be possible to tell if characters need to be rearranged to be in Unicode standard order (logical order, combining marks after base forms, etc.).

Unless otherwise indicated in the data file, any sequences of bytes that are not mentioned are assumed to be unassigned.
All control values (C0 controls, DELETE, and C1 controls) should be explicitly mapped.
All private use (e.g. user defined) characters should be explicitly mapped, either to the private use zone in Unicode, or to the correct characters outside of that zone.
Only a real replacement character should be mapped explicitly to REPLACEMENT CHAR; unassigned characters should not be mapped explicitly to it. Similarly, when mapping back from Unicode, only the REPLACEMENT CHAR should map to SUBSTITUTE or other legacy equivalent.
Incomplete sequences and other illegal sequences should be explicitly indicated.
All fallback mappings must be clearly indicated. This is especially important for modern software that guarantees round-trip conversion to and from Unicode.

1.3 Canonical Equivalence

The Unicode Standard has two equivalent ways of representing composite characters such as â. The standard provides for two normalized formats that provide for unique representations of data in UTR #15: Unicode Normalization Forms. The standard format for character encoding specification itself is to map to sequences of Unicode characters in Normalization Form C. However, this does not guarantee that the result of transcoding into Unicode will be normalized, since individual characters in the source encoding may separately map to an unnormalized sequence.

For example, suppose the source encoding maps 0x83 to 0x030A in Unicode (combining ring above), and 0x61 to 0x0061 (a). Then the sequence <0x61,0x83> will map to <0x0061,0x030A> in Unicode, which is not in Normalization Form C.

This problem will only arise when the source encoding has separate characters that, in the proper context, would not be present in normalized text. If a process wishes to guarantee that the result is in a particular Unicode normalization form, then it should normalize after transcoding. Information is provided below that can determine whether this step is required.
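
The example above can be reproduced with Python's standard unicodedata module. The one-byte table and the transcode helper below are hypothetical stand-ins for a real mapping; only unicodedata.normalize is an existing library call.

```python
import unicodedata

# Hypothetical single-byte source encoding from the example above:
# 0x61 -> U+0061 (a), 0x83 -> U+030A (COMBINING RING ABOVE).
TABLE = {0x61: "\u0061", 0x83: "\u030A"}

def transcode(raw: bytes) -> str:
    """Map each source byte to its Unicode character, one at a time."""
    return "".join(TABLE[b] for b in raw)

result = transcode(b"\x61\x83")                     # <0x0061,0x030A>
normalized = unicodedata.normalize("NFC", result)   # single code point U+00E5
needs_normalizing = normalized != result
```

Each byte maps to a character that is fine on its own, yet the transcoded sequence only reaches Normalization Form C after the explicit normalization step.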

2 XML Format

A character mapping specification file starts with the following lines. There is a difference between the encoding of the XML file, and the encoding of the mapping data. The encoding of the file can be any valid XML encoding. Only the ASCII repertoire of characters is required in the specification of the mapping data, but comments may be in other character encodings. The example below happens to use UTF-8.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE characterMapping
  SYSTEM "http://www.unicode.org/unicode/reports/tr22/CharacterMapping.dtd">

Note: In the rest of this specification, short attribute and element names are used just to conserve space where there may be a large number of items, or for consistency with other elements that may have a large number of items.

2.1 Header

A mapping file begins with a comment header. Here is an (artificial) example:

<characterMapping
 name="CP938"
 description="Sun variant of CP942 for Japanese"
 unicodeVersion="3.0"
 tableVersion="2"
 contact="mark@unicode.org"
 registrationAuthority="IBM"
 registrationName="CP935"
 copyright="SomeCompany"
 bidiOrder="logical"
 combiningOrder="after"
 normalization="C"
>

The characterMapping element is the root. It contains a number of required attributes:

name is a string which uniquely distinguishes this mapping table from all others. Where possible this should be the preferred name from the IANA character set registry. This string must be limited to the minimal character repertoire of ASCII letters, digits, plus '-', '.', ':', and '_'. The name value is not case-sensitive. It must be unique; if two mapping tables differ in how they map any characters, in the specification of illegal characters, in their bidi ordering, in their combining character ordering, etc., then they must have different names (or different versions: see below).

description is a string which describes the mapping enough to distinguish it from other similar mappings. This string must be limited to the Unicode range 0x0020 - 0x007E and should be in English. The string normally contains the set of mappings, the script, language, or locale for which it is intended, and optionally the variation. For instance, "Windows Japanese JIS-1990", "EBCDIC Latin 1 with Euro", "PC Greek".

unicodeVersion is the earliest version of the Unicode standard (2.0 or later) that contains all of the characters mapped to. That is, most of the ISO 8859 series will use Unicode 2.0; the new ones with Euro will use Unicode 2.1.

tableVersion is the version of the data, a small integer normally starting at one. Any time the data is modified, the value must be increased. If only additions are made, then the same name can be retained; if not, then a new name must be used. Additions change mappings from "unassigned" to "assigned". Any change in the validity of character sequences requires a new name.

contact is the person to contact in case errors are found in the data. This must be an e-mail address or URL.

registrationAuthority is the organization responsible for the encoding.

registrationName is a string that provides the name and version of the mapping, as known to that authority.

copyright provides the copyright information. While this can be provided in comments, use of a field allows copyright propagation when converting to a binary form of the table. (Typically the right to use the information is granted, but not the right to erase the copyright or pretend that you created the information.)

bidiOrder specifies whether the character encoding is to be interpreted in one of two orders: "visual" or "logical". Unicode is strictly logical order. Application of the Unicode Bidirectional Algorithm is required to map to a visual-order character encoding; application of a reverse bidirectional algorithm is required to map back to Unicode. The default value for this attribute is "logical". It is only relevant for character encodings for the Middle East (Arabic and Hebrew). For more information, see UTR #9: The Bidirectional Algorithm.

combiningOrder specifies the order of combining marks: either "before" or "after". Some character encodings, typically those for bibliographic use, store combining marks before base characters. Unicode stores them uniformly after base characters. The default value for this attribute is "after". This is only relevant for character encodings with combining marks.

normalization specifies whether the result of transcoding into Unicode using this mapping will be automatically in Normalization Form C or D. The possible values are "neither", "C", "D", "CD". While this information can be derived from an analysis of the assignment statements (see UTR #15: Unicode Normalization Forms), providing the information in the header is a useful validity check. Most mapping specifications will have the value "C". Character encodings that contain neither composite characters nor combining marks (such as 7-bit ASCII) will have the value "CD".
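
A reader of the header might apply the stated constraints roughly as follows. This is a sketch using Python's standard xml.etree.ElementTree; the function read_header and the extra "nameKey" entry are assumptions of this example, and the sample is written as an empty element so that it parses on its own.

```python
import re
import xml.etree.ElementTree as ET

HEADER = """<characterMapping name="CP938"
 description="Sun variant of CP942 for Japanese"
 unicodeVersion="3.0" tableVersion="2" contact="mark@unicode.org"
 registrationAuthority="IBM" registrationName="CP935"
 copyright="SomeCompany"/>"""

NAME_RE = re.compile(r"[A-Za-z0-9\-.:_]+")
# Defaults stated in the attribute descriptions above.
DEFAULTS = {"bidiOrder": "logical", "combiningOrder": "after"}

def read_header(xml_text: str) -> dict:
    """Return the header attributes with defaults filled in."""
    root = ET.fromstring(xml_text)
    if root.tag != "characterMapping":
        raise ValueError("root element must be characterMapping")
    attrs = {**DEFAULTS, **root.attrib}
    name = attrs.get("name", "")
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid name: {name!r}")
    # Names are not case-sensitive, so keep a folded key for comparisons.
    attrs["nameKey"] = name.lower()
    if not all(0x20 <= ord(c) <= 0x7E for c in attrs.get("description", "")):
        raise ValueError("description must stay within 0x0020 - 0x007E")
    return attrs

header = read_header(HEADER)
```

Comparing tables by the folded key rather than by the raw name keeps "CP938" and "cp938" from being treated as two different mappings.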

2.2 History

 <history supercedes="CP501" derivedFrom="CP500">
  <modified version="2" date="1999-09-25">
   Added Euro.
  </modified>
  <modified version="1" date="1997-01-01">
   Made out of whole cloth for illustration.
  </modified>
 </history>

history provides information about the changes to the file and relations to other encodings. This is a required element.

modified provides information about the changes to the file, coordinated with the version. The latest version should be first.

supercedes is an optional attribute that indicates a relation to another encoding. This encoding supersedes another when all of the assigned values in the other mapping are contained in this one, and this one has additional assigned values.

derivedFrom is an optional attribute that indicates a relation to another encoding. This encoding derives from another encoding when it was formed by replacing some of the assigned values by different assignments. For instance, Cp1148 is derived from Cp500 with Euro substituted for a currency sign.

2.3 Names

 <aliases>
  <!--List of aliases, such as IANA names-->
  <n n="MS983"/>
  <n n="SJIS"/>
 </aliases>

 <displayNames>
  <!--List of display names for this encoding in different languages-->
  <d xml:lang="en" n="Sun Chinese"/>
  <d xml:lang="fr-BE" n="Sun Chinoise"/>
 </displayNames>

The aliases element provides a list of possible aliases for this code page. It is optional. The names may not be unique, because of the history behind the development of character encoding names. The n attribute is used to supply the name. Aliases should only be provided where the character encoding mappings are known to match this table precisely. Related mappings can be included in the history element.

The displayNames element is optional, but strongly recommended. It provides user-level names that can be presented in menus, such as in Netscape Navigator View>Character Set or the Microsoft Internet Explorer View>Encoding. The individual names are supplied with the d elements. The xml:lang attribute supplies the locale in the format. The n attribute supplies the name. Both attributes are required.

2.4 Imports

 <import source="ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/CP852.XML"/>

It is possible to supply just the differences between one table and a base table. This is done with the import element, which is optional. If this is used, then any further data simply overrides the data in the base table. The value of the source attribute is a valid URL pointing to a valid character encoding table.

2.5 Validity Specification

As discussed above, it is important to be able to distinguish when characters are unassigned vs. when they are invalid. Valid and invalid sequences are specified by the validity element. Here is an example of what this might look like:

 <validity>
  <!--Validity specification for SJIS-->
  <illegal s="FD" e="FF"/>

  <legal s="81" e="9F" next="second" />  
  <legal s="E0" e="FC" next="second" />

  <legal type="second" s="40" e="7E"/>
  <legal type="second" s="80" e="FC"/>
 </validity>

The subelements are legal or illegal. Their attributes are:

type the type of the given bytes; the default = "start".
s the start of the byte range
e the end of the byte range; the default is "-1".
- A value of "-1" is interpreted as being the same as the value for s (thus is a range with one single value).
next the type that the following bytes are required to be in; the default is "end", which indicates completion

All values referring to code units are in hexadecimal. If we look at the above table, the first line tells us that the single bytes FD through FF are illegal. The next two lines say that the bytes in the ranges 81 through 9F and E0 through FC are legal, if they are followed by a byte of type="second". More detailed samples of complex validity specifications are given in Samples.

If any bytes are not explicitly set for type="start", then they are assumed to be legal with next="end". Thus most single-byte encodings do not need validity elements. Any string can be used for the value of type or next, as long as it is not subject to an error condition.
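
The validity element can be read as a small state machine in which type names the current state and next names the following one. The sketch below interprets the SJIS example that way; load_rules and check_sequence are hypothetical helpers, and the treatment of trailing bytes after a complete sequence is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

VALIDITY_XML = """<validity>
  <illegal s="FD" e="FF"/>
  <legal s="81" e="9F" next="second"/>
  <legal s="E0" e="FC" next="second"/>
  <legal type="second" s="40" e="7E"/>
  <legal type="second" s="80" e="FC"/>
</validity>"""

def load_rules(xml_text):
    """Return {type: [(start, end, is_legal, next_type), ...]}."""
    rules = {}
    for el in ET.fromstring(xml_text):
        s = int(el.get("s"), 16)
        e_attr = el.get("e", "-1")
        e = s if e_attr == "-1" else int(e_attr, 16)  # "-1" means same as s
        rules.setdefault(el.get("type", "start"), []).append(
            (s, e, el.tag == "legal", el.get("next", "end")))
    return rules

def check_sequence(rules, raw):
    """Classify one candidate character: 'legal', 'illegal' or 'incomplete'."""
    state = "start"
    for byte in raw:
        if state == "end":
            return "illegal"  # extra bytes after a complete sequence
        match = next((r for r in rules.get(state, [])
                      if r[0] <= byte <= r[1]), None)
        if match is None:
            # Unlisted start bytes default to legal single bytes; an
            # unlisted byte in any other state breaks the sequence.
            if state != "start":
                return "illegal"
            state = "end"
        elif not match[2]:
            return "illegal"
        else:
            state = match[3]
    return "legal" if state == "end" else "incomplete"

rules = load_rules(VALIDITY_XML)
```

For instance, check_sequence(rules, b"\x81") reports an incomplete sequence, since 81 requires a following byte of type "second".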

Error Conditions

Two lines conflict if they assign the same byte to a different type, or the same byte to a different next, or if one line makes the byte legal and the other makes it illegal.
In the case of conflicts, if one of the conflicting lines is from an import then the byte ranges are adjusted to exclude the conflicting bytes (possibly generating multiple lines). Otherwise the file is invalid.
If there is a type value (other than "start") with no matching next value in another line, the line is incomplete.
If there is a next value (other than "end") with no matching type value in another line, the line is incomplete.
If an incomplete line is from an import then it is disregarded. Otherwise the file is invalid.
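
The two incompleteness rules can be checked mechanically. In the following sketch each validity line is reduced to a dict of its optional type and next attributes; that representation and the function name incomplete_lines are assumptions made for illustration.

```python
def incomplete_lines(lines):
    """Return indexes of validity lines whose type or next is dangling.

    Each line is a dict holding the optional 'type' and 'next' attributes
    (a representation chosen for this sketch).
    """
    bad = []
    for i, line in enumerate(lines):
        others = lines[:i] + lines[i + 1:]
        line_type = line.get("type", "start")
        line_next = line.get("next", "end")
        # A non-start type needs a matching next value in another line.
        type_ok = line_type == "start" or any(
            o.get("next", "end") == line_type for o in others)
        # A non-end next needs a matching type value in another line.
        next_ok = line_next == "end" or any(
            o.get("type", "start") == line_next for o in others)
        if not (type_ok and next_ok):
            bad.append(i)
    return bad

SJIS_LINES = [
    {},                  # <illegal s="FD" e="FF"/>
    {"next": "second"},  # <legal s="81" e="9F" next="second"/>
    {"next": "second"},  # <legal s="E0" e="FC" next="second"/>
    {"type": "second"},  # <legal type="second" s="40" e="7E"/>
    {"type": "second"},  # <legal type="second" s="80" e="FC"/>
]
```

A loader could disregard the reported lines when they come from an import and reject the file otherwise.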

2.6 Assignments

The main part of the table provides the assignments of mappings between byte sequences and Unicode characters. Here is an example:

 <assignments sub="A3">

  <!--Unassignments-->
  <a b="AA"/>
  <a b="AB"/>

  <!--Fallbacks-->
  <a b="22" u="201C" f="u" n="LEFT DOUBLE QUOTATION MARK"/>
  <a b="22" u="201D" f="u" n="RIGHT DOUBLE QUOTATION MARK"/>

  <!--Main mappings-->

  <!--Map ASCII to the same range-->
  <a b="00" u="0000" c="7F"/> <!--maps 00..7F to 0000..007F-->

  <!--Map user-defined area to private use-->
  <a b="F040" u="E000" c="3E"/> <!--maps F040..F07E to E000..E03E-->

  <!--Map other characters specifically-->
  <a b="A1" u="FF61" n="HALFWIDTH IDEOGRAPHIC FULL STOP"/>
  <a b="A2" u="FF62" n="HALFWIDTH LEFT CORNER BRACKET"/>
  <a b="8156" u="3003" n="DITTO MARK"/>
  <a b="8157" u="4EDD"/>
  <a b="8158" u="3005" n="IDEOGRAPHIC ITERATION MARK"/>
  <a b="8159" u="3006" n="IDEOGRAPHIC CLOSING MARK"/>
  <a b="815A" u="3007" n="IDEOGRAPHIC NUMBER ZERO"/>
  <a b="815B" u="30FC" n="KATAKANA-HIRAGANA PROLONGED SOUND MARK"/>
  <a b="815C" u="2015" n="HORIZONTAL BAR"/>
  
 </assignments>

sub is an attribute that specifies the replacement character used in the legacy character encoding. (U+FFFD REPLACEMENT CHARACTER is used in Unicode.) The value is a sequence of bytes, as described under b below. The default is the ASCII control value SUB = "1A".

The element a specifies a mapping from byte sequences to Unicode and back. It has the following attributes:

b is a sequence of bytes. Each byte is always written as 2 hex digits, unsigned. Multiple bytes should be separated by spaces, but the spaces can also be omitted.
u is a sequence of Unicode code points. One or more hex digits, unsigned. Multiple values must be separated by spaces.
- If this attribute is missing, that signals an unassignment: that the byte sequence is unmapped. Normally this is not needed, since any sequence that is not explicitly assigned is assumed to be unassigned. One exception is if there is an import statement, where it may be necessary to override specific mappings. Another exception is to remove a mapping in an alternate mapping, as described below.
c is a count. This can be used to map a range of values in one statement. This is a shorthand for a series of mappings--especially useful for private use zone assignments. All byte sequences in the range between b and b+c are mapped to the range from u to u+c. Be careful not to include illegal sequences in the range. This is present purely to reduce the number of lines in the file; in every respect (such as in error conditions), it should be treated as if it were simply an expanded series of elements.
f indicates a fallback mapping for unmappable characters, to be used if an API requests a "best effort". The values are "b" and "u".
- The value "b" indicates that the fallback is used when mapping from byte sequences to Unicode sequences.
- The value "u" indicates that the fallback is used when mapping from Unicode sequences to byte sequences.
n is a string to be used to identify the character to readers. This may or may not be equal to the Unicode name. A vendor may wish to put the vendor description or other useful information in this field. This is optional, but useful for reading the file and performing consistency checks. In the above example it is omitted for the CJK ideograph since the name adds little information.
x marks an alternative mapping. The value is a label on assignment lines used to add alternate mappings to the same file. Multiple labels can be attached to the same line, using spaces between the labels. The choice of names is arbitrary, except in so far as they might be used in an API to specify a variant of the character encoding. If they are specified, then they override the normal mappings to and from Unicode. Alternate mappings are bidirectional, and may also unassign regular mappings. Values are:
- "path" indicates that the mapping is for pathnames.
  - For example, maps 5C to 005C "\" instead of 00A5 "¥".
- "graphics1" indicates that C0 control bytes in the source set are interpreted as graphic characters for old PC sets.
  - For example, maps 10 to 25BA BLACK RIGHT-POINTING POINTER.
- "graphics2" indicates that control bytes in the source set, except for CR and LF are interpreted as graphic characters for old PC sets.
  - For example, maps 10 to 25BA BLACK RIGHT-POINTING POINTER.
- "cdra" indicates that control bytes in the source set are interpreted according to IBM CDRA alternate mappings.
  - For example, maps 7F to 1A SUB.
- Alternative mappings that do not need to be specifically declared are "CR", "LF", and "CRLF". If these are included in a specification of the table, they control the mapping to or from these values. For example, if "CR" is specified when converting from Unicode, then any instance of 0x000A or of the sequence <0x000D,0x000A> is automatically mapped to the same byte sequence as 0x000D is. For more information, see UTR #13: Unicode Newline Guidelines.
For example, using "Cp932/path/LF" specifies that Cp932 is to be used, but with a backslash instead of a yen sign, and that CRLF and CR are to be converted to LF.
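
A sketch of how a loader might expand assignment lines into lookup tables, covering the c and f attributes. Two points are assumptions of this sketch rather than statements of the format: that c is written in hexadecimal like b and u, and that b+c treats a multi-byte sequence as a single big-endian number. The function name build_tables is hypothetical, and alternate mappings (x) are ignored.

```python
import xml.etree.ElementTree as ET

ASSIGNMENTS_XML = """<assignments sub="A3">
  <a b="AA"/>
  <a b="22" u="201C" f="u"/>
  <a b="00" u="0000" c="7F"/>
  <a b="F040" u="E000" c="3E"/>
  <a b="8157" u="4EDD"/>
</assignments>"""

def build_tables(xml_text):
    """Expand assignment lines into four lookup tables.

    Returns (to_unicode, from_unicode, fallback_to_unicode,
    fallback_from_unicode).
    """
    to_u, from_u, fb_to_u, fb_from_u = {}, {}, {}, {}
    for el in ET.fromstring(xml_text).iter("a"):
        if el.get("u") is None:
            continue  # unassignment: the byte sequence stays unmapped
        first = bytes.fromhex(el.get("b").replace(" ", ""))
        points = [int(p, 16) for p in el.get("u").split()]
        count = int(el.get("c", "0"), 16)  # assumed hexadecimal
        if count and len(points) != 1:
            raise ValueError("this sketch only expands single code points")
        for i in range(count + 1):
            # Assumption: b+i treats the bytes as one big-endian integer.
            raw = (int.from_bytes(first, "big") + i).to_bytes(len(first), "big")
            text = "".join(chr(p + i) for p in points)
            fallback = el.get("f")
            if fallback == "u":
                fb_from_u[text] = raw   # used only from Unicode to bytes
            elif fallback == "b":
                fb_to_u[raw] = text     # used only from bytes to Unicode
            else:
                to_u[raw] = text
                from_u[text] = raw
    return to_u, from_u, fb_to_u, fb_from_u

to_u, from_u, fb_to_u, fb_from_u = build_tables(ASSIGNMENTS_XML)
```

Keeping the fallback entries in their own tables means an API can offer a "best effort" conversion without weakening the round-trip tables.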

Error Conditions

All byte sequences must be valid according to the validity specification. Otherwise the file is invalid.
All Unicode sequences must be legal, assigned Unicode code points. Otherwise the file is invalid.
- The illegal code points are: out-of-range values (greater than 10FFFF), unpaired surrogate values (D800 to DFFF), and non-character values (of the form xxFFFF or xxFFFE).
- Sequences cannot map to code points that are unassigned in the latest version of the Unicode Standard. If there are valid characters in the legacy encoding that are not in Unicode yet, they must be mapped to private use characters.
Two assignment lines conflict if they have either the same byte sequence or the same Unicode sequence.
A fallback line with the value "u" conflicts with any other assignment line or f="u" line that has the same Unicode sequence.
A fallback line with the value "b" conflicts with any other assignment line or f="b" line that has the same byte sequence.
In the case of conflicts, if one of the conflicting lines is from an import then it is disregarded. If the conflicting lines have different x (alternate) values, that does not cause a problem. Otherwise the file is invalid.
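
The legality test for individual Unicode code points is simple enough to sketch directly; surrogate code points occupy D800 through DFFF. The function name is hypothetical, and the check for assigned characters is deliberately left out because it depends on the version of the Unicode Standard in use.

```python
def is_legal_code_point(cp: int) -> bool:
    """True unless cp is out of range, a surrogate, or a non-character
    of the form xxFFFE or xxFFFF."""
    if not 0 <= cp <= 0x10FFFF:
        return False                 # out of range
    if 0xD800 <= cp <= 0xDFFF:
        return False                 # surrogate code point
    if (cp & 0xFFFE) == 0xFFFE:
        return False                 # xxFFFE or xxFFFF
    return True
```

A table validator would apply this to every value in each u attribute before going on to the conflict checks.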

Open Issue: if we required that all valid byte sequences be either explicitly assigned or unassigned, then that would provide more error checking for the file, at the expense of having to specify extra unassignments.

3 Samples

The following provide samples that illustrate features of the format.

3.1 Full Sample

A full example is on CharacterMapping.xml. It is not a real example since it tries to show all of the features in one file, whereas in real life only a subset would be used. You can view it directly with Internet Explorer, which will interpret the XML. The DTD is on CharacterMapping.dtd (if you are looking at this in a browser, choose the View Source menu item).

3.2 UTF-8 Sample

While in practice a mapping file is never required for UTF-8, since it is algorithmically derived, it is instructive to see the use of the validity element in a complicated example. As a reminder, here are the valid ranges for UTF-8:

**Figure 2: UTF-8 Boundaries**
Unicode Code Points	UTF-8 Code Units
00	00
7F	7F
80	C2 80
7FF	DF BF
800	E0 A0 80
FFFF	EF BF BF
010000	F0 90 80 80
10FFFF	F4 8F BF BF
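
The boundary values in Figure 2 can be cross-checked against any conformant UTF-8 encoder. The sketch below uses Python's built-in one; the names BOUNDARIES and utf8_hex are introduced only for this check.

```python
# (code point, expected UTF-8 code units) pairs taken from Figure 2.
BOUNDARIES = [
    (0x00, "00"), (0x7F, "7F"),
    (0x80, "C2 80"), (0x7FF, "DF BF"),
    (0x800, "E0 A0 80"), (0xFFFF, "EF BF BF"),
    (0x10000, "F0 90 80 80"), (0x10FFFF, "F4 8F BF BF"),
]

def utf8_hex(cp: int) -> str:
    """Encode one code point as UTF-8 and format the bytes as hex."""
    return " ".join(f"{b:02X}" for b in chr(cp).encode("utf-8"))

mismatches = [(cp, exp) for cp, exp in BOUNDARIES if utf8_hex(cp) != exp]
```

An empty mismatches list indicates that the encoder agrees with every row of the figure.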

3.2.1 Partial Validity Checks

Here is a simple version of the UTF-8 validity specification, with the shortest-form bounds checking and exact limit bounds checking omitted. This specification only checks the bounds for the first byte, and that there are the appropriate number (0, 1, 2, or 3) of following bytes in the right ranges. The single byte form does not need to be explicitly set; it is simply any single byte that neither is illegal nor requires additional bytes.

<validity>

 <!--Validity specification for UTF-8, partial boundary checks-->
 <illegal s="80" e = "BF"/>
 <illegal s="F5" e = "FF"/>

 <!-- 2 byte form -->
 <legal s="C0" e="DF" next="final" />
 <legal type="final" s="80" e="BF" />

 <!-- 3 byte form -->
 <legal s="E0" e="EF" next="prefinal" />
 <legal type="prefinal" s="80" e="BF" next="final" /> 

 <!-- 4 byte form -->
 <legal s="F0" e="F4" next="preprefinal" />
 <legal type="preprefinal" s="80" e="BF" next="prefinal" />

</validity>

3.2.2 Full Validity Checks

The following provides the full validity specification for UTF-8, as shown in Figure 2: UTF-8 Boundaries.

<validity>

 <!--Validity specification for UTF-8, full boundary checks-->
 <illegal s="80" e = "C1"/>
 <illegal s="F5" e = "FF"/>

 <!-- 2 byte form -->
 <legal s="C2" e="DF" next="final" />
 <legal type="final" s="80" e="BF"/>

 <!-- 3 byte form; Low range is special-->
 <legal s="E0"        next="prefinalLow" /> 
 <legal type="prefinalLow" s="A0" e="BF" next="final" /> 

 <!-- 3 byte form, Normal -->
 <legal s="E1" e="EF" next="prefinal"  />
 <legal type="prefinal"  s="80" e="BF" next="final" /> 

 <!-- 4 byte form, Low range is special -->
 <legal s="F0"        next="preprefinalLow" /> 
 <legal type="preprefinalLow" s="90" e="BF" next="prefinal"/>

 <!-- 4 byte form, Normal -->
 <legal s="F1" e="F3" next="preprefinal"   />
 <legal type="preprefinal" s="80" e="BF" next="prefinal" />

 <!-- 4 byte form, High range is special-->
 <legal s="F4"        next="preprefinalHigh" />
 <legal type="preprefinalHigh" s="80" e="8F" next="prefinal"/> 

</validity>

Acknowledgments

Thanks to Karlsson Kent, Ken Borgendale, Bertrand Damiba, Mark Leisher, Tony Graham, and Ken Whistler for their feedback on the document.

Copyright © 1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
