An XML representation for the UCD

L2/04-220

Eric Muller, Adobe Systems Inc.
June 6, 2004

1.	Introduction
2.	General principles
3.	Namespace
4.	Database
5.	Code points
5.1.	Designation
5.2.	Groups
5.3.	Age
5.4.	Name Properties
5.5.	General Category
5.6.	Combining Properties
5.7.	Bidirectionality Properties
5.8.	Decomposition Properties
5.9.	Numeric Properties
5.10.	Joining Properties
5.11.	Linebreak Properties
5.12.	East Asian Width Properties
5.13.	Case Properties
5.14.	Script Properties
5.15.	ISO Comment Properties
5.16.	Hangul Syllable Type
5.17.	Unihan properties
6.	Blocks
7.	Complete schema
8.	Examples
Document History

1. Introduction

In working on Unicode implementations, it is often useful to access the full content of the Unicode character database (UCD). For example, in establishing mappings from characters to glyphs in fonts, it is convenient to see the character scalar value, the character name, the character cross-references and the character East Asian width, along with the shape and metrics of the proposed glyph to map to; looking at all this data simultaneously helps in evaluating the mapping.

Directly accessing the data files that constitute the UCD is sometimes a daunting proposition. The data is dispersed in a number of files of various formats, and there are just enough peculiarities (all justified by the processing power available at the time the UCD representation was designed) to require a fairly intimate knowledge of the data format itself, in addition to the meaning of the data.

Many programming environments (e.g. Java or ICU) do give access to the UCD. However, those environments tend to lag behind releases of the standard, or support only some of the UCD content.

Unibook is a wonderful tool to explore the UCD and in many cases is just the ticket; however, it is difficult to use when the task at hand has not been built in, or when non-UCD data is to be displayed alongside.

This paper presents an alternative representation of the UCD, which is meant to overcome these difficulties. We have chosen an XML representation, because parsing becomes a non-issue: there are a number of XML parsers freely available, and using them is often fairly easy. In addition, there are freely available tools that can perform powerful operations on XML data; for example, XPath and XQuery engines can be thought of as “grep” for XML data, and XSLT engines can be thought of as “awk” for XML data.

It is important to note that we are interested in exploring the content of the UCD, rather than using the UCD data in processing character streams. Thus, we are not much concerned with the speed of processing or the size of our representation.

Our representation supports the creation of documents that represent only parts of the UCD, either by not representing all the characters, or by not representing all the properties. This can be useful when only some of the data is needed.

2. General principles

Our schema defines a set of valid documents which are intended to represent properties of Unicode code points and the characters assigned to them. A document may represent the values actually assigned in a given version of the UCD, or it may represent a draft version of the UCD, or a private agreement on Private Use Area characters. The validity of a document does not assert anything about the correctness of the values.

Valid documents may provide values for only some of the Unicode properties. Furthermore, they may also give non-Unicode properties.

Our schema is defined in English prose. However, a useful subset of the validity constraints can be captured using a schema language, thereby simplifying the task of validating documents. We have chosen Relax NG as the schema language. It is important to stress that the Relax NG schema by itself does not define valid documents: it captures only that subset of the constraints.

A design principle for our schema is that it supports the relatively efficient representation of the UCD. This is achieved by an inheritance mechanism, similar to property inheritance in CSS or in XSL-FO.

Characters are pervasive in the UCD, and will need to be represented somehow. Representing characters directly by themselves would seem the most obvious choice; for example, we could express that the decomposition of U+00E8 is “è”, i.e. have exactly two characters in (the infoset of) the XML document. However, the current XML specification limits the set of characters that can be part of a document. Another problem is that the various tools (XML parser, XPath engine, etc.) may equate U+00E8 with U+0065 U+0300, thus making it difficult to figure out which of the two sequences is contained in the database (which is sometimes important for our purposes). Therefore, we chose instead to represent characters by their code points; we follow the usual convention of four to six uppercase hexadecimal digits, with the code points in a sequence separated by a space; e.g., the decomposition of U+00E8 will be represented by the nine characters “0065 0300” in the infoset.
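The convention can be sketched in a few lines of Python. This is only an illustration; the function names encode_cps and decode_cps are ours and are not part of the representation.

```python
def encode_cps(text: str) -> str:
    """Render a string as space-separated uppercase hex code points (at least 4 digits each)."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def decode_cps(value: str) -> str:
    """Turn a code point sequence such as "0065 0300" back into a string."""
    return "".join(chr(int(cp, 16)) for cp in value.split())


# The canonical decomposition of U+00E8: LATIN SMALL LETTER E, COMBINING GRAVE ACCENT.
decomposition = encode_cps("e\u0300")
print(decomposition)       # 0065 0300
print(len(decomposition))  # 9
```

Because the attribute value contains only ASCII digits, letters and spaces, no XML tool can normalize it into a different sequence.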

3. Namespace

The namespace for our elements is “http://www.unicode.org/ns/2003/ucd/1.0”. Our attributes are in the empty namespace.

[Namespace declaration] == default namespace ucd = "http://www.unicode.org/ns/2003/ucd/1.0"

In all our examples, we assume that this namespace is the default one.

Non-Unicode properties can be represented by using elements in another (possibly empty) namespace, and/or by attributes in a non-empty namespace. Such elements and attributes are ignored for the purpose of determining the validity of a document.
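As a rough sketch of how a consumer might apply this rule, the following Python code uses the standard xml.etree.ElementTree module to keep only our elements and our attributes. The sample document, the "my" namespace, its URI and the helper ucd_view are all hypothetical.

```python
import xml.etree.ElementTree as ET

UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Hypothetical document mixing UCD data with private, non-Unicode data.
SAMPLE = """\
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"
     xmlns:my="http://example.org/my-props">
  <repertoire>
    <char cp="20AC" na="EURO SIGN" my:glyph="Euro"/>
    <my:note>not part of the UCD</my:note>
  </repertoire>
</ucd>
"""


def ucd_view(elem):
    """Return (local name, attributes in the empty namespace), or None for a foreign element."""
    prefix = "{" + UCD_NS + "}"
    if not elem.tag.startswith(prefix):
        return None
    # ElementTree spells namespaced attribute names as "{uri}local".
    attrs = {k: v for k, v in elem.attrib.items() if not k.startswith("{")}
    return elem.tag[len(prefix):], attrs


root = ET.fromstring(SAMPLE)
views = [v for v in map(ucd_view, root.iter()) if v is not None]
for name, attrs in views:
    print(name, attrs)
```

The sketch does not prune descendants of foreign elements; a stricter consumer would skip such subtrees entirely.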

4. Database

The root element of valid documents is a ucd element.

To facilitate the identification of a collection, this element may have an attribute desc, which is any string. It is recommended that if the document purports to represent the UCD of some Unicode version, the desc be selected in accord with the rules listed at http://www.unicode.org/versions/; and conversely, that documents which do not purport to represent the UCD be described as such.

[Document root element] == start = element ucd { attribute desc {text}?, ucd.content }

The following sections detail the content of a ucd element.

5. Code points

The repertoire element, a child of ucd, describes the code points.

[Repertoire element] == ucd.content &= element repertoire { (group | code-point)* }?

5.1. Designation

The code-point element represents the use of a code point.

A mandatory type attribute indicates whether the code point is reserved, designated as a noncharacter or as a surrogate, or assigned to an abstract character. A mandatory cp attribute records the code point in question. Other attributes and subelements are used to represent the properties of that code point.

[code-point element] == code-point |= element code-point { attribute cp { text }, attribute type { "reserved" | "noncharacter" | "surrogate" | "char" }, code-point.attributes }

For convenience, the following elements are equivalent to a code-point with a specific type:

[code-point+type elements] ==
  code-point |= element reserved { attribute cp { text }, code-point.attributes }
  code-point |= element noncharacter { attribute cp { text }, code-point.attributes }
  code-point |= element surrogate { attribute cp { text }, code-point.attributes }
  code-point |= element char { attribute cp { text }, code-point.attributes }
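A consumer can treat the five element forms uniformly. The Python sketch below is one way to do so; the helper name designation is ours, and the two sample elements are illustrative and carry no namespace for brevity.

```python
import xml.etree.ElementTree as ET

TYPES = ("reserved", "noncharacter", "surrogate", "char")


def designation(elem):
    """Return (type, cp) for a code-point element or one of its four shorthand forms."""
    name = elem.tag.rsplit("}", 1)[-1]  # drop a leading "{namespace}" if present
    if name == "code-point":
        kind = elem.get("type")
    elif name in TYPES:
        kind = name
    else:
        raise ValueError(f"not a code point element: {name}")
    if kind not in TYPES:
        raise ValueError(f"missing or unknown type: {kind!r}")
    return kind, elem.get("cp")


long_form = ET.fromstring('<code-point cp="FFFF" type="noncharacter"/>')
short_form = ET.fromstring('<noncharacter cp="FFFF"/>')
print(designation(long_form) == designation(short_form))  # True
```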

5.2. Groups

It is often the case that many code points share the same values of some property or properties. For example, the characters U+1740 BUHID LETTER A .. U+1753 BUHID VOWEL SIGN U all have the age “3.2”, and all have the script “Buhd”. On the one hand, it is convenient to support data files in which those properties are explicitly listed with every code point, as this makes answering questions like “what is the age of U+1749” easier, since there is no context. On the other hand, this leads to rather large data files, and it also tends to obscure the differences between similar characters.

Our representation accommodates both scenarios by having the notion of groups. A group is simply a container of code points that also holds default values for the properties. If a code point inside a group does not explicitly list a property but the group lists it, then the code point inherits that property from its group. For example, the fragment with explicit properties:

is equivalent to this fragment which uses a group:

As this example illustrates, the notion of group does not necessarily align with the notion of Unicode block. It is entirely defined by, and limited to, our representation. In particular, the value of a property for a code point can always be determined from the XML document alone, assuming that this property and this code point are expressed at all. Of course, one may create an XML representation where the groups happen to coincide with the Unicode blocks.

Groups cannot be nested; this simplifies the discovery of inherited attributes, as they are precisely in the parent element.
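The lookup rule can be sketched as follows in Python. The fragment is an illustrative one built from two Buhid characters and is not taken from an actual data file; the function name lookup is ours.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: the group supplies defaults, U+1752 overrides gc.
FRAGMENT = ET.fromstring("""\
<group age="3.2" sc="Buhd" gc="Lo">
  <char cp="1740" na="BUHID LETTER A"/>
  <char cp="1752" na="BUHID VOWEL SIGN I" gc="Mn"/>
</group>
""")


def lookup(group, cp, prop):
    """Return the value of prop for cp: the explicit value if present, else the group default."""
    for child in group:
        if child.get("cp") == cp:
            if prop in child.attrib:
                return child.get(prop)
            # Groups are not nested, so the parent is the only place left to look.
            return group.get(prop)
    return None  # the code point is not expressed in this group


print(lookup(FRAGMENT, "1740", "gc"))   # Lo (inherited)
print(lookup(FRAGMENT, "1752", "gc"))   # Mn (explicit)
print(lookup(FRAGMENT, "1752", "age"))  # 3.2 (inherited)
```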

The unified Han ideographs have a very special structure: they share the same basic properties, and their names are all of the form “CJK UNIFIED IDEOGRAPH-cp”, where cp is their code point. The grouping mechanism established so far would take care of the basic properties, but not of the names. To accommodate this, we have a further convention: if the name property on a group contains the character U+002A * ASTERISK, then the inherited value is obtained by replacing this character by the code point. For example:

is equivalently represented by:

We need a final piece of infrastructure to further reduce the size of the data files: if a group contains a number of char elements which only have a cp attribute each, and the values of those attributes form an interval (contiguous, with no gaps), then it can equivalently be represented by placing the attributes first-cp and last-cp on the group element:

These two mechanisms, “*” in names and first-cp/last-cp, are independent, and are not restricted to CJK unified ideographs. For example, the first can be applied in this case:

is equivalently represented by:

It is important to stress that the mechanism of groups and the special syntax for names are entirely defined by our representation and do not depend on anything in the Unicode standard itself. It should be possible to build a program that takes a document that uses groups and creates another equivalent document that does not use them, and uses only the text of this section to do so.
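A minimal sketch of such a program, in Python, is given here for a single group. It relies only on the three rules of this section (inheritance, “*” in names, and the first-cp/last-cp interval). The function name ungroup is ours, and treating a missing type on an interval group as "char" is our assumption.

```python
import xml.etree.ElementTree as ET


def ungroup(group):
    """Expand one group element into a list of standalone code point elements."""
    ns = group.tag[:-len("group")]            # "{namespace}" or "" for bare fragments
    defaults = dict(group.attrib)
    first = defaults.pop("first-cp", None)
    last = defaults.pop("last-cp", None)
    kind = defaults.pop("type", "char")       # assumption: intervals default to char

    if first is not None and last is not None:
        # Interval form: synthesize one element per code point, both ends included.
        members = [ET.Element(ns + kind, {"cp": f"{n:04X}"})
                   for n in range(int(first, 16), int(last, 16) + 1)]
    else:
        members = list(group)

    result = []
    for member in members:
        attrs = {**defaults, **member.attrib}  # explicit values win over defaults
        if "na" not in member.attrib and "*" in attrs.get("na", ""):
            # An inherited name: substitute the code point for the asterisk.
            attrs["na"] = attrs["na"].replace("*", attrs["cp"])
        result.append(ET.Element(member.tag, attrs))
    return result


han = ET.fromstring('<group first-cp="3400" last-cp="3402" type="char" '
                    'na="CJK UNIFIED IDEOGRAPH-*" gc="Lo"/>')
for char in ungroup(han):
    print(ET.tostring(char, encoding="unicode"))
```

Applying ungroup to every group of a repertoire and splicing the results in place of the group yields an equivalent document without groups.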

Here is our schema for groups:

[Group] ==
  group = element group {
    (attribute first-cp { text }, attribute last-cp { text })?,
    attribute type { "reserved" | "noncharacter" | "surrogate" | "char" }?,
    code-point.attributes,
    code-point* }

5.3. Age

The age of a code point is represented by the age attribute. Technically, reserved code points do not have an age, but our schema does not reflect that.

[Age attribute for char] == code-point.attributes &= attribute age { text }?

5.4. Name Properties

There are two name properties: the name given by the current version of the standard (na), and possibly the name this character had in version 1.0 of the standard (na1).

[Names attributes for char] ==
  code-point.attributes &= attribute na { text }?
  code-point.attributes &= attribute na1 { text }?

5.5. General Category

The general category is represented by the gc attribute. The possible values are those listed in TUS 4.0, table 4.2.

[General Category attributes for char] == code-point.attributes &= attribute gc { "Lu" | "Ll" | "Lt" | "Lm" | "Lo" | "Mn" | "Mc" | "Me" | "Nd" | "Nl" | "No" | "Pc" | "Pd" | "Ps" | "Pe" | "Pi" | "Pf" | "Po" | "Sm" | "Sc" | "Sk" | "So" | "Zs" | "Zl" | "Zp" | "Cc" | "Cf" | "Cs" | "Co" | "Cn" }?

5.6. Combining Properties

The combining class is represented by the ccc attribute, which holds the decimal representation of the combining class.

[Combining attributes for char] == code-point.attributes &= attribute ccc { text }?

5.7. Bidirectionality Properties

The bidirectional category is represented by the bc attribute. The possible values are those listed in TUS 4.0, table 3.8.

The mirrored property is represented by the Bidi_M attribute, which can take the values “Y” or “N”.

If the mirrored property is true, then the bmg attribute may be present. Its value is the code point of a character whose glyph is typically a mirrored image of a typical glyph for the current character.

Note that we do not express the “Best Fit” element recorded in BidiMirroring.txt. For one thing, it is not meant to be machine readable. More importantly, the idea underlying the mirrored glyph is delicate to use, since it makes assumptions about the design of the fonts, and the best fit goes even farther.

[Bidirectionality attributes for char] ==
  code-point.attributes &= attribute bc { "AL" | "AN" | "B" | "BN" | "CS" | "EN" | "ES" | "ET" | "L" | "LRE" | "LRO" | "NSM" | "ON" | "PDF" | "R" | "RLE" | "RLO" | "S" | "WS" }?
  code-point.attributes &= attribute Bidi_M { "Y" | "N" }?
  code-point.attributes &= attribute bmg { text }?

5.8. Decomposition Properties

The decomposition type is represented by the dt attribute. The possible values are can for characters with a canonical decomposition, no for characters without a decomposition (either canonical or compatibility), or the tag of a compatibility decomposition (using the values defined by PropertyValueAliases.txt).

If the decomposition type is not no, then the decomposition mapping, recorded by the dm attribute, is meaningful. The value of this attribute is the code point sequence into which this character decomposes.

[Decomposition attributes for char] ==
  code-point.attributes &= attribute dt { "can" | "com" | "enc" | "fin" | "font" | "fra" | "init" | "iso" | "med" | "nar" | "nb" | "sml" | "sqr" | "sub" | "sup" | "vert" | "wide" | "no" }?
  code-point.attributes &= attribute dm { text }?

5.9. Numeric Properties

The numeric type is represented by the nt attribute. The possible values are:

de if the character has a decimal digit value
di if the character has a digit value and no decimal digit value
nu if the character has the numeric value property, but does not have a decimal digit value nor a digit value
no otherwise

If the numeric type is not no, then the numeric value is represented by the nv attribute, which holds the corresponding sequence of code points from the UnicodeData.txt database file.
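The classification above can be sketched in Python from one line of UnicodeData.txt, whose semicolon-separated fields 6, 7 and 8 hold the decimal digit value, the digit value and the numeric value. The function name numeric_attributes is ours, and taking nv from field 8 is our assumption about which field is meant.

```python
def numeric_attributes(line):
    """Derive (nt, nv) from one line of UnicodeData.txt."""
    fields = line.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal:
        nt = "de"   # has a decimal digit value
    elif digit:
        nt = "di"   # has a digit value but no decimal digit value
    elif numeric:
        nt = "nu"   # has only a numeric value
    else:
        nt = "no"
    # Assumption: nv is the numeric value field, which is empty when nt is "no".
    return nt, numeric


print(numeric_attributes("0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;"))
print(numeric_attributes(
    "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;"))
print(numeric_attributes("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"))
```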

[Numeric attributes for char] ==
  code-point.attributes &= attribute nt { "no" | "de" | "di" | "nu" }?
  code-point.attributes &= attribute nv { text }?

5.10. Joining Properties

The joining class of a character is represented by the jt attribute. The possible values are those listed in table 8.2 of the standard.

If the joining class is neither “U”, “C” nor “T”, then the jg attribute is the joining group of the character.

[Joining attributes for char] ==
  code-point.attributes &= attribute jt { "C" | "T" | "U" | "D" | "L" | "R" }?
  code-point.attributes &= attribute jg { text }?

5.11. Linebreak Properties

The linebreak property is represented by the lb attribute. The possible values are those listed in Table 1 of UTR 14.

[Linebreak attributes for char] == code-point.attributes &= attribute lb { "AI" | "AL" | "B2" | "BA" | "BB" | "BK" | "CB" | "CL" | "CM" | "CR" | "EX" | "GL" | "HY" | "ID" | "IN" | "IS" | "LF" | "NL" | "NS" | "NU" | "OP" | "PO" | "PR" | "QU" | "SA" | "SG" | "SP" | "SY" | "WJ" | "XX" | "ZW" }?

5.12. East Asian Width Properties

The East Asian width property is represented by the ea attribute. The possible values are the abbreviated names listed in section 4 of UTR 11.

[East Asian Width attributes for char] == code-point.attributes &= attribute ea { "A" | "F" | "H" | "N" | "Na" | "W" }?

5.13. Case Properties

If a character is cased (that is, its general category is Lu, Ll or Lt), then simple case mappings are recorded using the suc, slc and stc attributes. The values of these attributes are character sequences.

[Case mapping attributes for char] ==
  code-point.attributes &= attribute suc { text }?
  code-point.attributes &= attribute slc { text }?
  code-point.attributes &= attribute stc { text }?

If the character has non-simple casing, this is captured by the uc, lc and tc attributes:

[Case mapping attributes for char] ==
  code-point.attributes &= attribute uc { text }?
  code-point.attributes &= attribute lc { text }?
  code-point.attributes &= attribute tc { text }?

The case foldings are recorded in ccf (common), scf (simple), fcf (full) and tcf (Turkic):

[Case mapping attributes for char] ==
  code-point.attributes &= attribute ccf { text }?
  code-point.attributes &= attribute scf { text }?
  code-point.attributes &= attribute fcf { text }?
  code-point.attributes &= attribute tcf { text }?

5.14. Script Properties

The script property is represented by the sc attribute, using the values specified by PropertyValueAliases.txt.

[Script attributes for char] == code-point.attributes &= attribute sc { "Arab" | "Armn" | "Beng" | "Bopo" | "Brai" | "Buhd" | "Cans" | "Cher" | "Cprt" | "Cyrl" | "Deva" | "Dsrt" | "Ethi" | "Geor" | "Goth" | "Grek" | "Gujr" | "Guru" | "Hang" | "Hani" | "Hano" | "Hebr" | "Hira" | "Hrkt" | "Ital" | "Kana" | "Khmr" | "Knda" | "Laoo" | "Latn" | "Limb" | "Linb" | "Mlym" | "Mong" | "Mymr" | "Ogam" | "Orya" | "Osma" | "Qaai" | "Runr" | "Shaw" | "Sinh" | "Syrc" | "Tagb" | "Tale" | "Taml" | "Telu" | "Tglg" | "Thaa" | "Thai" | "Tibt" | "Ugar" | "Yiii" | "Zyyy" }?

5.15. ISO Comment Properties

The ISO 10646 comment field is represented by the isc attribute.

[ISO comment attributes for char] == code-point.attributes &= attribute isc { text }?

5.16. Hangul Syllable Type

The Hangul Syllable Type is represented by the hst attribute.

[Hangul Syllable Type for char] == code-point.attributes &= attribute hst { "NA" | "L" | "V" | "T" | "LV" | "LVT" }?

5.17. Unihan properties

The Unihan properties (from Unihan.txt) are represented as attributes.

[Unihan attributes for char] ==
  code-point.attributes &= attribute kAccountingNumeric { text }?
  code-point.attributes &= attribute kAlternateHanYu { text }? #old
  code-point.attributes &= attribute kAlternateKangXi { text }?
  code-point.attributes &= attribute kAlternateMorohashi { text }?
  code-point.attributes &= attribute kBigFive { text }?
  code-point.attributes &= attribute kCCCII { text }?
  code-point.attributes &= attribute kCNS1986 { text }?
  code-point.attributes &= attribute kCNS1992 { text }?
  code-point.attributes &= attribute kCangjie { text }?
  code-point.attributes &= attribute kCantonese { text }?
  code-point.attributes &= attribute kCihaiT { text }?
  code-point.attributes &= attribute kCompatibilityVariant { text }?
  code-point.attributes &= attribute kCowles { text }?
  code-point.attributes &= attribute kDaeJaweon { text }?
  code-point.attributes &= attribute kDefinition { text }?
  code-point.attributes &= attribute kEACC { text }?
  code-point.attributes &= attribute kFenn { text }?
  code-point.attributes &= attribute kFrequency { text }?
  code-point.attributes &= attribute kGB0 { text }?
  code-point.attributes &= attribute kGB1 { text }?
  code-point.attributes &= attribute kGB3 { text }?
  code-point.attributes &= attribute kGB5 { text }?
  code-point.attributes &= attribute kGB7 { text }?
  code-point.attributes &= attribute kGB8 { text }?
  code-point.attributes &= attribute kGradeLevel { text }?
  code-point.attributes &= attribute kGSR { text }?
  code-point.attributes &= attribute kHanYu { text }?
  code-point.attributes &= attribute kHanyuPinlu { text }?
  code-point.attributes &= attribute kHKGlyph { text }?
  code-point.attributes &= attribute kHKSCS { text }?
  code-point.attributes &= attribute kIBMJapan { text }?
  code-point.attributes &= attribute kIRGDaeJaweon { text }?
  code-point.attributes &= attribute kIRGDaiKanwaZiten { text }?
  code-point.attributes &= attribute kIRGHanyuDaZidian { text }?
  code-point.attributes &= attribute kIRGKangXi { text }?
  code-point.attributes &= attribute kIRG_GSource { text }?
  code-point.attributes &= attribute kIRG_HSource { text }?
  code-point.attributes &= attribute kIRG_JSource { text }?
  code-point.attributes &= attribute kIRG_KPSource { text }?
  code-point.attributes &= attribute kIRG_KSource { text }?
  code-point.attributes &= attribute kIRG_TSource { text }?
  code-point.attributes &= attribute kIRG_VSource { text }?
  code-point.attributes &= attribute kIRG_USource { text }?
  code-point.attributes &= attribute kJIS0213 { text }?
  code-point.attributes &= attribute kJapaneseKun { text }?
  code-point.attributes &= attribute kJapaneseOn { text }?
  code-point.attributes &= attribute kJis0 { text }?
  code-point.attributes &= attribute kJis1 { text }?
  code-point.attributes &= attribute kKPS0 { text }?
  code-point.attributes &= attribute kKPS1 { text }?
  code-point.attributes &= attribute kKSC0 { text }?
  code-point.attributes &= attribute kKSC1 { text }?
  code-point.attributes &= attribute kKangXi { text }?
  code-point.attributes &= attribute kKarlgren { text }?
  code-point.attributes &= attribute kKorean { text }?
  code-point.attributes &= attribute kLau { text }?
  code-point.attributes &= attribute kMainlandTelegraph { text }?
  code-point.attributes &= attribute kMandarin { text }?
  code-point.attributes &= attribute kMatthews { text }?
  code-point.attributes &= attribute kMeyerWempe { text }?
  code-point.attributes &= attribute kMorohashi { text }?
  code-point.attributes &= attribute kNelson { text }?
  code-point.attributes &= attribute kOtherNumeric { text }?
  code-point.attributes &= attribute kPhonetic { text }?
  code-point.attributes &= attribute kPrimaryNumeric { text }?
  code-point.attributes &= attribute kPseudoGB1 { text }?
  code-point.attributes &= attribute kRSJapanese { text }?
  code-point.attributes &= attribute kRSKanWa { text }?
  code-point.attributes &= attribute kRSKangXi { text }?
  code-point.attributes &= attribute kRSKorean { text }?
  code-point.attributes &= attribute kRSUnicode { text }?
  code-point.attributes &= attribute kSemanticVariant { text }?
  code-point.attributes &= attribute kSBGY { text }?
  code-point.attributes &= attribute kSimplifiedVariant { text }?
  code-point.attributes &= attribute kSpecializedSemanticVariant { text }?
  code-point.attributes &= attribute kTaiwanTelegraph { text }?
  code-point.attributes &= attribute kTang { text }?
  code-point.attributes &= attribute kTotalStrokes { text }?
  code-point.attributes &= attribute kTraditionalVariant { text }?
  code-point.attributes &= attribute kVietnamese { text }?
  code-point.attributes &= attribute kXerox { text }?
  code-point.attributes &= attribute kZVariant { text }?

6. Blocks

The Unicode blocks are represented in the blocks element, which is a child of the ucd element.

[Blocks element] == ucd.content &= element blocks { block* }?

Each block is represented by a block element. The representation uses the attributes first, last and name in the obvious way:

[Block element] == block = element block { attribute first { text }, attribute last { text }, attribute name { text }}
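As an illustration of how the blocks element might be consulted, here is a short Python sketch. It assumes that first and last hold code points in the same hexadecimal convention as cp; the three sample blocks and the function name block_of are ours.

```python
import xml.etree.ElementTree as ET

# Illustrative excerpt; a full document would list every block.
BLOCKS = ET.fromstring("""\
<blocks>
  <block first="0000" last="007F" name="Basic Latin"/>
  <block first="20A0" last="20CF" name="Currency Symbols"/>
  <block first="AC00" last="D7AF" name="Hangul Syllables"/>
</blocks>
""")


def block_of(blocks, cp):
    """Return the name of the block containing cp, or None if no listed block does."""
    n = int(cp, 16)
    for block in blocks:
        if int(block.get("first"), 16) <= n <= int(block.get("last"), 16):
            return block.get("name")
    return None


print(block_of(BLOCKS, "20AC"))  # Currency Symbols
print(block_of(BLOCKS, "0370"))  # None: not in this excerpt
```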

7. Complete schema

Finally, we can put our schema together:

[UCD RelaxNG schema] == [Namespace declaration] [schema: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

8. Examples

Here is a fragment of the full UCD (i.e. all properties expressed), for the Currency Symbols:

[Example 1] ==
  <group age="1.1" na1="" isc="" gc="Sc" ccc="0" dt="no" dm="" nt="no" nv=""
         bc="ET" Bidi_M="N" bmg="" suc="" slc="" stc="" uc="" lc="" tc=""
         ccf="" scf="" fcf="" tcf="" jt="U" jg="" ea="N" lb="PR" sc="Zyyy" hst="NA">
    <char cp="20A0" na="EURO-CURRENCY SIGN"/>
    <char cp="20A1" na="COLON SIGN"/>
    <char cp="20A2" na="CRUZEIRO SIGN"/>
    <char cp="20A3" na="FRENCH FRANC SIGN"/>
    <char cp="20A4" na="LIRA SIGN"/>
    <char cp="20A5" na="MILL SIGN"/>
    <char cp="20A6" na="NAIRA SIGN"/>
    <char cp="20A7" na="PESETA SIGN" lb="PO"/>
    <char cp="20A8" na="RUPEE SIGN" dt="com" dm="0052 0073"/>
    <char cp="20A9" na="WON SIGN" ea="H"/>
    <char cp="20AA" na="NEW SHEQEL SIGN"/>
    <char cp="20AB" age="2.0" na="DONG SIGN"/>
    <char cp="20AC" age="2.1" na="EURO SIGN" ea="A"/>
    <char cp="20AD" age="3.0" na="KIP SIGN"/>
    <char cp="20AE" age="3.0" na="TUGRIK SIGN"/>
    <char cp="20AF" age="3.0" na="DRACHMA SIGN"/>
    <char cp="20B0" age="3.2" na="GERMAN PENNY SIGN"/>
    <char cp="20B1" age="3.2" na="PESO SIGN"/>
  </group>

Typically, aligning the groups with the Unicode blocks leads to fairly compact data, as can be seen above. This also helps spot the particularities of individual characters relative to their group: the unusual linebreaking of U+20A7 PESETA SIGN, the unusual East Asian width of U+20AC EURO SIGN.

There are a few instances where a block has vastly different characters and breaking it into multiple groups makes for a much more readable XML representation. For example, isolating the C0 and C1 controls in their own groups, or isolating the noncharacters (especially those in the Arabic Presentation Forms-A block) is beneficial.
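Spotting such particularities can also be automated. The Python sketch below lists, for an abridged copy of the Currency Symbols group, the attributes on which each character departs from the group defaults; the function name overrides is ours.

```python
import xml.etree.ElementTree as ET

# Abridged from the Currency Symbols example: fewer attributes and characters.
GROUP = ET.fromstring("""\
<group age="1.1" gc="Sc" ea="N" lb="PR" dt="no" dm="">
  <char cp="20A6" na="NAIRA SIGN"/>
  <char cp="20A7" na="PESETA SIGN" lb="PO"/>
  <char cp="20A8" na="RUPEE SIGN" dt="com" dm="0052 0073"/>
  <char cp="20A9" na="WON SIGN" ea="H"/>
  <char cp="20AC" age="2.1" na="EURO SIGN" ea="A"/>
</group>
""")


def overrides(group):
    """Map each cp to the attributes whose value differs from the group default."""
    found = {}
    for child in group:
        diff = {k: v for k, v in child.attrib.items()
                if k in group.attrib and group.get(k) != v}
        if diff:
            found[child.get("cp")] = diff
    return found


for cp, diff in overrides(GROUP).items():
    print(cp, diff)
```

Attributes that the group does not list at all, such as cp and na here, are not reported, since there is no default to depart from.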

When the Unihan properties are not included in the XML representation, we get a fairly compact representation:

[Example 2] == <group first-cp="3400" last-cp="4DB5" type="char" age="3.0" na="CJK UNIFIED IDEOGRAPH-*" na1="" isc="" gc="Lo" ccc="0" dt="no" dm="" nt="no" nv="" bc="L" Bidi_M="N" bmg="" suc="" slc="" stc="" uc="" lc="" tc="" ccf="" scf="" fcf="" tcf="" jt="U" jg="" ea="W" lb="ID" sc="Hani" hst="NA"/>

Another interesting example is the beginning of the group for Hangul Syllables. Because the space concerns are not paramount in our representation, we can avoid all the “built-in” knowledge of those characters:

[Example 3] ==
  <group age="2.0" na1="" isc="" gc="Lo" ccc="0" dt="can" nt="no" nv=""
         bc="L" Bidi_M="N" bmg="" suc="" slc="" stc="" uc="" lc="" tc=""
         ccf="" scf="" fcf="" tcf="" jt="U" jg="" ea="W" lb="ID" sc="Hang" hst="LVT">
    <char cp="AC00" na="HANGUL SYLLABLE GA" dm="1100 1161" hst="LV"/>
    <char cp="AC01" na="HANGUL SYLLABLE GAG" dm="1100 1161 11A8"/>
    <char cp="AC02" na="HANGUL SYLLABLE GAGG" dm="1100 1161 11A9"/>
    <char cp="AC03" na="HANGUL SYLLABLE GAGS" dm="1100 1161 11AA"/>
    <char cp="AC04" na="HANGUL SYLLABLE GAN" dm="1100 1161 11AB"/>

Yet, the resulting XML files are not wasteful of space. In fact, an experimental version of the 4.0.1 UCD without the Unihan properties is 1,923,716 bytes, while the corresponding UCD files are 2,306,655 bytes. Similarly, an experimental version with the Unihan properties is roughly equal in size to Unihan.txt itself.

Document History

Author: Eric Muller

Revision	Date	Comments
1	June 6, 2004	Initial version