
Unicode Technical Standard #35

Unicode Locale Data Markup Language (LDML)
Part 6: Related Information

Version 24
Editors CLDR committee members
Date 2013-09-18
This Version http://www.unicode.org/reports/tr35/tr35-33/tr35.html
Previous Version http://www.unicode.org/reports/tr35/tr35-31/tr35.html
Latest Version http://www.unicode.org/reports/tr35/
Corrigenda http://unicode.org/cldr/corrigenda.html
Latest Proposed Update http://www.unicode.org/reports/tr35/proposed.html
Namespace http://cldr.unicode.org/
DTDs http://unicode.org/cldr/dtd/24/
Revision 33

Summary

This document describes parts of an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.

This is a partial document, describing only those parts of the LDML that are relevant for supplemental data. For the other parts of the LDML see the main LDML document and the links above.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Parts

The LDML specification is divided into the following parts:

Contents of Part 6, Related Information

Introduction Supplemental Data

The following represents the format for additional supplemental information. This is information that is important for internationalization and proper use of CLDR, but is not contained in the locale hierarchy. It is not localizable, nor is it overridden by locale data. The current CLDR data can be viewed in the Supplemental Charts.

The data in CLDR is presently split into multiple files: supplementalData.xml, supplementalMetadata.xml, characters.xml, likelySubtags.xml, ordinals.xml, plurals.xml, telephoneCodeData.xml, genderList.xml, plus transforms (see Section 5.16 Transforms and Appendix N: Transform Rules). The split is just for convenience: logically, they are treated as though they were a single file. Future versions of CLDR may split the data in a different fashion. Do not depend on any specific XML filename or path for supplemental data.

Note that Section 9, Locale Metadata Elements, presents information about metadata that is maintained on a per-locale basis. It is included in this part because it is not intended to be used as part of the locale itself.

1 Territory Data

1.1 Supplemental Territory Containment

<!ELEMENT territoryContainment ( group* ) >
<!ELEMENT group EMPTY >
<!ATTLIST group type NMTOKEN #REQUIRED >
<!ATTLIST group contains NMTOKENS #IMPLIED >
<!ATTLIST group grouping ( true | false ) #IMPLIED >
<!ATTLIST group status ( deprecated | grouping ) #IMPLIED >

The following data shows groupings of countries (regions). The data is based on [UNM49]. There is one special code, QO, which is used for outlying areas of Oceania that are typically uninhabited. The territory containment forms a tree with the following levels:

World

Continent

Subcontinent

Country/Region

For a chart showing the relationships (plus the included timezones), see the Territory Containment Chart. The XML structure has the following form.

<territoryContainment>
<group type="001" contains="002 009 019 142 150"/> <!--World -->
<group type="011" contains="BF BJ CI CV GH GM GN GW LR ML MR NE NG SH SL SN TG"/> <!--Western Africa -->
<group type="013" contains="BZ CR GT HN MX NI PA SV"/> <!--Central America -->
<group type="014" contains="BI DJ ER ET KE KM MG MU MW MZ RE RW SC SO TZ UG YT ZM ZW"/> <!--Eastern Africa -->
<group type="142" contains="030 035 062 145"/> <!--Asia -->
<group type="145" contains="AE AM AZ BH CY GE IL IQ JO KW LB OM PS QA SA SY TR YE"/> <!--Western Asia -->
<group type="015" contains="DZ EG EH LY MA SD TN"/> <!--Northern Africa -->
...

There are groupings that don't follow this regular structure, such as:

<group type="003" contains="013 021 029" grouping="true"/> <!--North America -->

These are marked with the attribute grouping="true".

When groupings have been deprecated but kept around for backwards compatibility, they are marked with the attribute status="deprecated", like this:

<group type="029" contains="AN" status="deprecated"/> <!--Caribbean -->

When the containment relationship itself is a grouping, it is marked with the attribute status="grouping", like this:

<group type="150" contains="EU" status="grouping"/> <!--Europe -->

That is, the type value isn’t a grouping, but if you filter out groupings you can drop this containment. In the example above, EU is a grouping, and contained in 150.
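As a rough illustration, the following Python sketch reads a fragment of the containment data shown above and builds the regular containment tree. The function names load_containment and leaves are hypothetical, and skipping groups marked status="deprecated" is an assumption about how a client might treat them, not a requirement of this specification.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample copied from the examples above.
SAMPLE = """
<territoryContainment>
  <group type="001" contains="002 009 019 142 150"/>
  <group type="142" contains="030 035 062 145"/>
  <group type="145" contains="AE AM AZ BH CY GE IL IQ JO KW LB OM PS QA SA SY TR YE"/>
  <group type="003" contains="013 021 029" grouping="true"/>
  <group type="029" contains="AN" status="deprecated"/>
  <group type="150" contains="EU" status="grouping"/>
</territoryContainment>
"""


def load_containment(xml_text, include_groupings=False):
    """Return {parent: [children]} from a territoryContainment fragment."""
    tree = {}
    for group in ET.fromstring(xml_text).iter("group"):
        status = group.get("status")
        if status == "deprecated":
            continue  # assumption: ignore entries kept only for compatibility
        is_grouping = group.get("grouping") == "true" or status == "grouping"
        if is_grouping and not include_groupings:
            continue  # drop irregular groupings unless explicitly requested
        children = group.get("contains", "").split()
        tree.setdefault(group.get("type"), []).extend(children)
    return tree


def leaves(tree, code):
    """Return the leaf codes reachable from `code` (or `code` itself if it has no children)."""
    children = tree.get(code)
    if not children:
        return [code]
    result = []
    for child in children:
        result.extend(leaves(tree, child))
    return result
```

With the sample above, leaves(load_containment(SAMPLE), "142") lists the subcontinent codes that have no entry in the sample, followed by the countries of Western Asia.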

1.2 Supplemental Territory Information

<!ELEMENT territory ( languagePopulation* ) >
<!ATTLIST territory type NMTOKEN #REQUIRED >
<!ATTLIST territory gdp NMTOKEN #REQUIRED >
<!ATTLIST territory literacyPercent NMTOKEN #REQUIRED >
<!ATTLIST territory population NMTOKEN #REQUIRED >

<!ELEMENT languagePopulation EMPTY >
<!ATTLIST languagePopulation type NMTOKEN #REQUIRED >
<!ATTLIST languagePopulation writingPercent NMTOKEN #IMPLIED >
<!ATTLIST languagePopulation populationPercent NMTOKEN #REQUIRED >
<!ATTLIST languagePopulation officialStatus (de_facto_official | official | official_regional | official_minority) #IMPLIED >

This data provides testing information for language and territory populations. The main goal is to provide approximate figures for the literate, functional population for each language in each territory: that is, the population that is able to read and write each language, and is comfortable enough to use it with computers.

The GDP and Literacy figures are taken from the World Bank where available, otherwise supplemented by FactBook data and other sources. Much of the per-language data is taken from the Ethnologue, but is supplemented and processed using many other sources, including per-country census data. (The focus of the Ethnologue is native speakers, which includes people who are not literate, and excludes people who are functional second-language users.)

The percentages may add up to more than 100% due to multilingual populations, or may be less than 100% due to illiteracy or because the data has not yet been gathered or processed. Languages with a small population may be omitted.
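The relationship between the territory and languagePopulation attributes can be sketched as follows. The sample element uses made-up figures for the unknown-region code ZZ and private-use language codes; none of the numbers are real CLDR data, and the function name language_populations is hypothetical. The sketch only multiplies population by populationPercent and does not attempt to model literacyPercent or writingPercent.

```python
import xml.etree.ElementTree as ET

# Hypothetical figures, chosen only to illustrate the arithmetic.
SAMPLE = """
<territory type="ZZ" gdp="1000000" literacyPercent="90" population="200000">
  <languagePopulation type="qaa" populationPercent="80" officialStatus="official"/>
  <languagePopulation type="qab" populationPercent="35"/>
</territory>
"""


def language_populations(xml_text):
    """Return {language: approximate number of users} for one territory element."""
    territory = ET.fromstring(xml_text)
    total = float(territory.get("population"))
    return {
        lp.get("type"): total * float(lp.get("populationPercent")) / 100
        for lp in territory.iter("languagePopulation")
    }
```

In this made-up sample the percentages add up to 115%, which is the multilingual case described above.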

2 Supplemental Language Data

<!ELEMENT languageData ( language* ) >
<!ELEMENT language EMPTY >
<!ATTLIST language type NMTOKEN #REQUIRED >
<!ATTLIST language scripts NMTOKENS #IMPLIED >
<!ATTLIST language territories NMTOKENS #IMPLIED >
<!ATTLIST language variants NMTOKENS #IMPLIED >
<!ATTLIST language alt NMTOKENS #IMPLIED >
 

The language data is used for consistency checking and testing. It provides a list of which languages are used with which scripts and in which countries. To a large extent, however, the territory list has been superseded by the territoryInfo data described in 1.2 Supplemental Territory Information.

	<languageData>
		<language type="af" scripts="Latn" territories="ZA"/>
		<language type="am" scripts="Ethi" territories="ET"/>
		<language type="ar" scripts="Arab" territories="AE BH DZ EG IN IQ JO KW LB LY MA OM PS QA SA SD SY TN YE"/>
                ...

If the language is not a modern language, or the script is not a modern script, or the language is not a major language of the territory, then the alt attribute is set to secondary.

		<language type="fr" scripts="Latn" territories="IT US" alt="secondary" />
                ...
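A minimal sketch of how a consistency checker might read this data, separating primary entries from those marked alt="secondary". The function name territories_by_language and the shape of the returned dictionary are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample assembled from the examples above.
SAMPLE = """
<languageData>
  <language type="af" scripts="Latn" territories="ZA"/>
  <language type="am" scripts="Ethi" territories="ET"/>
  <language type="fr" scripts="Latn" territories="IT US" alt="secondary"/>
</languageData>
"""


def territories_by_language(xml_text):
    """Return {language: {"primary": set, "secondary": set}} of territory codes."""
    result = {}
    for lang in ET.fromstring(xml_text).iter("language"):
        kind = "secondary" if lang.get("alt") == "secondary" else "primary"
        entry = result.setdefault(
            lang.get("type"), {"primary": set(), "secondary": set()}
        )
        entry[kind].update(lang.get("territories", "").split())
    return result
```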

3 Supplemental Code Mapping

<!ELEMENT languageCodes EMPTY >
<!ATTLIST languageCodes type NMTOKEN #REQUIRED>
<!ATTLIST languageCodes alpha3 NMTOKEN #REQUIRED>

<!ELEMENT territoryCodes EMPTY >
<!ATTLIST territoryCodes type NMTOKEN #REQUIRED>
<!ATTLIST territoryCodes numeric NMTOKEN #REQUIRED>
<!ATTLIST territoryCodes alpha3 NMTOKEN #REQUIRED>
<!ATTLIST territoryCodes fips10 NMTOKEN #IMPLIED>
<!ATTLIST territoryCodes internet NMTOKENS #IMPLIED>

The code mapping information provides mappings between the subtags used in the CLDR locale IDs (from BCP 47) and other coding systems or related information. Language code mappings are provided only for codes that have two letters in BCP 47, and map them to their ISO three-letter equivalents. The territory codes provide mappings to numeric codes (UN M.49 [UNM49] codes, equivalent to ISO numeric codes), ISO three-letter codes, FIPS 10 codes, and the internet top-level domain codes. The alphabetic codes are only provided where different from the type. For example:

<territoryCodes type="AA" numeric="958" alpha3="AAA"/>
<territoryCodes type="AD" numeric="020" alpha3="AND" fips10="AN"/>
<territoryCodes type="AE" numeric="784" alpha3="ARE"/>
...
<territoryCodes type="GB" numeric="826" alpha3="GBR" fips10="UK" internet="UK GB"/>
...
<territoryCodes type="QU" numeric="967" alpha3="QUU" internet="EU"/>
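The lookup can be sketched in Python as below. The <sample> wrapper element exists only to make the fragment well-formed, and the function name territory_codes is hypothetical. Falling back to the type when fips10 or internet is absent is this sketch's reading of "only provided where different from the type"; treat it as an assumption.

```python
import xml.etree.ElementTree as ET

# Rows copied from the example above; <sample> is a hypothetical wrapper.
SAMPLE = """
<sample>
  <territoryCodes type="AD" numeric="020" alpha3="AND" fips10="AN"/>
  <territoryCodes type="AE" numeric="784" alpha3="ARE"/>
  <territoryCodes type="GB" numeric="826" alpha3="GBR" fips10="UK" internet="UK GB"/>
</sample>
"""


def territory_codes(xml_text):
    """Return {territory: {"numeric", "alpha3", "fips10", "internet"}}."""
    table = {}
    for el in ET.fromstring(xml_text).iter("territoryCodes"):
        code = el.get("type")
        table[code] = {
            "numeric": el.get("numeric"),  # kept as text to preserve leading zeros
            "alpha3": el.get("alpha3"),
            "fips10": el.get("fips10", code),  # assumed default: same as type
            "internet": el.get("internet", code).split(),  # may list several codes
        }
    return table
```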

4 Telephone Code Data

<!ELEMENT telephoneCodeData ( codesByTerritory* ) >

<!ELEMENT codesByTerritory ( telephoneCountryCode+ ) >
<!ATTLIST codesByTerritory territory NMTOKEN #REQUIRED >

<!ELEMENT telephoneCountryCode EMPTY >
<!ATTLIST telephoneCountryCode code NMTOKEN #REQUIRED >
<!ATTLIST telephoneCountryCode from NMTOKEN #IMPLIED >
<!ATTLIST telephoneCountryCode to NMTOKEN #IMPLIED >

This data specifies the mapping between ITU telephone country codes [ITUE164] and CLDR-style territory codes (ISO 3166 2-letter codes or non-corresponding UN M.49 [UNM49] 3-digit codes). There are several things to note:

A subset of the telephone code data might look like the following (showing a past mapping change to illustrate the from and to attributes):

<codesByTerritory territory="001">
	<telephoneCountryCode code="800"/> <!-- International Freephone Service -->
	<telephoneCountryCode code="808"/> <!-- International Shared Cost Services (ISCS) -->
	<telephoneCountryCode code="870"/> <!-- Inmarsat Single Number Access Service (SNAC) -->
</codesByTerritory>
<codesByTerritory territory="AS"> <!-- American Samoa -->
	<telephoneCountryCode code="1" from="2004-10-02"/> <!-- +1 684 in North America Numbering Plan -->
	<telephoneCountryCode code="684" to="2005-04-02"/> <!-- +684 now a spare code -->
</codesByTerritory>
<codesByTerritory territory="CA">
	<telephoneCountryCode code="1"/> <!-- North America Numbering Plan -->
</codesByTerritory>
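To illustrate the from and to attributes, the following sketch returns the country codes in effect for a territory on a given day. It assumes that both attributes hold full ISO dates (YYYY-MM-DD) and that both endpoints are inclusive; neither assumption is stated in this section. The function name codes_on is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Sample copied from the example above (comments omitted).
SAMPLE = """
<telephoneCodeData>
  <codesByTerritory territory="AS">
    <telephoneCountryCode code="1" from="2004-10-02"/>
    <telephoneCountryCode code="684" to="2005-04-02"/>
  </codesByTerritory>
  <codesByTerritory territory="CA">
    <telephoneCountryCode code="1"/>
  </codesByTerritory>
</telephoneCodeData>
"""


def codes_on(xml_text, territory, day):
    """Return the telephone country codes valid for `territory` on `day`."""
    codes = []
    for entry in ET.fromstring(xml_text).iter("codesByTerritory"):
        if entry.get("territory") != territory:
            continue
        for tc in entry.iter("telephoneCountryCode"):
            start, end = tc.get("from"), tc.get("to")
            if start and day < date.fromisoformat(start):
                continue  # mapping not yet in effect
            if end and day > date.fromisoformat(end):
                continue  # mapping no longer in effect
            codes.append(tc.get("code"))
    return codes
```

For American Samoa, the sample yields only 684 before the transition, both codes during the overlap, and only 1 afterwards.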

5 Postal Code Validation

<!ELEMENT postalCodeData (postCodeRegex*) >
<!ELEMENT postCodeRegex (#PCDATA) >
<!ATTLIST postCodeRegex territoryId NMTOKEN #REQUIRED>

The Postal Code regex information can be used to validate postal codes used in different countries. In some cases, the regex is quite simple, such as for Germany:

<postCodeRegex territoryId="DE" >\d{5}</postCodeRegex>

The US code is slightly more complicated, since there is an optional portion:

<postCodeRegex territoryId="US" >\d{5}([ \-]\d{4})?</postCodeRegex>

The most complicated currently is the UK.
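A minimal validation sketch using the two patterns quoted above. It assumes the regex must match the entire postal code and that \d means ASCII digits only; the function name is_valid_postcode and the dictionary POSTCODE_REGEX are hypothetical.

```python
import re

# Patterns copied from the postCodeRegex examples above.
POSTCODE_REGEX = {
    "DE": r"\d{5}",
    "US": r"\d{5}([ \-]\d{4})?",
}


def is_valid_postcode(territory, text):
    """Return True/False for known territories, or None if no regex is available."""
    pattern = POSTCODE_REGEX.get(territory)
    if pattern is None:
        return None
    # re.ASCII restricts \d to 0-9; fullmatch requires the whole string to match.
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None
```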

6 Supplemental Character Fallback Data

<!ELEMENT characters ( character-fallback*) >

<!ELEMENT character-fallback ( character* ) >
<!ELEMENT character (substitute*) >
<!ATTLIST character value CDATA #REQUIRED >

<!ELEMENT substitute (#PCDATA) >

The characters element provides a way for non-Unicode systems, or systems that only support a subset of Unicode characters, to transform CLDR data. It gives a list of characters with alternative values that can be used if the main value is not available. For example:

<characters>
	<character-fallback>
		<character value="ß">
			<substitute>ss</substitute>
		</character>
		<character value="Ø">
			<substitute>Ö</substitute>
			<substitute>O</substitute>
		</character>
		<character value="₧">
			<substitute>Pts</substitute>
		</character>
		<character value="₣">
			<substitute>Fr.</substitute>
		</character>
	</character-fallback>
</characters>

The ordering of the substitute elements indicates the preference among them.

That is, this data provides recommended fallbacks for use when a charset or supported repertoire does not contain a desired character. There is more than one possible fallback: the recommended usage is that when a character value is not in the desired repertoire the following process is used, whereby the first value that is wholly in the desired repertoire is used.
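The selection rule for explicit substitutes can be sketched as follows. This covers only the ordered substitute elements; any other steps of the recommended process are not modeled. The names fallback, SUBSTITUTES, and ASCII are hypothetical.

```python
def fallback(char, substitutes, repertoire):
    """Return `char` if supported, else the first substitute wholly in `repertoire`."""
    if char in repertoire:
        return char
    for candidate in substitutes.get(char, []):
        # A substitute qualifies only if every one of its characters is supported.
        if all(c in repertoire for c in candidate):
            return candidate
    return None  # no usable fallback is listed


# Substitutes in preference order, taken from the example above.
SUBSTITUTES = {"ß": ["ss"], "Ø": ["Ö", "O"]}
ASCII = {chr(i) for i in range(0x20, 0x7F)}
```

With an ASCII-only repertoire, "Ø" falls through "Ö" to "O"; if the repertoire also contains "Ö", that earlier substitute is chosen instead.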

7 Coverage Levels

The following describes the coverage levels used for the current version of CLDR. This list will change between releases of CLDR. Each level adds to what is in the lower level.

Level | Name | Description
0 | undetermined | Does not meet any of the following levels.
10 | core | The CLDR "core" data, which is defined as the basic information about the language and writing system that is required before other information can be added using the CLDR survey tool. See http://cldr.unicode.org/index/cldr-spec/minimaldata
20 | posix | The minimum amount of data necessary in order to create a POSIX style locale from CLDR data. For example, only one country name, only one currency symbol, and so on.
30 | minimal | The minimum amount of locale data deemed necessary to create a "viable" locale in CLDR. Contains names for the languages, scripts, and territories associated with the language, numbering systems used in those languages, date and number formats, plus a few key values such as the values in Section 3.1 Unknown or Invalid Identifiers. See http://cldr.unicode.org/index/cldr-spec/minimaldata for a detailed list of the minimal data requirements.
40 | basic | Contains data associated with the most prominent languages and countries.
60 | moderate | Contains more types of data and more language and territory names than the basic level. If the language is associated with an EU country, then the moderate level attempts to complete the data as it pertains to all EU member countries.
80 | modern | Contains all fields in normal modern use, including all country names, and currencies in use.
100 | comprehensive | Contains complete localizations (or valid inheritance) for every possible field.
101 | optional | Fields that are not typically in use, or are deprecated.

Levels 40 through 80 are based on the definitions and specifications listed in 7.1-7.3. However, these principles are continually being refined by the CLDR technical committee, and so do not completely reflect the data that is actually used for coverage determination, which is under the XPath //supplementalData/coverageLevels. For a view of the trunk version of this data file, see coverageLevels.xml. (As described in the introduction to Supplemental Data, the specific XML filename may change.)

<!ELEMENT coverageLevels ( coverageVariable*, coverageLevel* ) >
<!ELEMENT coverageLevel EMPTY >
<!ATTLIST coverageLevel inLanguage CDATA #IMPLIED >
<!ATTLIST coverageLevel inScript CDATA #IMPLIED >
<!ATTLIST coverageLevel inTerritory CDATA #IMPLIED >
<!ATTLIST coverageLevel value CDATA #REQUIRED >
<!ATTLIST coverageLevel match CDATA #REQUIRED >

Here is an example coverageLevel element.

<coverageLevel
value="30" inLanguage="(de|fi)"
match="localeDisplayNames/types/type[@type='phonebook'][@key='collation']"/>

The coverageLevel elements are read in order, and the first match results in a coverage level value. The element matches based on the inLanguage, inScript, inTerritory, and match attribute values, which are regular expressions. In the example above, a match occurs if the language is de or fi, and if the path is a locale display name for collation=phonebook.

The match attribute value logically has "//ldml/" prefixed before it is applied. In addition, the "[@" is automatically quoted. Otherwise standard Perl/Java style regular expression syntax is used.
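The matching procedure might be sketched in Python as follows. The sketch assumes that each regular expression must match its whole input and that an absent inLanguage, inScript, or inTerritory attribute matches anything; the names compile_match and coverage_level, and the representation of rules as dictionaries, are hypothetical.

```python
import re


def compile_match(match):
    """Prefix //ldml/ and quote each "[@" so that it is taken literally."""
    return re.compile("//ldml/" + match.replace("[@", r"\[@"))


def coverage_level(rules, path, language=None, script=None, territory=None):
    """Return the value of the first rule that matches, or None."""
    for rule in rules:
        conditions = (
            (rule.get("inLanguage"), language),
            (rule.get("inScript"), script),
            (rule.get("inTerritory"), territory),
        )
        # A rule is skipped if any condition it states is not satisfied.
        if any(
            pattern is not None
            and (actual is None or not re.fullmatch(pattern, actual))
            for pattern, actual in conditions
        ):
            continue
        if compile_match(rule["match"]).fullmatch(path):
            return int(rule["value"])
    return None


# The coverageLevel example above, expressed as a dictionary of its attributes.
RULES = [
    {
        "value": "30",
        "inLanguage": "(de|fi)",
        "match": "localeDisplayNames/types/type[@type='phonebook'][@key='collation']",
    },
]
PATH = "//ldml/localeDisplayNames/types/type[@type='phonebook'][@key='collation']"
```

Here coverage_level(RULES, PATH, language="de") yields 30, while the same path with language="fr" matches no rule.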

<!ELEMENT coverageVariable EMPTY >
<!ATTLIST coverageVariable key CDATA #REQUIRED >
<!ATTLIST coverageVariable value CDATA #REQUIRED >

The coverageVariable element defines variables for regular expressions that are used frequently in coverageLevel definitions. Each coverage variable must contain a key / value pair of attributes; the value is then substituted wherever the key appears in a coverageLevel definition.

Here is an example coverageLevel element that uses coverageVariable substitution.

<coverageVariable key="%dayTypes" value="(sun|mon|tue|wed|thu|fri|sat)"/>
<coverageVariable key="%wideAbbr" value="(wide|abbreviated)"/>
<coverageLevel value="20" match="dates/calendars/calendar[@type='gregorian']/days/dayContext[@type='format']/dayWidth[@type='%wideAbbr']/day[@type='%dayTypes']"/>

In this example, the coverage variables %dayTypes and %wideAbbr are used to substitute their respective values into the match expression. This allows us to reuse the same variable for other coverageLevel matches that use the same regular expression fragment.
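Variable substitution can be sketched as plain text replacement performed before the match expression is compiled. Replacing longer keys first is a precaution taken by this sketch in case one key is a prefix of another; it is not described in this specification. The names expand_variables and to_regex are hypothetical.

```python
import re

# Variables and match expression copied from the example above.
VARIABLES = {
    "%dayTypes": "(sun|mon|tue|wed|thu|fri|sat)",
    "%wideAbbr": "(wide|abbreviated)",
}
MATCH = (
    "dates/calendars/calendar[@type='gregorian']/days/dayContext[@type='format']"
    "/dayWidth[@type='%wideAbbr']/day[@type='%dayTypes']"
)


def expand_variables(match, variables):
    """Replace each variable key in `match` with its value, longest keys first."""
    for key in sorted(variables, key=len, reverse=True):
        match = match.replace(key, variables[key])
    return match


def to_regex(match, variables):
    """Expand variables, prefix //ldml/, and quote "[@" before compiling."""
    expanded = expand_variables(match, variables)
    return re.compile("//ldml/" + expanded.replace("[@", r"\[@"))
```

The compiled expression then matches, for instance, the wide name for Monday in the Gregorian calendar, but not a narrow name.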

7.1 Definitions

7.2 Data Requirements

The required data to qualify for the level is then the following.

  1. localeDisplayNames
    1. languages: localized names for all languages in Language-List.
    2. scripts: localized names for all scripts in Script-List.
    3. territories: localized names for all territories in Territory-List.
    4. variants, keys, types: localized names for any in use in Target-Territories; for example, a translation for PHONEBOOK in a German locale.
  2. dates: all of the following for each calendar in Calendar-List.
    1. calendars: localized names
    2. month names, day names, era names, and quarter names
      • context=format and width=narrow, wide, & abbreviated
      • plus context=standAlone and width=narrow, wide, & abbreviated, if the grammatical forms of these are different than for context=format.
    3. week: minDays, firstDay, weekendStart, weekendEnd
      • if some of these vary in territories in Territory-List, include territory locales for those that do.
    4. am, pm, eraNames, eraAbbr
    5. dateFormat, timeFormat: full, long, medium, short
    6. intervalFormatFallback

  3. numbers: symbols, decimalFormats, scientificFormats, percentFormats, currencyFormats for each number system in Number-System-List.
  4. currencies: displayNames and symbol for all currencies in Currency-List, for all plural forms
  5. transforms: (moderate and above) transliteration between Latin and each other script in Target-Scripts.

7.3 Default Values

Items should only be included if they are not the same as the default, which is:

8 Supplemental Metadata

Note that this section discusses the <metadata> element within the <supplementalData> element. For the per-locale metadata used in tests and the Survey Tool, see 9 Locale Metadata Elements.

The supplemental metadata contains information about the CLDR file itself, used to test validity and provide information for locale inheritance. A number of these elements are described in

8.1 Supplemental Alias Information

<!ELEMENT alias ( languageAlias*, scriptAlias*, territoryAlias*, variantAlias*, zoneAlias* ) >

<!ELEMENT languageAlias EMPTY >
<!ATTLIST languageAlias type NMTOKEN #IMPLIED >
<!ATTLIST languageAlias replacement NMTOKEN #IMPLIED >

<!ELEMENT scriptAlias EMPTY >
<!ATTLIST scriptAlias type NMTOKEN #IMPLIED >
<!ATTLIST scriptAlias replacement NMTOKEN #IMPLIED >

<!ELEMENT territoryAlias EMPTY >
<!ATTLIST territoryAlias type NMTOKEN #IMPLIED >
<!ATTLIST territoryAlias replacement NMTOKENS #IMPLIED >

<!ELEMENT variantAlias EMPTY >
<!ATTLIST variantAlias type NMTOKEN #IMPLIED >
<!ATTLIST variantAlias replacement NMTOKEN #IMPLIED >

<!ELEMENT zoneAlias EMPTY >
<!ATTLIST zoneAlias type CDATA #IMPLIED >
<!ATTLIST zoneAlias replacement CDATA #IMPLIED >
 

This element provides information as to parts of locale IDs that should be substituted when accessing CLDR data. This logical substitution should be done to both the locale ID, and to any lookup for display names of languages, territories, and so on. As with the display names, the language type and replacement may be any prefix of a valid locale ID, such as "no_NO".

<alias>
  <languageAlias type="in" replacement="id"/>
  <languageAlias type="sh" replacement="sr"/>
  <languageAlias type="sh_YU" replacement="sr_Latn_YU"/>
...
  <territoryAlias type="BU" replacement="MM"/>
...
</alias>
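A minimal sketch of applying the languageAlias entries above to a locale ID. It assumes underscore-separated IDs and that the longest aliased prefix takes precedence, so that sh_YU is preferred over sh; that precedence is an assumption of this sketch. The function name apply_language_alias is hypothetical, and territoryAlias handling is not shown.

```python
# Type/replacement pairs copied from the languageAlias examples above.
LANGUAGE_ALIASES = {"in": "id", "sh": "sr", "sh_YU": "sr_Latn_YU"}


def apply_language_alias(locale_id, aliases):
    """Replace the longest aliased prefix of an underscore-separated locale ID."""
    parts = locale_id.split("_")
    # Try the longest prefix first, then progressively shorter ones.
    for n in range(len(parts), 0, -1):
        prefix = "_".join(parts[:n])
        if prefix in aliases:
            return "_".join([aliases[prefix]] + parts[n:])
    return locale_id
```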

8.2 Supplemental Deprecated Information

<!ELEMENT deprecated ( deprecatedItems* ) >
<!ATTLIST deprecated draft ( approved | contributed | provisional | unconfirmed | true | false ) #IMPLIED > <!-- true and false are deprecated. -->

<!ELEMENT deprecatedItems EMPTY >
<!ATTLIST deprecatedItems draft ( approved | contributed | provisional | unconfirmed | true | false ) #IMPLIED > <!-- true and false are deprecated. -->
<!ATTLIST deprecatedItems type ( standard | supplemental | ldml | supplementalData | ldmlBCP47 ) #IMPLIED > <!-- standard | supplemental are deprecated -->
<!ATTLIST deprecatedItems elements NMTOKENS #IMPLIED >
<!ATTLIST deprecatedItems attributes NMTOKENS #IMPLIED >
<!ATTLIST deprecatedItems values CDATA #IMPLIED >

The deprecated items can be used to indicate elements, attributes, and attribute values that are deprecated. This means that the items are valid, but that their usage is strongly discouraged. When the same deprecatedItems element contains combinations of elements, attributes, and values, then the "least significant" items are only deprecated if they occur with the "more significant" items. For example:

Deprecated Items
<deprecatedItems elements="A B"> | A and B are deprecated
<deprecatedItems attributes="C D"> | C and D are deprecated on all elements
<deprecatedItems elements="A B" attributes="C D"> | C and D are deprecated, but only if they occur on elements A or B.
<deprecatedItems elements="A B" attributes="C D" values="E"> | E is deprecated, but only if it is a value of C in an element A or B

In each case, multiple items are space-delimited.
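One possible reading of the combination rule is sketched below: the least significant field present in a deprecatedItems element names what is deprecated, and the more significant fields restrict where it applies. Each rule is represented as a dictionary of its space-delimited attribute strings. The function name is_deprecated is hypothetical, and the sketch treats every listed attribute as eligible to carry a deprecated value, which is an interpretation rather than a statement of this specification.

```python
def is_deprecated(rule, element, attribute=None, value=None):
    """Check whether `rule` deprecates the given element, attribute, or value."""
    # The least significant field present decides what the rule targets.
    if "values" in rule:
        target = "value"
    elif "attributes" in rule:
        target = "attribute"
    else:
        target = "element"

    # The most specific argument supplied decides what is being asked about.
    if value is not None:
        query = "value"
    elif attribute is not None:
        query = "attribute"
    else:
        query = "element"
    if target != query:
        return False

    # Every field present in the rule must list the corresponding item.
    checks = (("elements", element), ("attributes", attribute), ("values", value))
    return all(
        actual in rule[field].split() for field, actual in checks if field in rule
    )
```

For example, with the rule {"elements": "A B", "attributes": "C D"}, attribute C is reported as deprecated on element A but not on some other element X.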

Where particular values are deprecated (such as territory codes like SU for Soviet Union), the names for such codes may be removed from the common/main translated data after some period of time. However, supplemental information for deprecated codes is typically retained, such as containment, likely subtags, usage of older currency codes, and so on. The English name may also be retained, for debugging purposes.

8.3 Default Content

<!ELEMENT defaultContent EMPTY >
<!ATTLIST defaultContent locales NMTOKENS #IMPLIED >

In CLDR, locales without territory information (or where needed, script information) provide data appropriate for what is called the default content locale. For example, the en locale contains data appropriate for en-US, while the zh locale contains content for zh-Hans-CN, and the zh-Hant locale contains content for zh-Hant-TW. The default content locales themselves thus inherit all of their contents, and are empty.

The choice of content is typically based on the largest literate population of the possible choices. Thus if an implementation only provides the base language (such as en), it will still get a complete and consistent set of data appropriate for a locale which is reasonably likely to be the one meant. Where other information is available, such as independent country information, that information can always be used to pick a different locale (such as en-CA for a website targeted at Canadian users).

If an implementation is to use a different default locale, then the data needs to be pivoted: all of the data from CLDR for the current default locale is pushed out to the locales that inherit from it, and then the new default content locale's data is moved into the base. There are tools in CLDR to perform this operation.
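Because default content locales are empty and inherit everything, a client can find the locale that actually holds the data by stripping trailing subtags. The sketch below assumes underscore-separated IDs in the locales attribute and a sample set chosen to mirror the en, zh, and zh-Hant examples above; the set contents and the function name data_locale are hypothetical.

```python
# Hypothetical excerpt of a defaultContent locales list.
DEFAULT_CONTENT = {"en_US", "zh_Hans", "zh_Hans_CN", "zh_Hant_TW"}


def data_locale(locale_id, default_content):
    """Strip trailing subtags while the ID is listed as a default content locale."""
    while locale_id in default_content and "_" in locale_id:
        locale_id = locale_id.rsplit("_", 1)[0]
    return locale_id
```

Under these assumptions zh_Hans_CN resolves to zh and zh_Hant_TW resolves to zh_Hant, while en_CA is returned unchanged because it carries its own data.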

9 Locale Metadata Elements

Note: This section refers to the per-locale <metadata> element, containing metadata about a particular locale. This is in contrast to the Supplemental Metadata, which is in the supplemental tree and is not specific to a locale.

<!ELEMENT metadata (casingData?) >
<!ELEMENT casingData (casingItem*) >
<!ELEMENT casingItem ( #PCDATA ) >
<!ATTLIST casingItem type CDATA #REQUIRED >
<!ATTLIST casingItem override (true | false) #IMPLIED >

The <metadata> element contains metadata about the locale for use by the Survey Tool or other tools in checking locale data; this data is not intended for export as part of the locale itself.

The <casingItem> element specifies the capitalization intended for the majority of the data in a given category within the locale. The purpose is to allow warnings to be issued to translators that anything deviating from that capitalization should be carefully reviewed. Its type attribute has one of the values used for the <contextTransformUsage> element, with the exception of the special value "all"; its value is one of the following:

The <casingItem> data is generated by a tool based on the data available in CLDR. In cases where the generated casing information is incorrect and needs to be manually edited, the override attribute is set to "true" so that the tool will not override the manual edits.