| Version | 1.6.1 |
| Authors | Mark Davis (markdavis@google.com) |
| Date | 2008-07-25 |
| This Version | http://unicode.org/reports/tr35/tr35-11.html |
| Previous Version | http://unicode.org/reports/tr35/tr35-10.html |
| Latest Version | http://unicode.org/reports/tr35/ |
| Corrigenda | http://unicode.org/cldr/corrigenda.html |
| Latest Working Draft | http://unicode.org/draft/reports/tr35/tr35.html |
| Namespace | http://unicode.org/cldr/ |
| DTDs | http://unicode.org/cldr/dtd/1.6/ldml.dtd http://unicode.org/cldr/dtd/1.6/ldmlSupplemental.dtd |
| Revision | 11 |
This document describes an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For possible errata for this document, see [Errata].
Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. However, there remain differences in the locale data used by different systems.
The best practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.
But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous; all within acceptable limits for human beings, but yielding different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [Comparisons].)
Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.
This document specifies an XML format for the communication of locale data: the Unicode Locale Data Markup Language (LDML). This provides a common format for systems to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings (see [UCA]). Unicode LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain, or create a customized locale for a minority language. Unicode LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.
For more information, see the Common Locale Data Repository project page [LocaleProject].
There are many ways to use the Unicode LDML format and the data in CLDR, and the Unicode Consortium does not restrict the ways in which the format or data are used. However, an implementation may also claim conformance to LDML or to CLDR, as follows:
UAX35-C1. An implementation that claims conformance to this specification shall:
UAX35-C2. An implementation that claims conformance to Unicode locale or language identifiers shall:
External specifications may also reference particular components of Unicode locale or language identifiers, such as:
Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes.
Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use data in LDML, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.
The first issue is basic: what is a locale? In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time zones, languages, countries, and scripts. The data can also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services.
Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's time zone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, and so on), music preference, religion, party affiliation, favorite charity, and so on.
Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.
In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, and so on). The format in this document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more information, see Appendix D: Language and Locale IDs.
We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a resource or field, and a tag indicating the key of the resource is called a resource tag.
Unicode LDML uses stable identifiers for distinguishing among languages, locales, regions, currencies, time zones, transforms, and so on. There are many systems for identifiers for these entities. The Unicode LDML identifiers may not match the identifiers used on a particular target system. If so, some process of identifier translation may be required when using LDML data.
A Unicode language identifier has the following structure:
unicode_language_id := root
| (unicode_language_subtag
([-_] unicode_script_subtag)?
([-_] unicode_region_subtag)?
([-_] unicode_variant_subtag)*)
unicode_language_subtag := BCP47_language_subtag
| ISO_639_3_code
| ISO_639_5_code
unicode_script_subtag := BCP47_script_subtag
unicode_region_subtag := BCP47_region_subtag
unicode_variant_subtag := BCP47_variant_subtag
| grandfathered_variant_subtags
As usual, x? means that x is optional; x* means that x occurs zero or more times.
For example, "en-US" (American English), "en_GB" (British English), "es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all Unicode language identifiers.
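The grammar above can be sketched as a small parser. This is only an illustration: the subtag lengths (2-3 letters for language, 4 letters for script, 2 letters or 3 digits for region, 5-8 alphanumerics or a digit plus 3 alphanumerics for variants) are assumptions borrowed from [BCP47], and `parse_language_id` is a hypothetical helper name, not part of LDML.

```python
import re

# Subtag shapes below are assumptions taken from BCP 47; the grammar in the
# text only names the subtag categories, it does not give their lengths.
_LANGUAGE_ID = re.compile(
    r"""
    (?P<root>root)
    |
    (?P<language>[A-Za-z]{2,3})
    (?:[-_](?P<script>[A-Za-z]{4}))?
    (?:[-_](?P<region>[A-Za-z]{2}|[0-9]{3}))?
    (?P<variants>(?:[-_](?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)
    """,
    re.VERBOSE,
)


def parse_language_id(identifier):
    """Split a Unicode language identifier into subtags; return None if it does not match."""
    match = _LANGUAGE_ID.fullmatch(identifier)
    if match is None:
        return None
    if match.group("root"):
        return {"language": "root", "script": None, "region": None, "variants": []}
    # The variants group keeps its separators, so split and drop empty pieces.
    variants = [v for v in re.split(r"[-_]", match.group("variants")) if v]
    return {
        "language": match.group("language"),
        "script": match.group("script"),
        "region": match.group("region"),
        "variants": variants,
    }


print(parse_language_id("uz-Cyrl"))
print(parse_language_id("en_US_POSIX"))
```

The sketch checks shape only; it does not verify that a subtag is actually registered.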
As for terminology, the term code may also be used instead of "subtag", and "territory" instead of "region". The primary language subtag is also called the base language code. For example, the base language code for "en-US" (American English) is "en" (English).
A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure:
unicode_locale_id := unicode_language_id
(unicode_locale_extensions)?
unicode_locale_extensions := "@" key "=" type
(";" key "=" type )*
For historical reasons, this is called a Unicode locale identifier. However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see Appendix D: Language and Locale IDs. Note that the type may also be referred to as a key-value.
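As a rough illustration of the locale identifier structure, the following sketch separates the language identifier from the optional extensions. The function name `split_locale_id` is hypothetical, and the choice to raise `ValueError` on a malformed pair is an assumption of this sketch rather than behavior required by the specification.

```python
def split_locale_id(locale_id):
    """Split 'language_id@key=type;key=type' into (language_id, {key: type})."""
    language_id, at_sign, extension_text = locale_id.partition("@")
    extensions = {}
    if at_sign:
        # After "@" the grammar requires at least one key "=" type pair,
        # with further pairs separated by ";".
        for pair in extension_text.split(";"):
            key, equals, type_value = pair.partition("=")
            if not equals or not key or not type_value:
                raise ValueError(f"malformed key=type pair: {pair!r}")
            extensions[key] = type_value
    return language_id, extensions


print(split_locale_id("de_DE@collation=phonebook;currency=DDM"))
```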
The Unicode language identifier is based on [BCP47]. However, it differs in the following ways:
The identifiers can vary in case and in the separator characters. The "-" and "_" separators are treated as equivalent. All identifier field values are case-insensitive, except for the type, which is case-sensitive. However, customarily the language subtag is in lower case, the territory subtag is in upper case, the script subtag is in title case (that is, the first character is upper case and other characters are lower case), and variant subtags are in upper case. These conventions are used in the CLDR file names, which may be case-sensitive depending on the operating system. The normal form of a locale ID in the CLDR data uses "_". Implementations can choose an alternate canonical form in terms of casing and separator characters.
Customarily the currency IDs are upper case and time zone IDs are title cased by field (as defined in the time zone database); other key and type subtags are lower case.
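The customary casing and the "_" separator can be applied mechanically to the language identifier portion. The sketch below assumes that any four-letter alphabetic subtag after the language is a script; `canonical_language_id` is a hypothetical name, and locale extensions (key/type pairs) are deliberately not handled.

```python
import re


def canonical_language_id(identifier):
    """Apply the customary CLDR casing and the "_" separator to a language identifier."""
    subtags = re.split(r"[-_]", identifier)
    result = [subtags[0].lower()]                # language: lower case
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())        # script: title case (assumed by shape)
        else:
            result.append(subtag.upper())        # territory and variants: upper case
    return "_".join(result)


print(canonical_language_id("EN-us"))
print(canonical_language_id("uz-cyrl"))
```

Because separators and case are treated as equivalent, two identifiers can be compared by comparing their canonical forms.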
The Unicode language and locale identifier field values are given in the following table. Note that some private-use field values may be given specific values.
| Field | Allowable Characters | Allowable values |
|---|---|---|
| unicode_language_subtag (also known as a Unicode base language code) | ASCII letters | [BCP47] subtag values marked as Type: language. ISO 639-3 and ISO 639-5 codes are also allowed, where they do not have ISO 639-1 equivalents. At publication time, these were slated to be added to the next version of [BCP47], but it is unclear when that version will be approved. ISO 639-3 introduces the notion of "macrolanguages", where certain ISO 639-1 or ISO 639-2 codes are given broad semantics, and additional codes are given for the narrower semantics. For backwards compatibility, Unicode language identifiers retain use of the narrower semantics for these codes. That is, the following table lists these cases: Thus Unicode language identifiers use "ar-EG" for Standard Arabic (Egypt), not "arb-EG"; they use "zh-TW" for Mandarin Chinese (Taiwan), not "cmn-TW". The private use codes from qfz..qtz will never be used by Unicode identifiers, and are thus safe for use for other purposes by applications. |
| unicode_script_subtag (also known as a Unicode script code) | ASCII letters | [BCP47] subtag values marked as Type: script. In most cases the script is not necessary, since the language is only customarily written in a single script. Examples of cases where it is used are: Unicode allows for the use of the Unicode Script values [UAX24]: The private use subtags from Qaaq..Qabx will never be used by Unicode identifiers, and are thus safe for use for other purposes by applications. |
| unicode_region_subtag (also known as a Unicode region code, or a Unicode territory code) | ASCII letters, numbers | [BCP47] subtag values marked as Type: region, or any UN M.49 [UNM49] code that does not correspond to a [BCP47] region subtag. There are two types of region subtags: There are three private use subtags defined for Unicode identifiers: The private use subtags from XA..XZ will never be used by Unicode identifiers, and are thus safe for use for other purposes by applications. |
| unicode_variant_subtag (also known as a Unicode language variant code) | ASCII letters | Values used in Unicode are discussed below. For information on the process for adding new standard variants or element/type pairs, see [LocaleProject]. |
| key | ASCII letters and digits | |
| type | ASCII letters, and digits, and "-" | |
Examples:
en fr_BE de_DE@collation=phonebook;currency=DDM
A locale that only has a language subtag (and optionally a script subtag) is called a language locale; one with both language and territory subtag is called a territory locale (or country locale).
The variant codes specify particular variants of the locale, typically with special options. They cannot overlap with script or territory codes, so they must have more than four letters. The currently defined variants include:
| variant | Description |
|---|---|
| <BCP47 variants> | As defined in [BCP47], plus: |
| BOKMAL | Bokmål, variant of Norwegian (deprecated: use nb) |
| NYNORSK | Nynorsk, variant of Norwegian (deprecated: use nn) |
| AALAND | Åland, variant of Swedish used in Finland (deprecated: use AX) |
| POSIX | A POSIX-style invariant locale. |
| REVISED | For revised orthography |
| SAAHO | The Saaho variant of Afar |
Note: The first two of the above variants are for backwards compatibility. Typically the entire contents of these are defined by an <alias> element pointing at nb_NO (Norwegian Bokmål) and nn_NO (Norwegian Nynorsk) locale IDs. See also Appendix K: Valid Attribute Values.
The currently defined optional key/type combinations include the following. Additional type values are defined in the detail sections of this document or in Appendix K: Valid Attribute Values. The assignment of values needs to ensure that they are unique if truncated to eight letters and digits.
| key | type | Description |
|---|---|---|
| collation | standard | The default ordering for each language. For root it is [UCA] order; for each other locale it is the same as UCA ordering except for appropriate modifications to certain characters for that language. The following are additional choices for certain locales; they only have effect in those locales. |
| | phonebook | For a phonebook-style ordering (used in German). |
| | pinyin | Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese) |
| | traditional | For a traditional-style sort (as in Spanish) |
| | stroke | Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese) |
| | direct | Hindi variant |
| | posix | A "C"-based locale. (no longer in CLDR data) |
| | big5han | Pinyin ordering for Latin, big5 charset ordering for CJK characters. (used in Chinese) |
| | gb2312han | Pinyin ordering for Latin, gb2312han charset ordering for CJK characters. (used in Chinese) |
| | unihan | Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters. (used in Chinese) |
| calendar (For information on the calendar algorithms associated with the data used with the above types, see [Calendars].) | gregorian | (default) |
| | islamic (alias: arabic) | Astronomical Arabic |
| | chinese | Traditional Chinese calendar |
| | islamic-civil (alias: civil-arabic) | Civil (algorithmic) Arabic calendar |
| | hebrew | Traditional Hebrew Calendar |
| | japanese | Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor) |
| | buddhist (alias: thai-buddhist) | Thai Buddhist Calendar (same as Gregorian except for the year) |
| | persian | Persian Calendar |
| | coptic | Coptic Calendar |
| | ethiopic | Ethiopic Calendar |
| collation parameters: | Associated values as defined in: 5.14.1 <collation> | Semantics as defined in: 5.14.1 <collation> |
| currency (also known as a Unicode currency code) | ISO 4217 code, plus others in common use | Currency value identified by ISO 4217 code, plus others in common use. Also uses XXX as Unknown or Invalid Currency. See Appendix K: Valid Attribute Values and also [Data Formats] |
| time zone (also known as a Unicode time zone code) | TZID, plus the value: Etc/Unknown | Identification for time zone according to the TZ Database, plus the value Etc/Unknown. Unicode LDML supports all of the time zone IDs by mapping all equivalent time zone IDs to a canonical ID for translation. This canonical time zone ID is not the same as the zone.tab time zone ID found in [Olson]. For more information, see Section 5.9.2 Time Zone Names, Appendix F: Date Format Patterns, and Appendix J: Time Zone Display Names. |
For more information on the allowed attribute values, see the specific elements below, and Appendix K: Valid Attribute Values.
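The requirement that type values stay unique when truncated to eight letters and digits can be checked mechanically. The sketch below is one possible reading of that rule: it assumes that "-" is dropped before truncation and that comparison is case-sensitive, neither of which is spelled out in the text. `truncation_collisions` is a hypothetical helper name.

```python
def truncation_collisions(values, limit=8):
    """Group values that become identical when cut to `limit` letters and digits."""
    groups = {}
    for value in values:
        # Assumption: only letters and digits count toward the limit.
        short = "".join(ch for ch in value if ch.isalnum())[:limit]
        groups.setdefault(short, []).append(value)
    return {short: group for short, group in groups.items() if len(group) > 1}


collation_types = ["standard", "phonebook", "pinyin", "traditional", "stroke",
                   "direct", "posix", "big5han", "gb2312han", "unihan"]
print(truncation_collisions(collation_types))
```

An empty result means that no two values in the list collide under this reading of the rule.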
The following identifiers are used to indicate an unknown or invalid code in Unicode language and locale identifiers. For Unicode identifiers, the Region code uses a private use ISO 3166 code, and Time Zone code uses an additional code; the others are defined by the relevant standards. When these codes are used in APIs connected with Unicode identifiers, the meaning is that either there was no identifier available, or that at some point an input identifier value was determined to be invalid or ill-formed.
| Code Type | Value | Description in Referenced Standards |
|---|---|---|
| Language | und | Undetermined language |
| Script | Zzzz | Code for uncoded script, Unknown [UAX24] |
| Region | ZZ | Unknown or Invalid Territory |
| Currency | XXX | The codes assigned for transactions where no currency is involved |
| Time Zone | Etc/Unknown | Unknown or Invalid Time Zone |
When only the script or region is known, then a locale ID will use "und" as the language subtag portion. Thus the locale tag "und_Grek" represents the Greek script; "und_US" represents the US territory.
For region codes, ISO and the UN establish a mapping to three-letter codes and numeric codes. However, this does not extend to the private use codes, which are the codes 900-999 (total: 100), and AAA-AAZ, QMA-QZZ, XAA-XZZ, and ZZA-ZZZ (total: 1092). Unicode identifiers supply standard mappings to these: for the numeric codes, they use the top of the numeric private use range; for the 3-letter codes, they double the final letter. These are the resulting mappings for all of the private use region codes:
| Region | UN/ISO Numeric | ISO 3-Letter |
|---|---|---|
| AA | 958 | AAA |
| QM..QZ | 959..972 | QMM..QZZ |
| XA..XZ | 973..998 | XAA..XZZ |
| ZZ | 999 | ZZZ |
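The mappings in this table follow directly from the two rules stated above (fill the top of the numeric private use range, and double the final letter). A minimal sketch that regenerates them, with `private_use_region_mappings` as a hypothetical helper name:

```python
import string


def private_use_region_mappings():
    """Return {region: (numeric, three_letter)} for the private use region codes."""
    regions = ["AA"]
    regions += ["Q" + c for c in string.ascii_uppercase[12:]]   # QM..QZ
    regions += ["X" + c for c in string.ascii_uppercase]        # XA..XZ
    regions += ["ZZ"]
    # Numeric codes occupy the top of the private use range, ending at 999.
    first_numeric = 999 - len(regions) + 1
    return {
        region: (first_numeric + i, region + region[-1])  # 3-letter: double the final letter
        for i, region in enumerate(regions)
    }


mappings = private_use_region_mappings()
print(mappings["AA"], mappings["QM"], mappings["XZ"], mappings["ZZ"])
```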
For script codes, ISO 15924 supplies a mapping (however, the numeric codes are not in common use):
| Script | Numeric |
|---|---|
| Qaaa..Qabx | 900..949 |
Unicode language and locale IDs can be converted to valid [BCP47] language tags by performing the following transformation.
Thus for example, we get the following conversion:
| Unicode | en_US_POSIX@calendar=islamic;collation=traditional;colStrength=secondary |
| [BCP47] | en-US-x-ldml-POSIX-k-calendar-islamic-k-collation-traditio-k-colStren-secondar |
The locale id format generally follows the description in the OpenI18N Locale Naming Guideline [NamingGuideline], with some enhancements. The main differences from those guidelines are that the locale id:
The XML format relies on an inheritance model, whereby the resources are collected into bundles, and the bundles organized into a tree. Data for the many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as root. Wherever possible, the resources in the root are language & territory neutral. For example, the collation (sorting) order in the root is the default Unicode Collation Algorithm order (see [UCA]). Since English language collation has the same ordering, the 'en' locale data does not need to supply any collation data, nor does either the 'en_US' or the 'en_IE' locale data.
Given a particular locale id "en_IE_someVariant", the search chain for a particular resource is the following.
en_IE_someVariant en_IE en root
If a type and key are supplied in the locale id, then logically the chain from that id to the root is searched for a resource tag with a given type, all the way up to root. If no resource is found with that tag and type, then the chain is searched again without the type.
Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their data from the language locale: "en" will contain the bulk of the data; "en_IE" will only contain a few items like currency. All data that is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less error-prone) maintenance. At the script or region level, the "primary" child locale will be empty, since its parent will contain all of the appropriate resources for it. For more information see Appendix P.3 Default Content.
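The search chain and the "only store differences" model can be sketched as follows. The bundles are modeled as plain dictionaries keyed by resource tag, which is a simplification of the real XML structure; the function names and the sample data are hypothetical, and key/type handling is left out.

```python
import re


def search_chain(language_id):
    """Return the lookup chain from a locale id up to root, most specific first."""
    subtags = re.split(r"[-_]", language_id)
    chain = ["_".join(subtags[:n]) for n in range(len(subtags), 0, -1)]
    if chain[-1] != "root":
        chain.append("root")
    return chain


def lookup(bundles, language_id, resource_tag):
    """Return (locale, value) from the first bundle in the chain holding the resource."""
    for locale in search_chain(language_id):
        bundle = bundles.get(locale, {})
        if resource_tag in bundle:
            return locale, bundle[resource_tag]
    return None


# Illustrative data only: each bundle holds just what differs from its parent.
bundles = {
    "root": {"collation": "UCA order"},
    "en": {"month1": "January"},
    "en_IE": {"currency": "EUR"},
}
print(search_chain("en_IE_someVariant"))
print(lookup(bundles, "en_IE_someVariant", "month1"))
```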
If a language has more than one script in customary modern use, then the CLDR file structure in common/main follows the following model:
lang
lang_script
lang_script_region
lang_region (aliases to lang_script_region)
There are actually two different kinds of fallback: resource bundle lookup and resource item lookup. For the former, a process is looking to find the first, best resource bundle it can; for the latter, it is fallback within bundles on individual items, like the translated name for the region "CN" in Breton. These are closely related, but distinct, processes. Below "key" stands for zero or more key/type pairs.
| Lookup Type | Example | Comments |
|---|---|---|
| Resource bundle lookup | se-FI → se | * default may have its own inheritance chain; for example, it may be "en-GB → en". In that case, the chain is expanded by inserting the chain, resulting in: |
| Resource item lookup | se-FI+key → se+key | * if there is a root_alias to another key or locale, then insert that entire chain. For example, suppose that months for another calendar system have a root alias to Gregorian months. In that case, the root alias would change the key, and retry from se-FI downward. |
The fallback is a bit different for these two cases; internal aliases and keys are not involved in the bundle lookup, and the default locale is not involved in the item lookup. Moreover, the resource item lookup must remain stable, because the resources are built with a certain fallback in mind; changing the core fallback order can render the bundle structure incoherent. Resource bundle lookup, on the other hand, is more flexible; changes in the view of the "best" match between the input request and the output bundle are more tolerable when they represent overall improvements for users. For more information, see Section 5.3.1 Fallback_Elements.
Where the LDML inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format for use by that system, by adding all inherited data to each locale data set.
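Fully resolving a locale for a target system without inheritance amounts to merging every bundle along the chain, from root down to the most specific locale. The sketch below models bundles as flat dictionaries, which is a simplification (real LDML data is nested and also involves aliases); `resolve` and the sample data are hypothetical.

```python
def resolve(bundles, chain):
    """Merge bundles along a chain (most specific first) into one flat, fully resolved dict."""
    resolved = {}
    for locale in reversed(chain):           # start at root, end at the most specific locale
        resolved.update(bundles.get(locale, {}))
    return resolved


# Illustrative data only.
bundles = {
    "root": {"collation": "UCA order", "currency": "XXX"},
    "en": {"month1": "January"},
    "en_IE": {"currency": "EUR"},
}
print(resolve(bundles, ["en_IE", "en", "root"]))
```

More specific bundles are applied last, so their values override inherited ones in the resolved output.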
For a more complete description of how inheritance applies to data, and the use of keywords, see Appendix I: Inheritance and Validity.
The locale data does not contain general character properties that are derived from the Unicode Character Database [UCD]. That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the CLDR data requires use of UCD data. In addition, POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.
Warning: If a locale has a different script than its parent (for example, sr_Latn), then special attention must be paid to make sure that all inheritance is covered. For example, auxiliary exemplar characters may need to be empty ("[]") to block inheritance.
In clearly specified instances, resources may inherit from within the same locale. For example, currency format symbols inherit from the number format symbols; the Buddhist calendar inherits from the Gregorian calendar. This only happens where documented in this specification. In these special cases, the inheritance functions as normal, up to the root. If the data is not found along that path, then a second search is made, logically changing the element/attribute to the alternate values.
For example, for the locale "en_US" the month data in <calendar type="buddhist"> inherits first from <calendar type="buddhist"> in "en", then in "root". If not found there, then it inherits from <calendar type="gregorian"> in "en_US", then "en", then in "root".
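The two-pass search described above can be sketched as a lookup that first walks the whole chain with the requested calendar and only then retries with the alternate one. Items are keyed here by (calendar type, item name) tuples in flat dictionaries, which is an assumption of this sketch; `lookup_calendar_item` and the sample data are hypothetical.

```python
def lookup_calendar_item(bundles, chain, calendar, item, alternate="gregorian"):
    """Search the chain for (calendar, item); if absent everywhere, retry with the alternate."""
    for calendar_type in (calendar, alternate):
        for locale in chain:                 # chain is most specific first, ending at root
            value = bundles.get(locale, {}).get((calendar_type, item))
            if value is not None:
                return locale, calendar_type, value
    return None


# Illustrative data only: no Buddhist month names are defined anywhere.
bundles = {
    "root": {("gregorian", "month1"): "M01"},
    "en": {("gregorian", "month1"): "January"},
    "en_US": {},
}
print(lookup_calendar_item(bundles, ["en_US", "en", "root"], "buddhist", "month1"))
```

Note that a Buddhist value found anywhere on the chain, even in root, takes precedence over a Gregorian value in a more specific locale.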
There are two kinds of data that can be expressed in LDML: language-dependent data and supplementary data. In either case, data can be split across multiple files, which can be in multiple directory trees.
For example, the language-dependent data for Japanese in CLDR is present in the following files:
The status of the data is the same, whether or not data is split. That is, for the purpose of validation and lookup, all of the data for the above ja.xml files is treated as if it was in a single file.
Supplemental data relating to Japan or the Japanese writing system can be found in:
The following sections describe the structure of the XML format for language-dependent data. The more precise syntax is in the DTD, listed at the top of this document; however, the DTD does not describe all the constraints on the structure.
To start with, the root element is <ldml>, with the following DTD entry:
<!ELEMENT ldml (identity, (alias |(fallback?, localeDisplayNames?, layout?, characters?, delimiters?, measurement?, dates?, numbers?, units?, collations?, posix?, segmentations?, references?, special*))) >
That element contains the following elements:
The structure of each of these elements and their contents will be described below. The first few elements have little structure, while dates, numbers, and collations are more involved.
The XML structure is stable over releases. Elements and attributes may be deprecated: they are retained in the DTD but their usage is strongly discouraged. In most cases, an alternate structure is provided for expressing the information.
In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information (such as numbers or dates). The reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be significant in translated material.
There are two kinds of elements in LDML: rule elements and structure elements. For structure elements, there are restrictions to allow for effective inheritance and processing:
Rule elements do not have this restriction, but also do not inherit, except as an entire block. The rule elements are listed in serialElements in the supplemental metadata. See also Appendix I: Inheritance and Validity.
Note that the data in examples given below is purely illustrative, and does not match any particular language. For a more detailed example of this format, see [Example]. There is also a DTD for this format, but remember that the DTD alone is not sufficient to understand the semantics, the constraints, nor the interrelationships between the different elements and attributes. You may wish to have copies of each of these to hand as you proceed through the rest of this document.
In particular, all elements allow for draft versions to coexist in the file at the same time. Thus most elements are marked in the DTD as allowing multiple instances. However, unless an element is listed as a serialElement, or has a distinguishing attribute, it can only occur once as a subelement of a given element. Thus, for example, the following is illegal even though allowed by the DTD:
<languages>
<language type="aa">...</language>
<language type="aa">..</language>
</languages>
There must be only one instance of these per parent, unless there are other distinguishing attributes (such as an alt attribute).
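Since the DTD cannot enforce this constraint, a tool has to check it separately. The sketch below flags sibling elements that share a tag and the same distinguishing attribute values. It treats only `type` and `alt` as distinguishing, which is an assumption made for brevity, and it does not exempt elements listed as serialElements; `duplicate_children` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumption: "type" and "alt" stand in for the full set of distinguishing attributes.
DISTINGUISHING = ("type", "alt")


def duplicate_children(parent):
    """Return the (tag, attribute values) keys that occur more than once under `parent`."""
    seen = set()
    duplicates = []
    for child in parent:
        key = (child.tag, tuple(child.get(name) for name in DISTINGUISHING))
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


illegal = ET.fromstring(
    '<languages><language type="aa">A</language><language type="aa">B</language></languages>'
)
print(duplicate_children(illegal))
```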
In general, data should be in NFC format. Exceptions to this include transforms, segmentations, and pc/sc/tc/qc/ic rules in collation. Thus LDML documents must not be normalized as a whole. To prevent problems with normalization, no element value can start with a combining slash (U+0338 COMBINING LONG SOLIDUS OVERLAY).
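A per-value check along these lines can be sketched with Python's standard `unicodedata` module. The helper name `check_element_value` is hypothetical, and the sketch applies the NFC test to every value; a real tool would skip the exceptions named above (transforms, segmentations, and the listed collation rules).

```python
import unicodedata


def check_element_value(value):
    """Return a list of normalization problems found in one element value."""
    problems = []
    if unicodedata.normalize("NFC", value) != value:
        problems.append("not in NFC")
    if value.startswith("\u0338"):           # U+0338 COMBINING LONG SOLIDUS OVERLAY
        problems.append("starts with U+0338")
    return problems


print(check_element_value("e\u0301"))   # "e" followed by COMBINING ACUTE ACCENT
print(check_element_value("\u00e9"))    # precomposed LATIN SMALL LETTER E WITH ACUTE
```

Checking value by value, rather than normalizing the whole file, is in line with the rule that LDML documents must not be normalized as a whole.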
Lists, such as singleCountries, are space-delimited. That means that they are separated by one or more XML whitespace characters, and that leading and trailing spaces are to be ignored (that is, they behave like NMTOKENS). These include:
At any level in any element, two special elements are allowed.
<special xmlns:yyy="xxx">
This element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute, which specifies the XML namespace of the special data. For example, the following uses the version 1.0 POSIX special element.
<!DOCTYPE ldml SYSTEM "http://unicode.org/cldr/dtd/1.0/ldml.dtd" [
  <!ENTITY % posix SYSTEM "http://unicode.org/cldr/dtd/1.0/ldmlPOSIX.dtd">
  %posix;
]>
<ldml>
  ...
  <special xmlns:posix="http://www.opengroup.org/regproducts/xu.htm">
    <!-- old abbreviations for pre-GUI days -->
    <posix:messages>
      <posix:yesstr>Yes</posix:yesstr>
      <posix:nostr>No</posix:nostr>
      <posix:yesexpr>^[Yy].*</posix:yesexpr>
      <posix:noexpr>^[Nn].*</posix:noexpr>
    </posix:messages>
  </special>
</ldml>
<!ELEMENT alias (special*) >
<!ATTLIST alias source NMTOKEN #REQUIRED >
<!ATTLIST alias path CDATA #IMPLIED>
The contents of any element can be replaced by an alias, which points to another source for the data. The elements in that source (a locale ID) are to be fetched from the corresponding location in the other source based on the path. Normal resource searching is to be used; take the following example:
<ldml>
<collations>
<collation type="phonebook">
<alias source="de_DE"/>
</collation>
</collations>
</ldml>
The resource bundle at "de_DE" will be searched for a resource element at the same position in the tree with type "collation". If not found there, then the resource bundle at "de" will be searched, and so on. For an example of how this works with inheritance, look at the following table (where green indicates inherited items). Note in particular that an alias "reroutes" the inheritance; nothing in the parent affects the contents of an item with an alias. Thus the red item below is blocked.
| en | en_US | Resolved | ||
|---|---|---|---|---|
<x> |
<x> |
<x> |
||
| de | de_DE | Resolved | de_DE_1901 | Resolved |
<x> |
<x> |
<x> |
<x> |
<x> |
If the path attribute is present, then its value is an [XPath] that points to a different node in the tree. For example:
<alias source="root" path="../monthWidth[@type='wide']"/>
If the path attribute is not present, the default is the same position in the tree. All of the attributes in the [XPath] must be distinguishing attributes. For more details, see Appendix I: Inheritance and Validity.
There is a special value for the source attribute, the constant source="locale". This special value is equivalent to the locale being resolved. For example, consider the following case, where locale data for 'de' is being resolved:
| Root | de | Resolved |
|---|---|---|
<x> |
<x> |
<x> |
<y> |
<y> |
<y> |
The first row shows the inheritance within the <x> element, whereby <c> is inherited from root. The second shows the inheritance within the <y> element, whereby <a>, <c>, and <d> are inherited also from root, but from an alias there. The alias in root is logically replaced not by the elements in root itself, but by elements in the 'target' locale.
For more details on data resolution, see Appendix I: Inheritance and Validity.
It is an error to have a circular chain of aliases. That is, a collection of LDML XML documents must not have situations where a sequence of alias lookups (including inheritance and multiple inheritance) can be followed indefinitely without terminating.
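A check for circular chains could be sketched as follows. This is illustrative only: the `aliases` table and its (locale, path) keys are hypothetical, and the sketch follows direct alias links only, whereas a complete check would also have to take inheritance and multiple inheritance into account.

```python
def follow_aliases(start, aliases):
    """Follow alias links from start; raise ValueError on a circular chain.

    `aliases` maps a (locale, path) pair to the (locale, path) pair that
    its alias points to. Pairs that are not aliases are simply absent.
    """
    seen = []
    node = start
    while node in aliases:
        if node in seen:
            raise ValueError("circular alias chain: %r" % (seen + [node],))
        seen.append(node)
        node = aliases[node]
    return node

# A terminating chain: de_AT borrows its data from de.
terminating = {("de_AT", "collations"): ("de", "collations")}
print(follow_aliases(("de_AT", "collations"), terminating))

# A circular chain: each entry points at the other.
circular = {("a", "x"): ("b", "x"), ("b", "x"): ("a", "x")}
try:
    follow_aliases(("a", "x"), circular)
except ValueError as error:
    print("error:", error)
```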
<displayName>
Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number format, used to format numbers using the conventions of that locale, can have a translated name for presentation in GUIs.
<numberFormat>
<displayName>Prozentformat</displayName>
...
</numberFormat>
Where present, the display names must be unique; that is, two distinct codes must not get the same display name. (There is one exception to this: in time zones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different time zone IDs.) Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].
<default type="someID"/>
In some cases, a number of elements are present. The default element can be used to indicate which of them is the default, in the absence of other information. The value of the type attribute is to match the value of the type attribute for the selected item.
<timeFormats>
  <default type="medium" />
  <timeFormatLength type="full">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern>
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="long">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern>
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="medium">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a</pattern>
    </timeFormat>
  </timeFormatLength>
  ...
Like all other elements, the <default> element is inherited. Thus, it can also refer to inherited resources. For example, suppose that the above resources are present in fr, and that in fr_BE we have the following:
<timeFormats>
<default type="long"/>
</timeFormats>
In that case, the default time format for fr_BE would be the inherited "long" resource from fr. Now suppose that we had in fr_CA:
<timeFormatLength type="medium">
  <timeFormat type="standard">
    <pattern type="standard">...</pattern>
  </timeFormat>
</timeFormatLength>
In this case, the <default> is inherited from fr, and has the value "medium". It thus refers to this new "medium" pattern in this resource bundle.
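The fr, fr_BE, and fr_CA cases can be traced with a small model. This is a sketch under stated assumptions: the bundles are represented as hypothetical nested dictionaries rather than parsed LDML, the parent chain is obtained by simple truncation, and "..." stands for the pattern that is elided in the fr_CA example.

```python
# Hypothetical in-memory bundles mirroring the fr, fr_BE and fr_CA examples.
# "..." stands for the pattern elided in the text.
BUNDLES = {
    "root": {},
    "fr": {
        "default": "medium",
        "timeFormatLength": {
            "full": "h:mm:ss a z",
            "long": "h:mm:ss a z",
            "medium": "h:mm:ss a",
        },
    },
    "fr_BE": {"default": "long"},
    "fr_CA": {"timeFormatLength": {"medium": "..."}},
}

def chain(locale):
    """Yield the locale, then its truncation parents, then root."""
    while locale:
        yield locale
        locale = locale.rpartition("_")[0]
    yield "root"

def lookup(locale, *keys):
    """Return the first value found for the key path along the chain."""
    for name in chain(locale):
        node = BUNDLES.get(name, {})
        for key in keys:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is not None:
            return node
    return None

def default_time_pattern(locale):
    """Resolve <default>, then the pattern it selects, each by inheritance."""
    choice = lookup(locale, "default")
    return choice, lookup(locale, "timeFormatLength", choice)

print(default_time_pattern("fr_BE"))  # ('long', 'h:mm:ss a z')
print(default_time_pattern("fr_CA"))  # ('medium', '...')
```

The default and the item it selects are looked up independently, which is why fr_BE can point at a pattern defined only in fr, and why the default inherited by fr_CA picks up the pattern overridden in fr_CA.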
Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, in certain instances extra syntax is required to represent those code points that cannot be otherwise represented in element content. These escapes are only allowed in certain elements, according to the DTD.
| Code Point | XML Example |
|---|---|
| U+0000 | <cp hex="0"> |
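A reader of such content might decode the escapes along the following lines. This is a sketch only: the function name is hypothetical, the sample uses the month element (whose content model allows cp), and the cp element is written in self-closed form so that the sample is well-formed XML.

```python
import xml.etree.ElementTree as ET

def element_text(elem):
    """Return the element's text with each <cp hex="..."/> child decoded."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "cp":
            # The hex attribute holds the code point in hexadecimal.
            parts.append(chr(int(child.get("hex"), 16)))
        # Text that follows the child element is stored as its tail.
        parts.append(child.tail or "")
    return "".join(parts)

month = ET.fromstring('<month type="1">a<cp hex="0"/>b</month>')
print(repr(element_text(month)))  # 'a\x00b'
```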
The content of certain elements, such as date or number formats, may consist of several sub-elements with an inherent order (for example, the year, month, and day for dates). In some cases, the order of these sub-elements may be changed depending on the bidirectional context in which the element is embedded.
For example, short date formats in languages such as Arabic may contain neutral or weak characters at the beginning or end of the element content. In such a case, the overall order of the sub-elements may change depending on the surrounding text.
Element content whose display may be affected in this way should include an explicit direction mark, such as U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK, at the beginning or end of the element content, or both.
<... type="stroke" ...>
The attribute type is also used to indicate an alternate resource that can be selected with a matching type=option in the locale id modifiers, or be referenced by a default element. For example:
<ldml>
...
<currencies>
<currency>...</currency>
<currency type="preEuro">...</currency>
</currencies>
</ldml>
<... draft="unconfirmed" ...>
If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary draft value), as per the following:
For more information on precisely how these values are computed for any given release, see Data Submission and Vetting Process on the CLDR website.
Normally draft attributes should only occur on "leaf" elements. For a more formal description of how elements are inherited, and what their draft status is, see Appendix I: Inheritance and Validity.
<... alt="descriptor" ...>
This attribute labels an alternative value for an element. The descriptor indicates what kind of alternative it is, and takes one of the following forms:
"proposed" should only be present if the draft status is not "approved". It indicates that the data is proposed replacement data that has been added provisionally until the differences between it and the other data can be vetted. For example, suppose that the translation for September for some language is "Settembru", and a bug report is filed that that should be "Settembro". The new data can be entered in, but marked as alt="proposed" until it is vetted.
...
<month type="9">Settembru</month>
<month type="9" draft="unconfirmed" alt="proposed">Settembro</month>
<month type="10">...
Now assume another bug report comes in, saying that the correct form is actually "Settembre". Another alternative can be added:
... <month type="9" draft="unconfirmed" alt="proposed2">Settembre</month> ...
The allowable values for variantname at this time are "variant", "list", "email", "www", and "secondary". This may be expanded in the future.
<... validSubLocales="de_AT de_CH de_DE" ...>
The attribute validSubLocales allows sublocales in a given tree to be treated as though a file for them were present when there is not one. It can be applied to any element. It only has an effect for locales that inherit from the current file where a file is missing, and the elements would not otherwise be draft.
For a more complete description of how draft applies to data, see Appendix I: Inheritance and Validity.
<... standard="..." ...>
Note: This attribute is deprecated. Instead, use a reference element with the attribute standard="true". See Section 5.13 <references>.
The value of this attribute is a list of strings representing standards: international, national, organization, or vendor standards. The presence of this attribute indicates that the data in this element is compliant with the indicated standards. Where possible, for uniqueness, the string should be a URL that represents that standard. The strings are separated by commas; leading or trailing spaces on each string are not significant. Examples:
<collation standard="MSA 200:2002">
...
<dateFormatStyle standard="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26780&amp;ICS1=1&amp;ICS2=140&amp;ICS3=30">
<... references="..." ...>
The value of this attribute is a token representing a reference for the information in the element, including standards that it may conform to. See Section 5.13 <references>. (In older versions of CLDR, the value of the attribute was freeform text. That format is deprecated.)
Example:
<territory type="UM" references="R222">USAs yttre öar</territory>
The reference element may be inherited. Thus, for example, R222 may be used in sv_SE.xml even though it is not defined there, if it is defined in sv.xml.
<... allow="verbatim" ...>
In certain circumstances, one or more elements do not follow the rule of the majority, as indicated by the inText element. In this case, the allow attribute is used:
The example below indicates that variant names are normally lower case with one exception.
<inText type="languages">lowercase-words</inText>
<variants>
<variant type="1901">ortografia tradizionale tedesca</variant>
<variant type="1996">ortografia tedesca del 1996</variant>
<variant type="NEDIS" allow="verbatim">dialetto del Natisone</variant>
</variants>
When attributes specify date ranges, this is usually done with the attributes from and to. The from attribute specifies the starting point, and the to attribute specifies the end point. In some cases, the attribute is time, and the element itself specifies whether it is equivalent to a from or a to. For example, this is done with the weekendStart and weekendEnd elements.
The data format is a restricted ISO 8601 format, restricted to the fields year, month, day, hour, minute, and second in that order, with "-" used as a separator between date fields, a space used as the separator between the date and the time fields, and ":" used as a separator between the time fields. If the minute or minute and second are absent, they are interpreted as zero. If the hour is also missing, then it is interpreted based on whether the attribute is from or to.
from defaults to "00:00:00" (midnight at the start of the day).
to defaults to "24:00:00" (midnight at the end of the day).
That is, Friday at 24:00:00 is the same time as Saturday at 00:00:00. Thus when the hour is missing, the from and to are interpreted inclusively: the range includes all of the day mentioned.
For example, the following are equivalent:
| <usesMetazone from="1991-10-27" to="2006-04-02" .../> |
| <usesMetazone from="1991-10-27 00:00:00" to="2006-04-02 24:00:00" .../> |
| <usesMetazone from="1991-10-26 24:00:00" to="2006-04-03 00:00:00" .../> |
as are the following:
| <weekendStart day="sat"/> <weekendEnd day="sun"/> |
| <weekendStart day="sat" time="00:00"/> <weekendEnd day="sun" time="24:00"/> |
| <weekendStart day="fri" time="24:00"/> <weekendEnd day="mon" time="00:00"/> |
If the from attribute is missing, the range is assumed to extend as far backwards in time as there is data for; if the to attribute is missing, then the range extends from this point onwards, with no known end point.
The dates and times are specified in local time, unless otherwise noted. (In particular, the metazone values are in UTC, also known as GMT.)
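The defaulting rules for from and to can be sketched as follows. This is an illustration rather than a normative parser: the function name is hypothetical, a complete year-month-day date is assumed, and no time zone handling is attempted.

```python
from datetime import datetime, timedelta

def parse_bound(value, is_from):
    """Parse 'YYYY-MM-DD[ HH[:MM[:SS]]]' into a datetime.

    Assumes a complete date. A missing hour defaults to 00:00:00 for a
    "from" bound and to 24:00:00 for a "to" bound.
    """
    date_part, _, time_part = value.partition(" ")
    year, month, day = (int(field) for field in date_part.split("-"))
    if time_part:
        fields = [int(field) for field in time_part.split(":")]
    else:
        fields = [0] if is_from else [24]
    # Absent minute and second are interpreted as zero.
    hour, minute, second = (fields + [0, 0])[:3]
    # 24:00:00 on one day is 00:00:00 on the next; datetime rejects hour 24.
    extra_day, hour = divmod(hour, 24)
    return datetime(year, month, day, hour, minute, second) + timedelta(days=extra_day)

print(parse_bound("2006-04-02", is_from=False))  # 2006-04-03 00:00:00
```

Under this reading, the three usesMetazone forms shown above produce identical start and end instants.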
<!ELEMENT identity (alias | (version, generation, language, script?, territory?, variant?, special*) ) >
The identity element contains information identifying the target locale for this data, and general information about the version of this data.
<version number="$Revision: 1.227 $">
The version element provides, in an attribute, the version of this file. The contents of the element can contain textual notes about the changes between this version and the last. For example:
<version number="1.1">Various notes and changes in version 1.1</version>
This is not to be confused with the version attribute on the ldml element, which tracks the DTD version.
<generation date="$Date: 2007/07/17 23:41:16 $" />
The generation element contains the last modified date for the data. This can be in two formats: ISO 8601 format, or CVS format (illustrated by the example above).
<language type="en"/>
The language code is the primary part of the specification of the locale id, with values as described above.
<script type="Latn" />
The script code may be used in the identification of written languages, with values described above.
<territory type="US"/>
The territory code is a common part of the specification of the locale id, with values as described above.
<variant type="NYNORSK"/>
The variant code is the tertiary part of the specification of the locale id, with values as described above.
When combined according to the rules described in Section 3, Unicode Language and Locale Identifiers, the language element, along with any of the optional script, territory, and variant elements, must identify a known, stable locale identifier. Otherwise, it is an error.
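As an illustration, the identifier could be assembled from an identity element as sketched below. The function name is hypothetical, and the sketch assumes that the parts are joined with underscores in the order language, script, territory, variant (consistent with identifiers such as de_DE_1901 used elsewhere in this document); the rules of Section 3 are not reproduced here, and no check is made that the result is a known, stable identifier.

```python
import xml.etree.ElementTree as ET

def locale_id_from_identity(identity):
    """Join the type values of the identifying subelements, in order."""
    parts = []
    for tag in ("language", "script", "territory", "variant"):
        child = identity.find(tag)
        if child is not None:
            parts.append(child.get("type"))
    return "_".join(parts)

identity = ET.fromstring(
    '<identity>'
    '<version number="1.1"/>'
    '<generation date="2007/07/17 23:41:16"/>'
    '<language type="en"/>'
    '<script type="Latn"/>'
    '<territory type="US"/>'
    '</identity>'
)
print(locale_id_from_identity(identity))  # en_Latn_US
```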
<!ELEMENT fallback (#PCDATA) >
<!ATTLIST fallback draft ( approved | contributed | provisional | unconfirmed ) #IMPLIED >
<!ATTLIST fallback references CDATA #IMPLIED >
In many cases, there may not be full data for a particular locale. The fallback element provides a mechanism to indicate what the best available fallback locales would be. For example, for a Breton speaker, the best fallback if data is unavailable might be French. That is, suppose we have found a Breton bundle, but it does not contain a translation for the key "CN" (for the country China). It is better to return "chine" than to fall back to the value in a default language such as Russian and get "Китай".
The contents are a list of locales (languages) in priority order, separated by spaces.
When a fallback is used for resource item lookup, the normal order of inheritance applies, except that before using any data from root, the data for the fallback locales is used if available. That is, normally we get the following fallback:
sms-FI → sms → root.
If the fallback values are (se-FI fi-FI), then instead the inheritance is:
sms-FI → sms → se_FI → se → fi_FI → fi → root
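The construction of this lookup order can be sketched as follows. The function names are hypothetical, and the sketch assumes that parents are obtained by simple truncation and that hyphen and underscore separators are interchangeable (both forms appear above); results are normalized to underscores.

```python
def truncation_chain(locale):
    """Return the locale followed by its successively shorter parents."""
    parts = locale.replace("-", "_").split("_")
    return ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]

def lookup_chain(locale, fallbacks=()):
    """Return the item lookup order: locale, fallback locales, then root."""
    chain = []
    for candidate in (locale, *fallbacks):
        for name in truncation_chain(candidate):
            if name not in chain:
                chain.append(name)
    chain.append("root")
    return chain

print(" → ".join(lookup_chain("sms-FI")))
# sms_FI → sms → root
print(" → ".join(lookup_chain("sms-FI", ["se-FI", "fi-FI"])))
# sms_FI → sms → se_FI → se → fi_FI → fi → root
```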
The fallback list is only a default list. It is recommended that any implementation provide a mechanism for overriding the fallbacks, by allowing users to specify a language priority list of acceptable languages, instead of just a single language. For example, if my native tongue is English, I can understand Swiss German and German, my French is rusty but usable, and Italian basic, I might choose "gsw, de, fr" as my list of languages, skipping Italian because my comprehension is not good enough for arbitrary content. An example of such a list is the Accept-Language list supplied by browsers.
With such information, the fallback list would not be used. Instead, the priority list would be used both for bundle fallback and for item fallback instead of root (see Section 4, Locale Inheritance).
<!ELEMENT localeDisplayNames (alias | (localeDisplayPattern?, languages?, scripts?, territories?, variants?, keys?, types?, measurementSystemNames?, codePatterns?, special*)) >
Display names for scripts, languages, countries, currencies, and variants in this locale are supplied by this element. They supply localized names for these items for use in user-interfaces for displaying menu lists. In addition, the localized names for currency items may also be suitable for use in user-interfaces involving flowing text. Examples are given below.
Note: The "en" locale may contain translated names for deprecated codes for debugging purposes. Translation of deprecated codes into other languages is discouraged.
Where present, the display names must be unique; that is, two distinct codes must not get the same display name. (There is one exception to this: in time zones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different time zone IDs.)
Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].
<localeDisplayPattern>
For compound language (locale) IDs such as "pt_BR" which contain additional subtags beyond the initial language code: When the <languages> data does not explicitly specify a display name such as "Brazilian Portuguese" for a given compound language ID, this element specifies how to assemble a fallback display name such as "Portuguese (Brazil)" from the display names of the subtags.
It includes two sub-elements:
<languages>
This contains a list of elements that provide the user-translated names for language codes, as described in Section 3, Unicode Language and Locale Identifiers.
<language type="ab">Abkhazian</language>
<language type="aa">Afar</language>
<language type="af">Afrikaans</language>
<language type="sq">Albanian</language>
The type can actually be any locale ID as specified above. The set of locale IDs that have their own translations is not fixed, and depends on the locale. For example, one language could translate the following locale IDs, while another falls back on the normal composition.
| type | translation | composition |
|---|---|---|
| nl_BE | Flemish | Dutch (Belgium) |
| zh_Hans | Simplified Chinese | Chinese (Simplified Han) |
| en_GB | British English | English (United Kingdom) |
Thus when a complete locale ID is formed by composition, the longest match in the language type is used, and the remaining fields (if any) added using composition.
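The longest-match rule can be sketched as follows. This is an illustration under stated assumptions: the function name and the name tables are hypothetical, the composition pattern "name (rest)" is inferred from the examples above rather than read from localeDisplayPattern, four-letter subtags are treated as scripts and all others as territories, and variants are not handled.

```python
def display_name(locale, languages, scripts, territories):
    """Compose a display name using the longest match in `languages`."""
    parts = locale.split("_")
    # Find the longest prefix of the ID that has its own language entry.
    for n in range(len(parts), 0, -1):
        prefix = "_".join(parts[:n])
        if prefix in languages:
            break
    else:
        raise KeyError("no language name for " + locale)
    # Remaining fields are named individually and added by composition.
    rest = [
        scripts.get(p, p) if len(p) == 4 else territories.get(p, p)
        for p in parts[n:]
    ]
    name = languages[prefix]
    return "%s (%s)" % (name, ", ".join(rest)) if rest else name

scripts = {"Hans": "Simplified Han"}
territories = {"BE": "Belgium", "BR": "Brazil"}
with_flemish = {"nl": "Dutch", "nl_BE": "Flemish", "zh": "Chinese", "pt": "Portuguese"}
without_flemish = {"nl": "Dutch"}

print(display_name("nl_BE", with_flemish, scripts, territories))     # Flemish
print(display_name("nl_BE", without_flemish, scripts, territories))  # Dutch (Belgium)
print(display_name("zh_Hans", with_flemish, scripts, territories))   # Chinese (Simplified Han)
```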
<scripts>
This element can contain a number of script elements. Each script element provides the localized name for a script code, as described in Section 3, Unicode Language and Locale Identifiers (see also UAX #24: Script Names [Scripts]). For example, in the language of this locale, the name for the Latin script might be "Romana", and for the Cyrillic script "Kyrillica". That would be expressed as follows.
<script type="Latn">Romana</script>
<script type="Cyrl">Kyrillica</script>
<territories>
This contains a list of elements that provide the user-translated names for territory codes, as described in Section 3, Unicode Language and Locale Identifiers.
<territory type="AF">Afghanistan</territory>
<territory type="AL">Albania</territory>
<territory type="DZ">Algeria</territory>
<territory type="AD">Andorra</territory>
<territory type="AO">Angola</territory>
<territory type="US">United States</territory>
<variants>
This contains a list of elements that provide the user-translated names for the variant_code values described in Section 3, Unicode Language and Locale Identifiers.
<variant type="nynorsk">Nynorsk</variant>
<keys>
This contains a list of elements that provide the user-translated names for the key values described in Section 3, Unicode Language and Locale Identifiers.
<key type="collation">Sortierung</key>
<types>
This contains a list of elements that provide the user-translated names for the type values described in Section 3, Unicode Language and Locale Identifiers. Since the translation of an option name may depend on the key it is used with, the latter is optionally supplied.
<type type="phonebook" key="collation">Telefonbuch</type>
<measurementSystemNames>
This contains a list of elements that provide the user-translated names for systems of measurement. The types currently supported are "US", "metric", and "UK".
<measurementSystemName type="US">U.S.</measurementSystemName>
Note: In the future, we may need to add display names for the particular measurement units (millimeter versus millimetre versus whatever the Greek, Russian, etc. are), and a message format for positioning those with respect to numbers. For example, "{number} {unitName}" in some languages, but "{unitName} {number}" in others.
<!ELEMENT layout ( alias | (orientation*, inList*, inText*, special*) ) >
This top-level element specifies general layout features. It currently has three possible elements, orientation, inList, and inText (other than <special>, which is always permitted).
<orientation lines="top-to-bottom" characters="left-to-right" />
The lines and characters attributes specify the default general ordering of lines within a page, and characters within a line. The values are:
| Vertical | top-to-bottom, bottom-to-top |
| Horizontal | left-to-right, right-to-left |
If the lines value is one of the vertical attributes, then the characters value must be one of the horizontal attributes, and vice versa. For example, for English the lines are top-to-bottom, and the characters are left-to-right. For Mongolian (in the Mongolian Script) the lines are right-to-left, and the characters are top to bottom. This does not override the ordering behavior of bidirectional text; it does, however, supply the paragraph direction for that text (for more information, see UAX #9: The Bidirectional Algorithm [BIDI]).
For dates, times, and other data to appear in the right order, the display for them should be set to the orientation of the locale.
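The constraint between the lines and characters attributes can be checked mechanically. The following is a minimal sketch; the function name is hypothetical.

```python
VERTICAL = {"top-to-bottom", "bottom-to-top"}
HORIZONTAL = {"left-to-right", "right-to-left"}

def orientation_is_valid(lines, characters):
    """True if one value is vertical and the other is horizontal."""
    return (lines in VERTICAL and characters in HORIZONTAL) or (
        lines in HORIZONTAL and characters in VERTICAL
    )

# The English example from the text.
print(orientation_is_valid("top-to-bottom", "left-to-right"))   # True
# Two vertical values cannot be combined.
print(orientation_is_valid("top-to-bottom", "bottom-to-top"))   # False
```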
<inList>
The following element controls whether display names (language, territory, etc) are title cased in GUI menu lists and the like. It is only used in languages where the normal display is lower case, but title case is used in lists. There are two options:
<inList casing="titlecase-words">
<inList casing="titlecase-firstword">
In both cases, the title case operation is the default title case function defined by Chapter 3 of [Unicode]. In the second case, only the first word (using the word boundaries for that locale) will be title cased. The results can be fine-tuned by using alt="list" on any element where titlecasing as defined by the Unicode Standard will produce the wrong value. For example, suppose that "turc de Crimée" is a value, and the title case should be "Turc de Crimée". Then that can be expressed using the alt="list" value.
<inText>
This element indicates the casing of the data in the category identified by the inText type attribute, when that data is written in text or how it would appear in a dictionary. For example:
<inText type="languages">lowercase-words</inText>
indicates that language names embedded in text are normally written in lower case. The possible values and their meanings are:
<!ELEMENT characters (alias | (exemplarCharacters*, mapping*, special*)) >
The <characters> element provides optional information about characters that are in common use in the locale, and information that can be helpful in picking resources or data appropriate for the locale, such as when choosing among character encodings that are typically used to transmit data in the language of the locale. It typically only occurs in a language locale, not in a language/territory locale.
<exemplarCharacters>[a-zåæø]</exemplarCharacters>
The exemplar character set contains the commonly used letters for a given modern form of a language, which can be used for testing and for determining the appropriate repertoire of letters for charset conversion or collation. ("Letter" is interpreted broadly, as anything having the property Alphabetic in the [UCD], which also includes syllabaries and ideographs.) It is not a complete set of letters used for a language, nor should it be considered to apply to multiple languages in a particular country. Punctuation and other symbols should not be included. In particular, format characters like CGJ are not included.
There are three sets: main, auxiliary, and currency. The main set should contain the minimal set required for users of the language, while the auxiliary exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. So, for example, if Irish newspapers and magazines commonly have Danish names using å, then it would be appropriate to include å in the auxiliary exemplar characters, just not in the main exemplar set. Note also that all of the main exemplars should be typeable with normal keyboards for that language. Major style guidelines are good references for the auxiliary set. Thus for English we have [a-z] in the main set, and [á à ă â å ä ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ß ú ù ŭ û ü ū ÿ] in the auxiliary set.
In general, the test to see whether or not a letter belongs in the main set is based on whether it is acceptable in that language to always use spellings that avoid that character. For example, the exemplar character set for en (English) is the set [a-z]. This set does not contain the accented letters that are sometimes seen in words like "résumé" or "naïve", because it is acceptable in common practice to spell those words without the accents. The exemplar character set for fr (French), on the other hand, must contain those characters: [a-z é è ù ç à â ê î ô û æ œ ë ï ÿ]. The main set typically includes those letters commonly taught in schools as the "alphabet".
The currency set allows other characters in currency symbols (like USD).
The list of characters is in the Unicode Set format, which allows boolean combinations of sets of letters, including those specified by Unicode properties.
Sequences of characters that act like a single letter in the language (especially in collation) are included within braces, such as [a-z á é í ó ú ö ü ő ű {cs} {dz} {dzs} {gy} ...]. The characters should be in normalized form (NFC). Where combining marks are used generatively, and apply to a large number of base characters (such as in Indic scripts), the individual combining marks should be included. Where they are used with only a few base characters, the specific combinations should be included. Wherever there is not a precomposed character (that is, a single code point) for a given combination, that combination must be included within braces. For example, to include sequences from the Where is my Character? page on the Unicode site, one would write: [{ch} {tʰ} {x̣} {ƛ̓} {ą́} {i̇́} {ト゚}], but for French one would just write [a-z é è ù ...]. When in doubt use braces, since it does no harm to include them around single code points: for example, [a-z {é} {è} {ù} ...].
If the letter 'z' were only ever used in the combination 'tz', then we might have [a-y {tz}] in the main set. (The language would probably have plain 'z' in the auxiliary set, for use in foreign words.) If combining characters can be used productively in combination with a large number of others (such as say Indic matras), then they are not listed in all the possible combinations, but separately, such as:
[ ॐ ०-९ ऄ-ऋ ॠ ऌ ॡ ऍ-क क़ ख ख़ ग ग़ घ-ज ज़ झ-ड ड़ ढ ढ़ ण-फ फ़ ब-य य़ र-ह ़ ँ-ः ॑-॔ ऽ ् ॽ ा-ॄ ॢ ॣ ॅ-ौ]
The exemplar character set for Han characters is composed somewhat differently. It is even harder to draw a clear line for Han characters, since usage is more like a frequency curve that slowly trails off to the right in terms of decreasing frequency. So for this case, the exemplar characters simply contain a set of reasonably frequent characters for the language.
The ordering of the characters in the set is irrelevant, but for readability in the XML file the characters should be in sorted order according to the locale's conventions. The set should only contain lower case characters (except for the special case of Turkish and similar languages, where the dotted capital I should be included); the upper case letters are to be mechanically added when the set is used. For more information on casing, see the discussion of Special Casing in the Unicode Character Database.
<mapping registry="iana" type="iso-2022-jp utf-8" alt="email" />
The mapping element describes character conversion mapping tables that are commonly used to encode data in the language of this locale for a particular purpose. Each encoding is identified by a name from the specified registry. If more than one encoding is used for a particular purpose, the encodings are listed in the type attribute in order, from most preferred to least. An alt tag is used to indicate the purpose ("email" or "www" being the most frequent); if it is absent, then the encoding(s) may be used for all purposes not explicitly specified.
Each locale may have at most one mapping element tagged with a particular purpose, and at most one general-purpose mapping element. Inheritance is on an element basis; an element in a sub-locale overrides an inherited element with the same purpose.
Currently the only registry that can be used is "iana", which specifies use of an IANA name.
Note: While IANA names are not precise for conversion (see UTR #22: Character Mapping Tables [CharMapML]), they are sufficient for this purpose.
<!ELEMENT delimiters (alias | (quotationStart*, quotationEnd*, alternateQuotationStart*, alternateQuotationEnd*, special*)) >
The delimiters supply common delimiters for bracketing quotations. The quotation marks are used with simple quoted text, such as:
He said, “Don’t be absurd!”
When quotations are nested, the quotation marks and alternate marks are used in an alternating fashion:
He said, “Remember what the Mad Hatter said: ‘Not the same thing a bit! Why you might just as well say that “I see what I eat” is the same thing as “I eat what I see”!’”
<quotationStart>“</quotationStart>
<quotationEnd>”</quotationEnd>
<alternateQuotationStart>‘</alternateQuotationStart>
<alternateQuotationEnd>’</alternateQuotationEnd>
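The alternation between the two pairs of marks can be sketched as follows. The function name and the depth parameter are hypothetical; the default marks are the values from the example elements above.

```python
def quote(text, depth=0, marks=("“", "”"), alternate=("‘", "’")):
    """Wrap text in the marks appropriate to its nesting depth.

    Depth 0 is the outermost quotation. Even depths use the quotation
    marks and odd depths use the alternate marks.
    """
    start, end = marks if depth % 2 == 0 else alternate
    return start + text + end

inner = quote("I see what I eat", depth=2)
middle = quote("Why you might just as well say that " + inner, depth=1)
print("He said, " + quote("Remember: " + middle, depth=0))
# He said, “Remember: ‘Why you might just as well say that “I see what I eat”’”
```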
<!ELEMENT measurement (alias | (measurementSystem?, paperSize?, special*)) >
The measurement element is deprecated in the main LDML files, because the data is more appropriately organized as connected to territories, not to linguistic data. Instead, the similar element in the supplemental data file should be used.
<!ELEMENT dates (alias | (localizedPatternChars*, calendars?, timeZoneNames?, special*)) >
This top-level element contains information regarding the format and parsing of dates and times. The data format is based on the Java/ICU format. Most of these are fairly self-explanatory, except the week elements, localizedPatternChars, and the meaning of the pattern characters. For information on this, and more information on other elements and attributes, see Appendix F: Date Format Patterns.
<!ELEMENT calendars (alias | (default*, calendar*, special*)) >
<!ELEMENT calendar (alias | (months?, monthNames?, monthAbbr?, days?, dayNames?, dayAbbr?, quarters?, week?, am*, pm*,
eras?, dateFormats?, timeFormats?, dateTimeFormats?, fields*, special*))>
This element contains multiple <calendar> elements, each of which specifies the fields used for formatting and parsing dates and times according to the given calendar. The month and quarter names are identified numerically, starting at 1. The day (of the week) names are identified with short strings, since there is no universally-accepted numeric designation.
Many calendars will only differ from the Gregorian Calendar in the year and era values. For example, the Japanese calendar will have many more eras (one for each Emperor), and the years will be numbered within that era. All calendar data inherits from the Gregorian calendar in the same locale data (if not present in the chain up to root), so only the differing data will be present. See Section 4.1 Multiple Inheritance.
<!ELEMENT months ( alias | (default*, monthContext*, special*)) >
<!ELEMENT monthContext ( alias | (default*, monthWidth*, special*)) >
<!ATTLIST monthContext type ( format | stand-alone ) #REQUIRED >
<!ELEMENT monthWidth ( alias | (month*, special*)) >
<!ATTLIST monthWidth type ( abbreviated| narrow | wide) #REQUIRED >
<!ELEMENT month ( #PCDATA | cp )* >
<!ATTLIST month type ( 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 ) #REQUIRED >
<!ELEMENT days ( alias | (default*, dayContext*, special*)) >
<!ELEMENT dayContext ( alias | (default*, dayWidth*, special*)) >
<!ATTLIST dayContext type ( format | stand-alone ) #REQUIRED >
<!ELEMENT dayWidth ( alias | (day*, special*)) >
<!ATTLIST dayWidth type NMTOKEN #REQUIRED >
<!ELEMENT day ( #PCDATA ) >
<!ATTLIST day type ( sun | mon | tue | wed | thu | fri | sat ) #REQUIRED >
<!ELEMENT quarters ( alias | (default*, quarterContext*, special*)) >
<!ELEMENT quarterContext ( alias | (default*, quarterWidth*, special*)) >
<!ATTLIST quarterContext type ( format | stand-alone ) #REQUIRED >
<!ELEMENT quarterWidth ( alias | (quarter*, special*)) >
<!ATTLIST quarterWidth type NMTOKEN #REQUIRED >
<!ELEMENT quarter ( #PCDATA ) >
<!ATTLIST quarter type ( 1 | 2 | 3 | 4 ) #REQUIRED >
Month, day, and quarter names may vary along two axes: the width and the context. The context is either format (the default), the form used within a date format string (such as "Saturday, November 12th"), or stand-alone, the form used independently, such as in calendar headers. The width can be wide (the default), abbreviated, or narrow. The format values must be distinct; that is, "S" could not be used both for Saturday and for Sunday. The same is not true for stand-alone values; they might only be distinguished by context, especially in the narrow format. The narrow format is typically used in calendar headers; it must be the shortest possible width, no more than one character (or grapheme cluster) in stand-alone values, and the shortest possible widths (in terms of grapheme clusters) in format values.
Due to aliases in root, the forms inherit "sideways". (See Section 4.1 Multiple Inheritance.) For example, if the abbreviated format data for Gregorian does not exist in a language X (in the chain up to root), then it inherits from the wide format data in that same language X.
<monthContext type="format">
  <default choice="wide"/>
  <monthWidth type="abbreviated">
    <alias source="locale" path="../monthWidth[@type='wide']"/>
  </monthWidth>
  <monthWidth type="narrow">
    <alias source="locale" path="../../monthContext[@type='stand-alone']/monthWidth[@type='narrow']"/>
  </monthWidth>
  <monthWidth type="wide">
    <month type="1">1</month>
    ...
    <month type="12">12</month>
  </monthWidth>
</monthContext>
<monthContext type="stand-alone">
  <monthWidth type="abbreviated">
    <alias source="locale" path="../../monthContext[@type='format']/monthWidth[@type='abbreviated']"/>
  </monthWidth>
  <monthWidth type="narrow">
    <month type="1">1</month>
    ...
    <month type="12">12</month>
  </monthWidth>
  <monthWidth type="wide">
    <alias source="locale" path="../../monthContext[@type='format']/monthWidth[@type='wide']"/>
  </monthWidth>
</monthContext>
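The sideways fallback expressed by these root aliases can be sketched in Python. This is a simplified illustration, not part of the specification: the ROOT_ALIASES table is transcribed by hand from the alias paths above, and resolve_month_names and the data dictionary are hypothetical names. The sketch assumes the data has already been merged along the locale's parent chain, so only the sideways step is shown.

```python
# Sideways fallbacks transcribed from the root aliases shown above:
# (context, width) -> (context, width) to try next.
ROOT_ALIASES = {
    ("format", "abbreviated"): ("format", "wide"),
    ("format", "narrow"): ("stand-alone", "narrow"),
    ("stand-alone", "abbreviated"): ("format", "abbreviated"),
    ("stand-alone", "wide"): ("format", "wide"),
}


def resolve_month_names(data, context, width):
    """Return the month names for (context, width), following root aliases.

    `data` is a hypothetical dict mapping (context, width) to a dict of
    month number -> name, assumed to be already merged up to root.
    """
    seen = set()
    key = (context, width)
    while key not in data:
        # Stop if there is no alias to follow, or if aliases loop.
        if key in seen or key not in ROOT_ALIASES:
            raise KeyError(f"no month data for {context}/{width}")
        seen.add(key)
        key = ROOT_ALIASES[key]
    return data[key]


# Language X supplies only the wide format names.
language_x = {("format", "wide"): {1: "January", 2: "February"}}
print(resolve_month_names(language_x, "stand-alone", "abbreviated")[1])
```

Here the stand-alone abbreviated request falls back first to the format abbreviated data and then to the format wide data, which is the only form language X provides.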
The older monthNames, dayNames, monthAbbr, and dayAbbr elements are maintained for backwards compatibility. They are equivalent to using the months (or days) element with the context type="format" and the width type="wide" (for ...Names) and type="narrow" (for ...Abbr), respectively. The minDays, firstDay, weekendStart, and weekendEnd elements are also deprecated; there are new elements in supplemental data for this data.
Example:
<calendar type="gregorian">
  <months>
    <default type="format"/>
    <monthContext type="format">
      <default type="wide"/>
      <monthWidth type="wide">
        <month type="1">January</month>
        <month type="2">February</month>
        ...
        <month type="11">November</month>
        <month type="12">December</month>
      </monthWidth>
      <monthWidth type="abbreviated">
        <month type="1">Jan</month>
        <month type="2">Feb</month>
        ...
        <month type="11">Nov</month>
        <month type="12">Dec</month>
      </monthWidth>
    </monthContext>
    <monthContext type="stand-alone">
      <default type="wide"/>
      <monthWidth type="wide">
        <month type="1">Januaria</month>
        <month type="2">Februaria</month>
        ...
        <month type="11">Novembria</month>
        <month type="12">Decembria</month>
      </monthWidth>
      <monthWidth type="narrow">
        <month type="1">J</month>
        <month type="2">F</month>
        ...
        <month type="11">N</month>
        <month type="12">D</month>
      </monthWidth>
    </monthContext>
  </months>
  <days>
    <default type="format"/>
    <dayContext type="format">
      <default type="wide"/>
      <dayWidth type="wide">
        <day type="sun">Sunday</day>
        <day type="mon">Monday</day>
        ...
        <day type="fri">Friday</day>
        <day type="sat">Saturday</day>
      </dayWidth>
      <dayWidth type="abbreviated">
        <day type="sun">Sun</day>
        <day type="mon">Mon</day>
        ...
        <day type="fri">Fri</day>
        <day type="sat">Sat</day>
      </dayWidth>
      <dayWidth type="narrow">
        <day type="sun">Su</day>
        <day type="mon">M</day>
        ...
        <day type="fri">F</day>
        <day type="sat">Sa</day>
      </dayWidth>
    </dayContext>
    <dayContext type="stand-alone">
      <dayWidth type="narrow">
        <day type="sun">S</day>
        <day type="mon">M</day>
        ...
        <day type="fri">F</day>
        <day type="sat">S</day>
      </dayWidth>
    </dayContext>
  </days>
  <quarters>
    <default type="format"/>
    <quarterContext type="format">
      <default type="abbreviated"/>
      <quarterWidth type="abbreviated">
        <quarter type="1">Q1</quarter>
        <quarter type="2">Q2</quarter>
        <quarter type="3">Q3</quarter>
        <quarter type="4">Q4</quarter>
      </quarterWidth>
      <quarterWidth type="wide">
        <quarter type="1">1st quarter</quarter>
        <quarter type="2">2nd quarter</quarter>
        <quarter type="3">3rd quarter</quarter>
        <quarter type="4">4th quarter</quarter>
      </quarterWidth>
    </quarterContext>
  </quarters>
  <am>AM</am>
  <pm>PM</pm>
  <eras>
    <eraAbbr>
      <era type="0">BC</era>
      <era type="1">AD</era>
    </eraAbbr>
    <eraNames>
      <era type="0">Before Christ</era>
      <era type="1">Anno Domini</era>
    </eraNames>
    <eraNarrow>
      <era type="0">B</era>
      <era type="1">A</era>
    </eraNarrow>
  </eras>
  ...
</calendar>
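As an illustration of how a consumer might read this structure, the following Python sketch looks up a month name by context and width using the standard library's ElementTree. The SAMPLE fragment is a cut-down copy of the example data, and month_name is a hypothetical helper; the sketch does no alias or inheritance resolution, so a missing form simply yields None.

```python
import xml.etree.ElementTree as ET

# A cut-down calendar fragment in the shape shown in the example.
SAMPLE = """
<calendar type="gregorian">
  <months>
    <monthContext type="format">
      <monthWidth type="wide">
        <month type="1">January</month>
        <month type="2">February</month>
      </monthWidth>
      <monthWidth type="abbreviated">
        <month type="1">Jan</month>
        <month type="2">Feb</month>
      </monthWidth>
    </monthContext>
  </months>
</calendar>
"""


def month_name(calendar, number, context="format", width="wide"):
    """Return the month name, or None if that form is not present.

    The defaults mirror the defaults in the text: context "format"
    and width "wide".
    """
    path = (
        f"months/monthContext[@type='{context}']"
        f"/monthWidth[@type='{width}']/month[@type='{number}']"
    )
    node = calendar.find(path)
    return None if node is None else node.text


calendar = ET.fromstring(SAMPLE)
print(month_name(calendar, 2, width="abbreviated"))  # Feb
```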
<!ELEMENT dateFormats (alias | (default*, dateFormatLength*, special*)) >
<!ELEMENT dateFormatLength (alias | (default*, dateFormat*, special*)) >
<!ATTLIST dateFormatLength type ( full | long | medium | short ) #REQUIRED >
<!ELEMENT dateFormat (alias | (pattern*, displayName*, special*)) >
Date formats have the following form:
<dateFormats>
  <default type="medium"/>
  <dateFormatLength type="full">
    <dateFormat>
      <pattern>EEEE, MMMM d, yyyy</pattern>
    </dateFormat>
  </dateFormatLength>
  <dateFormatLength type="medium">
    <default type="DateFormatsKey2"/>
    <dateFormat type="DateFormatsKey2">
      <pattern>MMM d, yyyy</pattern>
    </dateFormat>
    <dateFormat type="DateFormatsKey3">
      <pattern>MMM dd, yyyy</pattern>
    </dateFormat>
  </dateFormatLength>
</dateFormats>
<!ELEMENT timeFormats (alias | (default*, timeFormatLength*, special*)) >
<!ELEMENT timeFormatLength (alias | (default*, timeFormat*, special*)) >
<!ATTLIST timeFormatLength type ( full | long | medium | short ) #REQUIRED >
<!ELEMENT timeFormat (alias | (pattern*, displayName*, special*)) >
Time formats have the following form:
<timeFormats>
  <default type="medium"/>
  <timeFormatLength type="full">
    <timeFormat>
      <displayName>DIN 5008 (EN 28601)</displayName>
      <pattern>h:mm:ss a z</pattern>
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="medium">
    <timeFormat>
      <pattern>h:mm:ss a</pattern>
    </timeFormat>
  </timeFormatLength>
</timeFormats>
The preference of 12 hour versus 24 hour for the locale should be derived from the short timeFormat. If the hour symbol is "h" or "K" (of various lengths) then the format is 12 hour; otherwise it is 24 hour.
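The rule above can be sketched as a small Python function. This is a minimal illustration, not a normative algorithm: uses_12_hour_clock is a hypothetical name, and the sketch assumes the Java/ICU conventions that "H" and "k" are the 24-hour symbols and that text between single quotes is literal and must be skipped.

```python
def uses_12_hour_clock(short_time_pattern):
    """Return True if the pattern's hour symbol is 'h' or 'K'.

    Assumption: text inside single quotes is literal (Java/ICU pattern
    convention), so letters there are not treated as pattern symbols.
    """
    in_quotes = False
    for ch in short_time_pattern:
        if ch == "'":
            in_quotes = not in_quotes
        elif not in_quotes and ch in "hHkK":
            # The first hour symbol decides the preference.
            return ch in "hK"
    # No hour symbol found: per the rule above, treat as 24 hour.
    return False


print(uses_12_hour_clock("h:mm a"))  # True
print(uses_12_hour_clock("HH:mm"))   # False
```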
Date/Time formats have the following form:
<dateTimeFormats>
<default type="medium"/>