| Version | 1.4.1 |
| Authors | Mark Davis (mark.davis@google.com) |
| Date | 2006-11-03 |
| This Version | http://unicode.org/reports/tr35/tr35-7.html |
| Previous Version | http://unicode.org/reports/tr35/tr35-6.html |
| Latest Version | http://unicode.org/reports/tr35/ |
| Corrigenda | http://unicode.org/cldr/corrigenda.html |
| Latest Working Draft | http://unicode.org/draft/reports/tr35/tr35.html |
| Namespace: | http://unicode.org/cldr/ |
| DTDs: | http://unicode.org/cldr/dtd/1.4.1/ldml.dtd http://unicode.org/cldr/dtd/1.4.1/ldmlSupplemental.dtd |
| Revision | 7 |
This document describes an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Common Locale Data Repository maintained by the Unicode Consortium.
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For possible errata for this document, see [Errata].
Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. But there remain differences in the locale data used by different systems.
Common, recommended practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.
But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous: all within acceptable limits for human beings, but producing different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [Comparisons].)
Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.
This document specifies an XML format for the communication of locale data: the Locale Data Markup Language (LDML). This provides a common format for systems to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings (see [UCA]). LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain, or create a customized locale for a minority language. LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.
For more information, see the Common Locale Data Repository project page [LocaleProject].
Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use the data, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.
The first issue is basic: what is a locale? In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units; for sort-order (collation); plus translated names for timezones, languages, countries, and scripts. The data can also include text boundaries (character, word, line, and sentence), text transformations (including transliterations), and support for other services.
Locale data is not cast in stone: the data used on someone's machine may generally reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's timezone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, etc.), music preference, religion, party affiliation, favorite charity, etc.
Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.
In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, etc.). The format in this document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more information, see Appendix D: Language and Locale IDs.
We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a resource or field, and a tag indicating the key of the resource is called a resource tag.
LDML uses stable identifiers for distinguishing among locales, regions, currencies, timezones, transforms, and so on. Within each type of entity, such as locales or currencies, the identifiers are unique. However, across types the identifiers may not be unique: thus a currency identifier may be the same as a locale identifier (especially since identifiers are compared caselessly).
There are many systems for identifiers for these entities. The LDML identifiers may not match the identifiers used on a particular target system. If so, some process of identifier translation may be required when using LDML data.
An LDML locale identifier is either "root", or has the following format:
locale_id := base_locale_id options?
base_locale_id := extended_RFC3066bis_identifiers
options := "@" key "=" type ("," key "=" type )*
As usual, x? means that x is optional; x* means that x occurs zero or more times.
For historical reasons, this is called a locale ID. However, it really functions (with few exceptions) as a language ID, and accesses language-based data. There used to be some information that was improperly included in the language-based data, like default currency and weekend ranges, but that was removed over time and moved to supplemental files. Those supplemental data files represent not so much "locale" data as non-language data. However, except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data.
A locale ID is an extension of a language ID, and thus the structure and field values are based on the successor to RFC 3066, known as RFC3066bis, which has been approved, but not yet published. However, the registry of data for that successor is now being maintained by IANA. For that registry, and the editor's draft of the standard, see [RFC3066bis]. The canonical form of a locale ID uses "_" instead of the "-" used in RFC3066bis; however, implementations providing APIs for CLDR locale IDs should treat "-" as equivalent to "_" on input. The most common format for the base_locale_id is a series of one or more fields of the form:
language_code ("_" script_code)? ("_" territory_code)? ("_" variant_code)?
The field values are given in the following table. All field values are case-insensitive, except for the type, which is case-sensitive. However, customarily the language code is lowercase, the territory code is uppercase, the script code is titlecase (that is, first character uppercase and other characters lowercase), and variants are uppercase. This convention is used in the file names, which may be case-sensitive depending on the operating system. Customarily the currency IDs are uppercase and timezone IDs are titlecase by field (as defined in the timezone database); other key and type codes are lowercase. The type may also be referred to as a key-value, for clarity.
Note that some private use field values may be given specific values when used with LDML.
| Field | Allowable Characters | Allowable values |
|---|---|---|
| language_code | ASCII letters | [RFC3066bis] subtag values marked as Type: language. Extensions: In some exceptional cases, draft [ISO639] codes may be used in CLDR, if in the judgment of the technical committee they are essentially assured of being added. These currently include: Users should however be aware that if these codes are not accepted into [RFC3066bis], they will be replaced by whatever codes are used, or by private use codes. The private use codes from qfz..qtz will never be used by CLDR, and are thus safe for use for other purposes by applications using CLDR data. |
| script_code | ASCII letters | [RFC3066bis] subtag values marked as Type: script. In most cases the script is not necessary, since the language is only customarily written in a single script. Examples of cases where it is used are: CLDR allows for the use of the Unicode Script values [UAX24]: The private use codes from Qaaq..Qabx will never be used by CLDR, and are thus safe for use for other purposes by applications using CLDR data. |
| territory_code | ASCII letters, numbers | [RFC3066bis] subtag values marked as Type: region, or any UN M.49 code that doesn't correspond to a [RFC3066bis] region subtag. There are three private use codes defined in LDML: The private use codes from XA..XZ will never be used by CLDR, and are thus safe for use for other purposes by applications using CLDR data. |
| variant_code | ASCII letters | Values used in CLDR are discussed below. For information on the process for adding new standard variants or element/type pairs, see [LocaleProject]. |
| key | ASCII letters and digits | |
| type | ASCII letters, digits, and "-" | |
Examples:
en fr_BE de_DE@collation=phonebook,currency=DDM
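The grammar and field conventions above can be illustrated with a minimal parsing sketch. It is not part of the specification: the function name `parse_locale_id` and the returned dictionary layout are hypothetical, and the rule used to tell fields apart (a 4-letter field is a script, a 2-letter or 3-digit field is a territory, anything else is a variant) is an assumption drawn from the field descriptions in this section. No validation against [RFC3066bis] is attempted.

```python
import re


def parse_locale_id(locale_id):
    """Split an LDML locale ID into its fields and key/type options.

    Hypothetical helper: field shapes are assumed, not validated.
    """
    base, _, option_text = locale_id.partition("@")

    # options := "@" key "=" type ("," key "=" type)*
    options = {}
    if option_text:
        for pair in option_text.split(","):
            key, _, type_ = pair.partition("=")
            options[key] = type_  # the type is case-sensitive, so keep it as given

    # Treat "-" as equivalent to "_" on input.
    fields = base.replace("-", "_").split("_")
    result = {
        "language": fields[0].lower(),
        "script": None,
        "territory": None,
        "variants": [],
        "options": options,
    }
    for field in fields[1:]:
        if not result["variants"] and result["territory"] is None:
            if result["script"] is None and re.fullmatch(r"[A-Za-z]{4}", field):
                result["script"] = field.title()     # customary titlecase
                continue
            if re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", field):
                result["territory"] = field.upper()  # customary uppercase
                continue
        result["variants"].append(field.upper())
    return result
```

For instance, `parse_locale_id("de_DE@collation=phonebook,currency=DDM")` yields the language `de`, the territory `DE`, and the two options, while `parse_locale_id("sr-latn-cs")` is accepted with hyphens and normalized to the customary casing.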
The locale id format generally follows the description in the OpenI18N Locale Naming Guideline [NamingGuideline], with some enhancements. The main differences from those guidelines are that the locale id:
Note: The language + script + territory code combination can itself be considered simply a language code. For more information, see Appendix D: Language and Locale IDs.
A locale that only has a language code (and possibly a script code) is called a language locale; one with both language and territory code as well is called a territory locale (or country locale).
The variant codes specify particular variants of the locale, typically with special options. They cannot overlap with script or territory codes, so they must have either one letter or have more than 4 letters. The currently defined variants include:
| variant | Description |
|---|---|
| <RFC 3066bis variants> | As defined in [RFC3066bis], plus: |
| BOKMAL | Bokmål, variant of Norwegian (deprecated: use nb) |
| NYNORSK | Nynorsk, variant of Norwegian (deprecated: use nn) |
| AALAND | Åland, variant of Swedish used in Finland (deprecated: use AX) |
| POSIX | A POSIX-style invariant locale. |
| REVISED | For revised orthography |
| SAAHO | The Saaho variant of Afar |
Note: The first two of the above variants are for backwards compatibility. Typically the entire contents of these are defined by an <alias> element pointing at the nb_NO (Norwegian Bokmål) and nn_NO (Norwegian Nynorsk) locale IDs. See also Appendix K: Valid Attribute Values.
The locale IDs corresponding to grandfathered [RFC3066bis] language tags are permitted, but not recommended.
The currently defined optional key/type combinations include the following. Additional type values are defined in the detail sections of this document or in Appendix K: Valid Attribute Values. The assignment of values needs to ensure that they are unique if truncated to 8 letters.
| key | type | Description |
|---|---|---|
| collation | phonebook | For a phonebook-style ordering (used in German). |
| | pinyin | Pinyin ordering for Latin and for CJK characters (that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin) |
| | traditional | For a traditional-style sort (as in Spanish) |
| | stroke | Pinyin ordering for Latin, stroke order for CJK characters |
| | direct | Hindi variant |
| | posix | A "C"-based locale. |
| | big5han | Pinyin ordering for Latin, big5 charset ordering for CJK characters. |
| | gb2312han | Pinyin ordering for Latin, gb2312han charset ordering for CJK characters. |
| calendar* | gregorian | (default) |
| | islamic (alias: arabic) | Astronomical Arabic |
| | chinese | Traditional Chinese calendar |
| | islamic-civil (alias: civil-arabic) | Civil (algorithmic) Arabic calendar |
| | hebrew | Traditional Hebrew Calendar |
| | japanese | Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor) |
| | buddhist (alias: thai-buddhist) | Thai Buddhist Calendar (same as Gregorian except for the year) |
| | persian | Persian Calendar |
| | coptic | Coptic Calendar |
| | ethiopic | Ethiopic Calendar |
| *For information on the calendar algorithms associated with the data used with these types, see [Calendars]. | | |
| collation parameters | associated values as defined in: 5.13.1 <collation> | semantics as defined in: 5.13.1 <collation> |
| currency | ISO 4217 code | Currency value identified by ISO code, plus others in common use. See Appendix K: Valid Attribute Values and also [Data Formats] |
| timezone | TZID | Identification for timezone according to the TZ Database. See [Data Formats]. |
For more information on the allowed attribute values, see the specific elements below, and Appendix K: Valid Attribute Values.
CLDR Locale IDs can be converted to valid RFC 3066bis language tags by performing the following transformation.
Thus for example, we get the following conversion:
| CLDR | en_US_POSIX@calendar=islamic,collation=traditional,colStrength=secondary |
| RFC3066bis | en-US-x-ldml-POSIX-k-calendar-islamic-k-collation-traditio-k-colStren-secondar |
The following identifiers are used to indicate an unknown or invalid code in CLDR. The Region and Timezone codes are additional codes provided by CLDR; the others are defined by the relevant standards. When these codes are used in APIs connected with CLDR, the meaning is that either there was no identifier available, or that at some point an input identifier value was determined to be invalid or ill-formed.
| Code Type | Value | Description in Referenced Standards |
|---|---|---|
| Language | und | Undetermined language |
| Script | Zzzz | Code for uncoded script, Unknown [UAX24] |
| Region | ZZ | Unknown or Invalid Territory |
| Currency | XXX | The codes assigned for transactions where no currency is involved |
| Timezone | Etc/Unknown | Unknown or Invalid Timezone |
When only the script or region are known, then a locale ID will use "und" as the language subtag portion. Thus the locale tag "und_Grek" represents the Greek script; "und_US" represents the US territory.
For region codes, ISO and the UN establish a mapping to three-letter codes and numeric codes. However, this does not extend to the private use codes, which are the codes 900-999 (total: 100), and AAA-AAZ, QMA-QZZ, XAA-XZZ, and ZZA-ZZZ (total: 1092). CLDR supplies standard mappings to these: for the numeric codes, it uses the top of the numeric private use range; for the 3-letter codes it doubles the final letter. These are the resulting mappings for all of the private use region codes:
| Region | UN/ISO Numeric | ISO 3-Letter |
|---|---|---|
| AA | 958 | AAA |
| QM..QZ | 959..972 | QMM..QZZ |
| XA..XZ | 973..998 | XAA..XZZ |
| ZZ | 999 | ZZZ |
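The mapping in the table above follows directly from the two stated rules (numeric codes fill the top of the private use range, and the 3-letter code doubles the final letter). A minimal sketch that regenerates the table, using a hypothetical function name `private_use_region_mappings`:

```python
import string


def private_use_region_mappings():
    """Return {region: (numeric, three_letter)} for the private use region codes.

    Hypothetical helper that regenerates the table from the two rules.
    """
    letters = string.ascii_uppercase
    regions = (
        ["AA"]
        + ["Q" + c for c in letters[letters.index("M"):]]  # QM..QZ
        + ["X" + c for c in letters]                       # XA..XZ
        + ["ZZ"]
    )
    # Numeric codes occupy the top of the 900-999 private use range,
    # so the last region code (ZZ) receives 999.
    first_numeric = 999 - len(regions) + 1
    return {
        region: (first_numeric + offset, region + region[-1])
        for offset, region in enumerate(regions)
    }
```

There are 42 private use region codes in total, so the numeric assignments start at 958, which agrees with the first row of the table.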
For script codes, ISO 15924 supplies a mapping (however, the numeric codes are not in common use):
| Script | Numeric |
|---|---|
| Qaaa..Qabx | 900..949 |
The XML format relies on an inheritance model, whereby the resources are collected into bundles, and the bundles organized into a tree. Data for the many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as root. Wherever possible, the resources in the root are language & territory neutral. For example, the collation (sorting) order in the root is the default Unicode Collation Algorithm order (see [UCA]). Since English language collation has the same ordering, the 'en' locale data does not need to supply any collation data, nor does either the 'en_US' or the 'en_IE' locale data.
Given a particular locale id "en_US_someVariant", the search chain for a particular resource is the following.
en_US_someVariant en_US en root
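The search chain can be computed by repeatedly dropping the last field of the base locale ID and finishing with root. The following is a minimal sketch under simplifying assumptions: the names `search_chain` and `lookup` are hypothetical, each bundle is modelled as a plain dictionary from resource tag to value, and key/type options after "@" are simply ignored.

```python
def search_chain(locale_id):
    """Return the locales to search, most specific first, ending in root."""
    base = locale_id.partition("@")[0].replace("-", "_")
    if base.lower() == "root":
        return ["root"]
    fields = base.split("_")
    chain = ["_".join(fields[:n]) for n in range(len(fields), 0, -1)]
    return chain + ["root"]


def lookup(bundles, locale_id, tag):
    """Return (locale, value) for the first bundle in the chain that has tag."""
    for locale in search_chain(locale_id):
        bundle = bundles.get(locale, {})
        if tag in bundle:
            return locale, bundle[tag]
    return None
```

With bundles in which only root carries collation data, a lookup of the collation resource for `en_US` walks `en_US`, then `en`, and is satisfied at `root`.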
If a type and key are supplied in the locale id, then logically the chain from that id to the root is searched for a resource tag with a given type, all the way up to root. If no resource is found with that tag and type, then the chain is searched again without the type.
Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their data from the language locale: "en" will contain the bulk of the data; "en_US" will only contain a few items like currency. All data that is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less error-prone) maintenance.
Where this inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format for use by that system, by adding all inherited data to each locale data set.
For a more complete description of how inheritance applies to data, and the use of keywords, see Appendix I: Inheritance and Validity.
The locale data does not contain general character properties that are derived from the Unicode Character Database [UCD]. That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the CLDR data requires use of UCD data. In addition, POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.
Warning: If a locale has a different script than its parent (e.g., sr_Latn), then special attention must be paid to make sure that all inheritance is covered. For example, auxiliary exemplar characters may need to be empty ("[]") to block inheritance.
In clearly specified instances, resources may inherit from within the same locale. For example, currency format symbols inherit from the number format symbols; the Buddhist calendar inherits from the Gregorian calendar. This only happens where documented in this specification. In these special cases, the inheritance functions as normal, up to the root. If the data is not found along that path, then a second search is made, logically changing the element/attribute to the alternate values.
For example, for the locale "en_US" the month data in <calendar type="buddhist"> inherits first from <calendar type="buddhist"> in "en", then in "root". If not found there, then it inherits from <calendar type="gregorian"> in "en_US", then "en", then in "root".
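The two-pass search described above can be sketched as follows. This is an illustration only: the `ALTERNATES` table, the function names, and the dictionary model of a bundle (keyed by an element/type pair) are all hypothetical, and only the Buddhist-to-Gregorian case from the text is listed.

```python
def chain(locale_id):
    """Locale fallback chain, most specific first, ending in root."""
    fields = locale_id.split("_")
    return ["_".join(fields[:n]) for n in range(len(fields), 0, -1)] + ["root"]


# Hypothetical table of the documented in-locale fallbacks.
ALTERNATES = {("calendar", "buddhist"): ("calendar", "gregorian")}


def resolve(bundles, locale_id, element, type_, item):
    """Search the whole chain for the requested element/type, then retry
    the whole chain with the alternate element/type if nothing was found."""
    key = (element, type_)
    while key is not None:
        for locale in chain(locale_id):
            value = bundles.get(locale, {}).get(key, {}).get(item)
            if value is not None:
                return locale, key, value
        key = ALTERNATES.get(key)  # None when there is no further alternate
    return None
```

Note that the first pass runs all the way up to root before the alternate is consulted, so Buddhist month data in root would take precedence over Gregorian month data in `en_US`.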
There are two kinds of data that can be expressed in LDML: language-dependent data and supplementary data. In either case, data can be split across multiple files, which can be in multiple directory trees.
For example, the language-dependent data for Japanese in CLDR is present in the following files:
The status of the data is the same, whether or not data is split. That is, for the purpose of validation and lookup, all of the data for the above ja.xml files is treated as if it was in a single file.
Supplemental data relating to Japan or the Japanese writing system can be found in:
The following sections describe the structure of the XML format for language-dependent data. The more precise syntax is in the DTD, listed at the top of this document; however, the DTD does not describe all the constraints on the structure.
To start with, the root element is <ldml>, with the following DTD entry:
<!ELEMENT ldml (identity, (alias |(localeDisplayNames?, layout?, characters?, delimiters?, measurement?, dates?, numbers?, collations?, posix?, special*))) >
That element contains the following elements:
The structure of each of these elements and their contents will be described below. The first few elements have little structure, while dates, numbers, and collations are more involved.
The XML structure is stable over releases. Elements and attributes may be deprecated: they are retained in the DTD but their usage is strongly discouraged. In most cases, an alternate structure is provided for expressing the information.
In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information (such as numbers or dates). The reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be significant in translated material.
There are two kinds of elements in LDML: rule elements and structure elements. For structure elements, there are restrictions to allow for effective inheritance and processing:
Rule elements do not have this restriction, but also do not inherit, except as an entire block. The rule elements are listed in serialElements in the supplemental metadata. See also Appendix I: Inheritance and Validity.
Note that the data in examples given below is purely illustrative, and doesn't match any particular language. For a more detailed example of this format, see [Example]. There is also a DTD for this format, but remember that the DTD alone is not sufficient to understand the semantics, the constraints, nor the interrelationships between the different elements and attributes. You may wish to have copies of each of these to hand as you proceed through the rest of this document.
In particular, all elements allow for draft versions to coexist in the file at the same time. Thus most elements are marked in the DTD as allowing multiple instances. However, unless an element is listed as a serialElement, or has a distinguishing attribute, it can only occur once as a subelement of a given element. Thus, for example, the following is illegal even though allowed by the DTD:
<languages>
<language type="aa">...</language>
<language type="aa">..</language>
</languages>
There must be only one instance of these per parent, unless there are other distinguishing attributes (such as an alt attribute).
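Because the DTD cannot express this constraint, a tool has to check it separately. The sketch below is one possible check, not a definitive validator: the function name `duplicate_children` is hypothetical, the set of distinguishing attributes is assumed to be just `type` and `alt`, and elements listed as serialElements (which may legitimately repeat) are not handled.

```python
import xml.etree.ElementTree as ET


def duplicate_children(parent, distinguishing=("type", "alt")):
    """Return the (tag, attribute values...) keys that occur more than once
    among the direct children of parent."""
    seen = set()
    duplicates = []
    for child in parent:
        key = (child.tag,) + tuple(child.get(name) for name in distinguishing)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


illegal = ET.fromstring(
    '<languages>'
    '<language type="aa">...</language>'
    '<language type="aa">..</language>'
    '</languages>'
)
print(duplicate_children(illegal))
```

Adding a distinguishing attribute such as `alt="proposed"` to the second element makes the two keys differ, so the check no longer reports a duplicate.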
In general, data should be in NFC format. Exceptions to this include transforms, segmentations, and pc/sc/tc/qc/ic rules in collation. Thus LDML documents must not be normalized as a whole. To prevent problems with normalization, no element value can start with a combining slash (U+0338 COMBINING LONG SOLIDUS OVERLAY).
Lists, such as singleCountries, are space-delimited. That means that they are separated by one or more XML whitespace characters, and that leading and trailing spaces are to be ignored (that is, they behave like NMTOKENS). These include:
At any level in any element, two special elements are allowed.
<special xmlns:yyy="xxx">
This element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute, which specifies the XML namespace of the special data. For example, the following uses the version 1.0 POSIX special element.
<!DOCTYPE ldml SYSTEM "http://unicode.org/cldr/dtd/1.0/ldml.dtd" [
  <!ENTITY % posix SYSTEM "http://unicode.org/cldr/dtd/1.0/ldmlPOSIX.dtd">
  %posix;
]>
<ldml>
...
<special xmlns:posix="http://www.opengroup.org/regproducts/xu.htm">
  <!-- old abbreviations for pre-GUI days -->
  <posix:messages>
    <posix:yesstr>Yes</posix:yesstr>
    <posix:nostr>No</posix:nostr>
    <posix:yesexpr>^[Yy].*</posix:yesexpr>
    <posix:noexpr>^[Nn].*</posix:noexpr>
  </posix:messages>
</special>
</ldml>
<alias source="<locale_ID>" path="..."/>
The contents of any element can be replaced by an alias, which points to another source for the data. The elements in that source are to be fetched from the corresponding location in the other source. Normal resource searching is to be used; take the following example:
<ldml>
<collations>
<collation type="phonebook">
<alias source="de_DE"/>
</collation>
</collations>
</ldml>
The resource bundle at "de_DE" will be searched for a resource element at the same position in the tree with type "collation". If not found there, then the resource bundle at "de" will be searched, etc. For an example of how this works with inheritance, look at the following table (where green indicates inherited items). Note in particular that an alias "reroutes" the inheritance; nothing in the parent affects the contents of an item with an alias. Thus the red item below is blocked.
| en | en_US | Resolved | | |
|---|---|---|---|---|
| <x> | <x> | <x> | | |
| de | de_DE | Resolved | de_DE_1901 | Resolved |
| <x> | <x> | <x> | <x> | <x> |
If the path attribute is present, then its value is an XPath that points to a different node in the tree. For example:
<alias source="root" path="../monthWidth[@type='wide']"/>
The default value if the path is not present is the same position in the tree. All of the attributes in the XPath must be distinguishing attributes. For more details, see Appendix I: Inheritance and Validity.
There is a special value for the source attribute, the constant source="locale", which is the default value. This special value is equivalent to the locale being resolved. For example, consider the following case, where locale data for 'de' is being resolved:
| Root | de | Resolved |
|---|---|---|
| <x> | <x> | <x> |
| <y> | <y> | <y> |
The first row shows the inheritance within the <x> element, whereby <c> is inherited from root. The second shows the inheritance within the <y> element, whereby <a>, <c>, and <d> are inherited also from root, but from an alias there. The alias in root is logically replaced not by the elements in root itself, but by elements in the 'target' locale.
For more details on data resolution, see Appendix I: Inheritance and Validity.
It is an error to have a circular chain of aliases. That is, a collection of LDML XML documents must not have situations where a sequence of alias lookups (including inheritance and multiple inheritance) can be followed indefinitely without terminating.
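A tool that loads a collection of LDML documents can check for this error before resolving any data. The sketch below is a simplified illustration: it assumes the aliases have already been extracted into a dictionary that maps each aliased node, written here as a hypothetical (locale, path) pair, to the node its alias points at, and it follows alias links only, without modelling inheritance.

```python
def find_alias_cycle(aliases):
    """Return one circular chain of aliases as a list of nodes, or None.

    aliases maps a node to the node that its <alias> element points at.
    """
    for start in aliases:
        visited = []
        node = start
        while node in aliases:
            if node in visited:
                # The walk has returned to an earlier node: report the loop.
                return visited[visited.index(node):]
            visited.append(node)
            node = aliases[node]
    return None
```

A chain that ends at a node with no alias terminates normally and is not reported; only a walk that revisits a node is treated as an error.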
<displayName>
Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number format, used to format numbers using the conventions of that locale, can have a translated name for presentation in GUIs.
<numberFormat>
<displayName>Prozentformat</displayName>
...
</numberFormat>
Where present, the display names must be unique; that is, two distinct codes would not get the same display name. (There is one exception to this: in timezones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different timezone IDs.) Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].
<default type="someID"/>
In some cases, a number of elements are present. The default element can be used to indicate which of them is the default, in the absence of other information. The value of the type attribute is to match the value of the type attribute for the selected item.
<timeFormats>
  <default type="medium" />
  <timeFormatLength type="full">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern>
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="long">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern>
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="medium">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a</pattern>
    </timeFormat>
  </timeFormatLength>
  ...
Like all other elements, the <default> element is inherited. Thus, it can also refer to inherited resources. For example, suppose that the above resources are present in fr, and that in fr_BE we have the following:
<timeFormats>
<default type="long"/>
</timeFormats>
In that case, the default time format for fr_BE would be the inherited "long" resource from fr. Now suppose that we had in fr_CA:
<timeFormatLength type="medium">
  <timeFormat type="standard">
    <pattern type="standard">...</pattern>
  </timeFormat>
</timeFormatLength>
In this case, the <default> is inherited from fr, and has the value "medium". It thus refers to this new "medium" pattern in this resource bundle.
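The fr, fr_BE, and fr_CA cases above can be reproduced with a small sketch in which the <default> element and the patterns are each resolved independently along the inheritance chain. The dictionary layout, the names `BUNDLES` and `resolve_default_time_format`, and the use of the literal string "..." for the elided fr_CA pattern are assumptions made for illustration.

```python
def chain(locale_id):
    """Locale fallback chain, most specific first, ending in root."""
    fields = locale_id.split("_")
    return ["_".join(fields[:n]) for n in range(len(fields), 0, -1)] + ["root"]


# Illustrative data mirroring the examples in the text.
BUNDLES = {
    "fr": {
        "default": "medium",
        "timeFormatLength": {
            "full": "h:mm:ss a z",
            "long": "h:mm:ss a z",
            "medium": "h:mm:ss a",
        },
    },
    "fr_BE": {"default": "long"},
    "fr_CA": {"timeFormatLength": {"medium": "..."}},  # pattern elided in the text
}


def resolve_default_time_format(bundles, locale_id):
    """Return (type, locale supplying the pattern, pattern), or None."""
    locales = chain(locale_id)
    # Step 1: the <default> element is inherited like any other element.
    default_type = next(
        (bundles[loc]["default"] for loc in locales
         if "default" in bundles.get(loc, {})),
        None,
    )
    if default_type is None:
        return None
    # Step 2: the pattern it names is looked up along the same chain.
    for loc in locales:
        patterns = bundles.get(loc, {}).get("timeFormatLength", {})
        if default_type in patterns:
            return default_type, loc, patterns[default_type]
    return None
```

For `fr_BE` the default is local but the "long" pattern comes from `fr`; for `fr_CA` the default comes from `fr` but the "medium" pattern is the one supplied in `fr_CA`.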
Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, in certain instances extra syntax is required to represent those code points that cannot be otherwise represented in element content. These escapes are only allowed in certain elements, according to the DTD.
| Code Point | XML Example |
|---|---|
| U+0000 | <cp hex="0"> |
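As an illustration of how a consumer might expand this escape, the following sketch replaces each cp element in a string with the code point it names. It is a simplification that works on raw text with a regular expression rather than on a parsed XML tree; the names `CP_ELEMENT` and `expand_cp` are hypothetical, and the pattern accepts both the form shown in the table and a self-closing form.

```python
import re

# Matches <cp hex="..."> and <cp hex="..."/>; the hex digits are captured.
CP_ELEMENT = re.compile(r'<cp\s+hex="([0-9A-Fa-f]+)"\s*/?>')


def expand_cp(text):
    """Replace each cp escape in text with the character it denotes."""
    return CP_ELEMENT.sub(lambda match: chr(int(match.group(1), 16)), text)
```

A real implementation would more likely handle the element while walking the parsed document, since the DTD restricts where it may appear.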
<... type="stroke" ...>
The attribute type is also used to indicate an alternate resource that can be selected with a matching type=option in the locale id modifiers, or be referenced by a default element. For example:
<ldml>
...
<currencies>
<currency>...</currency>
<currency type="preEuro">...</currency>
</currencies>
</ldml>
<... draft="unconfirmed" ...>
If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary draft value), as per the following:
Normally draft attributes should only occur on "leaf" elements. For a more formal description of how elements are inherited, and what their draft status is, see Appendix I: Inheritance and Validity.
<... alt="descriptor" ...>
This attribute labels an alternative value for an element. The descriptor indicates what kind of alternative it is, and takes one of the following forms:
"proposed" should only be present if the draft status is not "approved". It indicates that the data is proposed replacement data that has been added provisionally until the differences between it and the other data can be vetted. For example, suppose that the translation for September for some language is "Settembru", and a bug report is filed that that should be "Settembro". The new data can be entered in, but marked as alt="proposed" until it is vetted.
... <month type="9">Settembru</month> <month type="9" draft="unconfirmed" alt="proposed">Settembro</month> <month type="10">...
Now assume another bug report comes in, saying that the correct form is actually "Settembre". Another alternative can be added:
... <month type="9" draft="unconfirmed" alt="proposed2">Settembre</month> ...
The allowable values for variantname at this time are "variant", "list", "email", "www", and "secondary". This may be expanded in the future.
<... validSubLocales="de_AT de_CH de_DE" ...>
The attribute validSubLocales allows sublocales in a given tree to be treated as though a file for them were present when there isn't one. It can be applied to any element. It only has an effect for locales that inherit from the current file where a file is missing, and the elements wouldn't otherwise be draft.
For a more complete description of how draft applies to data, see Appendix I: Inheritance and Validity.
<... standard="..." ...>
Note: This attribute is deprecated. Instead, use a reference element with the attribute standard="true". See Section 5.12 <references>.
The value of this attribute is a list of strings representing standards: international, national, organization, or vendor standards. The presence of this attribute indicates that the data in this element is compliant with the indicated standards. Where possible, for uniqueness, the string should be a URL that represents that standard. The strings are separated by commas; leading or trailing spaces on each string are not significant. Examples:
<collation standard="MSA 200:2002">
...
<dateFormatStyle standard="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26780&ICS1=1&ICS2=140&ICS3=30">
<... references="..." ...>
The value of this attribute is a list of strings, separated by spaces, each representing a reference for the information in the element, including standards that it may conform to. The best format is a series of tokens, where each token corresponds to a reference element. See Section 5.12 <references>. (In older versions of CLDR, the value of the attribute was freeform text. That format is deprecated.)
Example:
<territory type="UM" references="R1 R2">USAs yttre öar</territory>
The reference element may be inherited. Thus, for example, R2 may be used in sv_SE.xml even though it is not defined there, if it is defined in sv.xml.
<!ELEMENT identity (alias | (version, generation, language, script?, territory?, variant?, special*) ) >
The identity element contains information identifying the target locale for this data, and general information about the version of this data.
<version number="$Revision: 1.212 $">
The version element provides, in an attribute, the version of this file. The contents of the element can contain textual notes about the changes between this version and the last. For example:
<version number="1.1">Various notes and changes in version 1.1</version>
This is not to be confused with the version attribute on the ldml element, which tracks the dtd version.
<generation date="$Date: 2006/10/28 03:19:43 $" />
The generation element contains the last modified date for the data. This can be in two formats: ISO 8601 format, or CVS format (illustrated by the example above).
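A consumer of the generation date therefore has to accept both forms. The sketch below is an assumption-laden illustration: the function name is hypothetical, and `datetime.fromisoformat` accepts only a subset of ISO 8601 in older Python versions.

```python
from datetime import datetime

def parse_generation_date(value):
    """Parse a generation date given either as a CVS $Date$ keyword or as ISO 8601 text."""
    text = value.strip()
    if text.startswith("$Date:") and text.endswith("$"):
        # CVS keyword form, for example "$Date: 2006/10/28 03:19:43 $".
        inner = text[len("$Date:"):-1].strip()
        return datetime.strptime(inner, "%Y/%m/%d %H:%M:%S")
    # Otherwise assume an ISO 8601 form that fromisoformat understands.
    return datetime.fromisoformat(text)
```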
<language type="en"/>
The language code is the primary part of the specification of the locale id, with values as described above.
<script type="Latn" />
The script field may be used in the identification of written languages, with values described above.
<territory type="US"/>
The territory code is a common part of the specification of the locale id, with values as described above.
<variant type="NYNORSK"/>
The variant code is the tertiary part of the specification of the locale id, with values as described above.
<!ELEMENT localeDisplayNames (alias | (languages?, scripts?, territories?, variants?, keys?, types?, measurementSystemNames?, special*)) >
Display names for scripts, languages, countries, and variants in this locale are supplied by this element. These supply localized names for these items for use in user-interfaces for displaying lists of locales and scripts. Examples are given below.
Note: The "en" locale may contain translated names for deprecated codes for debugging purposes. Translation of deprecated codes into other languages is discouraged.
Where present, the display names must be unique; that is, two distinct codes must not have the same display name. (There is one exception to this: in timezones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different timezone IDs.)
Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].
<languages>
This contains a list of elements that provide the user-translated names for language codes, as described in Section 3, Identifiers.
<language type="ab">Abkhazian</language> <language type="aa">Afar</language> <language type="af">Afrikaans</language> <language type="sq">Albanian</language>
The type can actually be any locale ID as specified above. The set of locale IDs that have their own translations is not fixed, and depends on the locale. For example, in one language one could translate the following locale IDs, and in another, fall back on the normal composition.
| type | translation | composition |
|---|---|---|
| nl_BE | Flemish | Dutch (Belgium) |
| zh_Hans | Simplified Chinese | Chinese (Simplified Han) |
| en_GB | British English | English (United Kingdom) |
Thus when a complete locale ID is formed by composition, the longest match in the language type is used, and the remaining fields (if any) added using composition.
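The longest-match rule can be sketched as follows. The name tables, the function name, and the "Name (Field, Field)" composition format are assumptions for illustration; only the entries taken from the table above come from this document.

```python
# Hypothetical display-name tables for one locale.
LANGUAGE_NAMES = {
    "nl": "Dutch", "nl_BE": "Flemish",
    "zh": "Chinese", "zh_Hans": "Simplified Chinese",
    "en": "English", "en_GB": "British English",
}
SCRIPT_NAMES = {"Hans": "Simplified Han"}
TERRITORY_NAMES = {"BE": "Belgium", "GB": "United Kingdom", "CN": "China", "US": "United States"}

def display_name(locale_id, language_names=LANGUAGE_NAMES):
    """Use the longest translated prefix, then add the remaining fields by composition."""
    fields = locale_id.split("_")
    for count in range(len(fields), 0, -1):
        prefix = "_".join(fields[:count])
        if prefix in language_names:
            base, rest = language_names[prefix], fields[count:]
            break
    else:
        # No translation at all: fall back on the raw language code.
        base, rest = fields[0], fields[1:]
    extras = [SCRIPT_NAMES.get(f) or TERRITORY_NAMES.get(f) or f for f in rest]
    return f"{base} ({', '.join(extras)})" if extras else base
```

With the full table, `display_name("nl_BE")` gives "Flemish"; with a table that only translates "nl", the same call falls back on composition and gives "Dutch (Belgium)".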
<scripts>
This element can contain a number of script elements. Each script element provides the localized name for a script code, as described in Section 3, Identifiers (see also UAX #24: Script Names [Scripts]). For example, in the language of this locale, the name for the Latin script might be "Romana", and the name for the Cyrillic script might be "Kyrillica". That would be expressed as follows.
<script type="Latn">Romana</script>
<script type="Cyrl">Kyrillica</script>
<territories>
This contains a list of elements that provide the user-translated names for territory codes, as described in Section 3, Identifiers.
<territory type="AF">Afghanistan</territory>
<territory type="AL">Albania</territory>
<territory type="DZ">Algeria</territory>
<territory type="AD">Andorra</territory>
<territory type="AO">Angola</territory>
<territory type="US">United States</territory>
<variants>
This contains a list of elements that provide the user-translated names for the variant_code values described in Section 3, Identifiers.
<variant type="nynorsk">Nynorsk</variant>
<keys>
This contains a list of elements that provide the user-translated names for the key values described in Section 3, Identifiers.
<key type="collation">Sortierung</key>
<types>
This contains a list of elements that provide the user-translated names for the type values described in Section 3, Identifiers. Since the translation of an option name may depend on the key it is used with, the latter is optionally supplied.
<type type="phonebook" key="collation">Telefonbuch</type>
<measurementSystemNames>
This contains a list of elements that provide the user-translated names for systems of measurement. The types currently supported are "US", "metric", and "UK".
<measurementSystemName type="US">U.S.</measurementSystemName>
Note: In the future, we may need to add display names for the particular measurement units (millimeter vs millimetre vs whatever the Greek, Russian, etc are), and a message format for positioning those with respect to numbers. E.g. "{number} {unitName}" in some languages, but "{unitName} {number}" in others.
<!ELEMENT layout ( alias | (orientation?, inList*, special*) ) >
This top-level element specifies general layout features. It currently has two possible elements, <orientation> and <inList> (other than <special>, which is always permitted).
<orientation lines="top-to-bottom" characters="left-to-right" />
The lines and characters attributes specify the default general ordering of lines within a page, and characters within a line. The values are:
| Orientation | Values |
|---|---|
| Vertical | top-to-bottom, bottom-to-top |
| Horizontal | left-to-right, right-to-left |
If the lines value is one of the vertical values, then the characters value must be one of the horizontal values, and vice versa. For example, for English the lines are top-to-bottom, and the characters are left-to-right. For Mongolian (in the Mongolian Script) the lines are right-to-left, and the characters are top-to-bottom. This does not override the ordering behavior of bidirectional text; it does, however, supply the paragraph direction for that text (for more information, see UAX #9: The Bidirectional Algorithm [BIDI]).
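The constraint between the two attributes can be checked mechanically. A minimal sketch, with a hypothetical function name:

```python
VERTICAL = {"top-to-bottom", "bottom-to-top"}
HORIZONTAL = {"left-to-right", "right-to-left"}

def is_valid_orientation(lines, characters):
    """Return True when one attribute is vertical and the other is horizontal."""
    return ((lines in VERTICAL and characters in HORIZONTAL)
            or (lines in HORIZONTAL and characters in VERTICAL))
```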
For dates, times, and other data to appear in the right order, the display for them should be set to the orientation of the locale.
<inList>
The following element controls whether display names (language, territory, etc) are titlecased in GUI menu lists and the like. It is only used in languages where the normal display is lowercase, but titlecase is used in lists. There are two options:
<inList casing="titlecase-words">
<inList casing="titlecase-firstword">
In both cases, the titlecase operation is the default titlecase function defined by Chapter 3 of [Unicode]. In the second case, only the first word (using the word boundaries for that locale) will be titlecased. The results can be fine-tuned by using alt="list" on any element where titlecasing as defined by the Unicode Standard will produce the wrong value. For example, suppose that "turc de Crimée" is a value, and the titlecase should be "Turc de Crimée". Then that can be expressed using the alt="list" value.
<!ELEMENT characters (alias | (exemplarCharacters*, mapping*, special*)) >
The <characters> element provides optional information about characters that are in common use in the locale, and information that can be helpful in picking resources or data appropriate for the locale, such as when choosing among character encodings that are typically used to transmit data in the language of the locale. It typically only occurs in a language locale, not in a language/territory locale.
<exemplarCharacters>[a-zåæø]</exemplarCharacters>
The exemplar character set contains the commonly used letters for a given modern form of a language, which can be used for testing and for determining the appropriate repertoire of letters for charset conversion or collation. ("Letter" is interpreted broadly, as anything having the property Alphabetic in the [UCD], which also includes syllabaries and ideographs.) It is not a complete set of letters used for a language, nor should it be considered to apply to multiple languages in a particular country. Punctuation and other symbols should not be included.
There are two sets: the main set should contain the minimal set required for users of the language, while the auxiliary exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. For example, if Irish newspapers and magazines would commonly have Danish names using å, then it would be appropriate to include å in the auxiliary exemplar characters, but not in the main exemplar set. Major style guidelines are good references for the auxiliary set. Thus for English we have [a-z] in the main set, and [á à ă â å ä ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ß ú ù ŭ û ü ū ÿ] in the auxiliary set.
In general, the test to see whether or not a letter belongs in the main set is based on whether it is acceptable in that language to always use spellings that avoid that character. For example, the exemplar character set for en (English) is the set [a-z]. This set does not contain the accented letters that are sometimes seen in words like "résumé" or "naïve", because it is acceptable in common practice to spell those words without the accents. The exemplar character set for fr (French), on the other hand, must contain those characters: [a-z é è ù ç à â ê î ô û æ œ ë ï ÿ]. The main set typically includes those letters commonly taught in schools as the "alphabet".
The list of characters is in the Unicode Set format, which allows boolean combinations of sets of letters, including those specified by Unicode properties.
Sequences of characters that act like a single letter in the language (especially in collation) are included within braces, such as [a-z á é í ó ú ö ü ő ű {cs} {dz} {dzs} {gy} ...]. The characters should be in normalized form (NFC). Where combining marks are used generatively, and apply to a large number of base characters (such as in Indic scripts), the individual combining marks should be included. Where they are used with only a few base characters, the specific combinations should be included. Wherever there is not a precomposed character (that is, a single code point) for a given combination, that combination must be included within braces. For example, to include sequences from the Where is my Character? page on the Unicode site, one would write: [{ch} {tʰ} {x̣} {ƛ̓} {ą́} {i̇́} {ト゚}], but for French one would just write [a-z é è ù ...]. When in doubt use braces, since it does no harm to include them around single code points: e.g. [a-z {é} {è} {ù} ...].
If the letter 'z' were only ever used in the combination 'tz', then we might have [a-y {tz}] in the main set. (The language would probably have plain 'z' in the auxiliary set, for use in foreign words.) If combining characters can be used productively in combination with a large number of others (such as say Indic matras), then they are not listed in all the possible combinations, but separately, such as:
[ ॐ ०-९ ऄ-ऋ ॠ ऌ ॡ ऍ-क क़ ख ख़ ग ग़ घ-ज ज़ झ-ड ड़ ढ ढ़ ण-फ फ़ ब-य य़ र-ह ़ ँ-ः ॑-॔ ऽ ् ॽ ा-ॄ ॢ ॣ ॅ-ौ]
The exemplar character set for Han characters is composed somewhat differently. It is even harder to draw a clear line for Han characters, since usage is more like a frequency curve that slowly trails off to the right in terms of decreasing frequency. So for this case, the exemplar characters simply contain a set of reasonably frequent characters for the language.
The ordering of the characters in the set is irrelevant, but for readability in the XML file the characters should be in sorted order according to the locale's conventions. The set should only contain lower case characters (except for the special case of Turkish and similar languages, where the dotted capital I should be included); the uppercase letters are to be mechanically added when the set is used. For more information, see [Data Formats] and the discussion of Special Casing in the Unicode Character Database.
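To make the format concrete, here is a rough sketch of reading a simple exemplar set and mechanically adding uppercase forms. It is not a Unicode Set parser: it assumes only single characters, ranges between single characters, braced sequences, and whitespace, and it ignores properties, escapes, and boolean combinations. The function names are hypothetical, and using Python's `str.upper` for the uppercase step is an assumption.

```python
def parse_exemplars(text):
    """Parse a simple set such as '[a-c é {ch}]' into a set of strings."""
    body = text.strip()[1:-1]  # drop the outer brackets
    items, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
        elif ch == "{":
            # A braced sequence acts as a single letter.
            end = body.index("}", i)
            items.append(body[i + 1:end])
            i = end + 1
        elif ch == "-" and items and i + 1 < len(body):
            # A range continues from the previous single character.
            start, stop = ord(items[-1]), ord(body[i + 1])
            items.extend(chr(c) for c in range(start + 1, stop + 1))
            i += 2
        else:
            items.append(ch)
            i += 1
    return set(items)

def with_uppercase(items):
    """Add the uppercase form of every item to the set."""
    return set(items) | {item.upper() for item in items}
```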
<mapping registry="iana" type="iso-2022-jp utf-8" alt="email" />
The mapping element describes character conversion mapping tables that are commonly used to encode data in the language of this locale for a particular purpose. Each encoding is identified by a name from the specified registry. If more than one encoding is used for a particular purpose, the encodings are listed in the type attribute in order, from most preferred to least. An alt tag is used to indicate the purpose ("email" or "www" being the most frequent); if it is absent, then the encoding(s) may be used for all purposes not explicitly specified.
Each locale may have at most one mapping element tagged with a particular purpose, and at most one general-purpose mapping element. Inheritance is on an element basis; an element in a sub-locale overrides an inherited element with the same purpose.
Currently the only registry that can be used is "iana", which specifies use of an IANA name.
Note: While IANA names are not precise for conversion (see UTR #22: Character Mapping Tables [CharMapML]), they are sufficient for this purpose.
<!ELEMENT delimiters (alias | (quotationStart*, quotationEnd*, alternateQuotationStart*, alternateQuotationEnd*, special*)) >
The delimiters supply common delimiters for bracketing quotations. The quotation marks are used with simple quoted text, such as:
He said, “Don’t be absurd!”
When quotations are nested, the quotation marks and alternate marks are used in an alternating fashion:
He said, “Remember what the Mad Hatter said: ‘Not the same thing a bit! Why you might just as well say that “I see what I eat” is the same thing as “I eat what I see”!’”
<quotationStart>“</quotationStart>
<quotationEnd>”</quotationEnd>
<alternateQuotationStart>‘</alternateQuotationStart>
<alternateQuotationEnd>’</alternateQuotationEnd>
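The alternation between the two pairs of marks can be sketched with a nesting depth. The function and constant names are hypothetical; the marks are the ones from the example above.

```python
# (quotationStart, quotationEnd), (alternateQuotationStart, alternateQuotationEnd)
MARKS = (("“", "”"), ("‘", "’"))

def quote(text, depth=0, marks=MARKS):
    """Wrap text in the marks for this nesting depth: even depths use the main pair."""
    start, end = marks[depth % 2]
    return f"{start}{text}{end}"

inner = quote("I see what I eat", depth=2)
middle = quote(f"you might just as well say that {inner}", depth=1)
outer = "He said, " + quote(f"Remember: {middle}")
```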
<!ELEMENT measurement (alias | (measurementSystem?, paperSize?, special*)) >
The measurement element is deprecated in the main LDML files, because the data is more appropriately organized as connected to territories, not to linguistic data. Instead, the similar element in the supplemental data file should be used.
<!ELEMENT dates (alias | (localizedPatternChars*, calendars?, timeZoneNames?, special*)) >
This top-level element contains information regarding the format and parsing of dates and times. The data format is based on the Java/ICU format. Most of these are fairly self-explanatory, except the week elements, localizedPatternChars, and the meaning of the pattern characters. For information on this, and more information on other elements and attributes, see Appendix F: Date Format Patterns.
<!ELEMENT calendar (alias | (months?, monthNames?, monthAbbr?, days?, dayNames?, dayAbbr?, quarters?, week?, am?, pm?, eras?, dateFormats?, timeFormats?, dateTimeFormats?, fields*, special*))>
This element contains multiple <calendar> elements, each of which specifies the fields used for formatting and parsing dates and times according to the given calendar. The month and quarter names are identified numerically, starting at 1. The day (of the week) names are identified with short strings, since there is no universally-accepted numeric designation.
Many calendars will only differ from the Gregorian Calendar in the year and era values. For example, the Japanese calendar will have many more eras (one for each Emperor), and the years will be numbered within that era. All calendar data inherits from the Gregorian calendar in the same locale data (if not present in the chain up to root), so only the differing data will be present. See Section 4.1 Multiple Inheritance.
<!ELEMENT months ( alias | (default?, monthContext*, special*)) >
<!ELEMENT monthContext ( alias | (default?, monthWidth*, special*)) >
<!ELEMENT monthWidth ( alias | (month*, special*)) >
<!ELEMENT days ( alias | (default?, dayContext*, special*)) >
<!ELEMENT dayContext ( alias | (default?, dayWidth*, special*)) >
<!ELEMENT dayWidth ( alias | (day*, special*)) >
<!ELEMENT quarters ( alias | (default?, quarterContext*, special*)) >
<!ELEMENT quarterContext ( alias | (default?, quarterWidth*, special*)) >
<!ELEMENT quarterWidth ( alias | (quarter*, special*)) >
Month, day, and quarter names may vary along two axes: the width and the context. The context is either format (the default), the form used within a date format string (such as "Saturday, November 12th"), or stand-alone, the form used independently, such as in calendar headers. The width can be wide (the default), abbreviated, or narrow. The format values must be distinct; that is, "S" could not be used both for Saturday and for Sunday. The same is not true for stand-alone values; they might only be distinguished by context, especially in the narrow format. That format is typically used in calendar headers; it must be the shortest possible width, no more than one character (or grapheme cluster) in stand-alone values, and the shortest possible widths (in terms of grapheme clusters) in format values.
If the stand-alone form does not exist (in the chain up to root), then it inherits from the format form. See Section 4.1 Multiple Inheritance. If the narrow format does not exist, it inherits from the abbreviated form; if the abbreviated format does not exist, it inherits from the wide format.
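These fallbacks can be sketched as a lookup over a nested dictionary. This is an illustration only: the data layout and names are hypothetical, inheritance across locales is not modeled, and the order in which the two axes are tried (same width in the format context first, then the next wider width) is an assumption that the text above does not settle.

```python
WIDTH_FALLBACK = {"narrow": "abbreviated", "abbreviated": "wide"}

# Hypothetical data shaped as context -> width -> type -> name.
MONTHS = {
    "format": {
        "wide": {"1": "January", "2": "February"},
        "abbreviated": {"1": "Jan", "2": "Feb"},
    },
    "stand-alone": {
        "wide": {"1": "Januaria", "2": "Februaria"},
        "narrow": {"1": "J", "2": "F"},
    },
}

def lookup_name(data, key, context="format", width="wide"):
    """Find a name, falling back from stand-alone to format and from narrower to wider."""
    contexts = [context] if context == "format" else [context, "format"]
    current = width
    while current is not None:
        for ctx in contexts:
            value = data.get(ctx, {}).get(current, {}).get(key)
            if value is not None:
                return value
        current = WIDTH_FALLBACK.get(current)
    return None
```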
The older monthNames, dayNames, and monthAbbr, dayAbbr are maintained for backwards compatibility. They are equivalent to: using the months element with the context type="format" and the width type="wide" (for ...Names) and type="narrow" (for ...Abbr), respectively. The minDays, firstDay, weekendStart, and weekendEnd elements are also deprecated; there are new elements in supplemental data for this data.
Example:
<calendar type="gregorian">
  <months>
    <default type="format"/>
    <monthContext type="format">
      <default type="wide"/>
      <monthWidth type="wide">
        <month type="1">January</month>
        <month type="2">February</month>
        ...
        <month type="11">November</month>
        <month type="12">December</month>
      </monthWidth>
      <monthWidth type="abbreviated">
        <month type="1">Jan</month>
        <month type="2">Feb</month>
        ...
        <month type="11">Nov</month>
        <month type="12">Dec</month>
      </monthWidth>
    </monthContext>
    <monthContext type="stand-alone">
      <default type="wide"/>
      <monthWidth type="wide">
        <month type="1">Januaria</month>
        <month type="2">Februaria</month>
        ...
        <month type="11">Novembria</month>
        <month type="12">Decembria</month>
      </monthWidth>
      <monthWidth type="narrow">
        <month type="1">J</month>
        <month type="2">F</month>
        ...
        <month type="11">N</month>
        <month type="12">D</month>
      </monthWidth>
    </monthContext>
  </months>
  <days>
    <default type="format"/>
    <dayContext type="format">
      <default type="wide"/>
      <dayWidth type="wide">
        <day type="sun">Sunday</day>
        <day type="mon">Monday</day>
        ...
        <day type="fri">Friday</day>
        <day type="sat">Saturday</day>
      </dayWidth>
      <dayWidth type="abbreviated">
        <day type="sun">Sun</day>
        <day type="mon">Mon</day>
        ...
        <day type="fri">Fri</day>
        <day type="sat">Sat</day>
      </dayWidth>
      <dayWidth type="narrow">
        <day type="sun">Su</day>
        <day type="mon">M</day>
        ...
        <day type="fri">F</day>
        <day type="sat">Sa</day>
      </dayWidth>
    </dayContext>
    <dayContext type="stand-alone">
      <dayWidth type="narrow">
        <day type="sun">S</day>
        <day type="mon">M</day>
        ...
        <day type="fri">F</day>
        <day type="sat">S</day>
      </dayWidth>
    </dayContext>
  </days>
  <quarters>
    <default type="format"/>
    <quarterContext type="format">
      <default type="abbreviated"/>
      <quarterWidth type="abbreviated">
        <quarter type="1">Q1</quarter>
        <quarter type="2">Q2</quarter>
        <quarter type="3">Q3</quarter>
        <quarter type="4">Q4</quarter>
      </quarterWidth>
      <quarterWidth type="wide">
        <quarter type="1">1st quarter</quarter>
        <quarter type="2">2nd quarter</quarter>
        <quarter type="3">3rd quarter</quarter>
        <quarter type="4">4th quarter</quarter>
      </quarterWidth>
    </quarterContext>
  </quarters>
  <am>AM</am>
  <pm>PM</pm>
  <eras>
    <eraAbbr>
      <era type="0">BC</era>
      <era type="1">AD</era>
    </eraAbbr>
    <eraNames>
      <era type="0">Before Christ</era>
      <era type="1">Anno Domini</era>
    </eraNames>
    <eraNarrow>
      <era type="0">B</era>
      <era type="1">A</era>
    </eraNarrow>
  </eras>
<!ELEMENT dateFormats (alias | (default?, dateFormatLength*, special*)) >
<!ELEMENT dateFormatLength (alias | (default?, dateFormat*, special*)) >
<!ELEMENT dateFormat (alias | (pattern*, displayName?, special*)) >
Date formats have the following form:
<dateFormats>
<default type="medium"/>
<dateFormatLength type="full">
<dateFormat>
<pattern>EEEE, MMMM d, yyyy</pattern>
</dateFormat>
</dateFormatLength>
<dateFormatLength type="medium">
<default type="DateFormatsKey2"/>
<dateFormat type="DateFormatsKey2">
<pattern>MMM d, yyyy</pattern>
</dateFormat>
<dateFormat type="DateFormatsKey3">
<pattern>MMM dd, yyyy</pattern>
</dateFormat>
</dateFormatLength>
</dateFormats>
<!ELEMENT timeFormats (alias | (default?, timeFormatLength*, special*)) >
<!ELEMENT timeFormatLength (alias | (default?, timeFormat*, special*)) >
<!ELEMENT timeFormat (alias | (pattern*, displayName?, special*)) >
Time formats have the following form:
<timeFormats>
<default type="medium"/>
<timeFormatLength type="full">
<timeFormat>
<displayName>DIN 5008 (EN 28601)</displayName>
<pattern>h:mm:ss a z</pattern>
</timeFormat>
</timeFormatLength>
<timeFormatLength type="medium">
<timeFormat>
<pattern>h:mm:ss a</pattern>
</timeFormat>
</timeFormatLength>
</timeFormats>
The preference of 12 hour vs 24 hour for the locale should be derived from the short timeFormat. If the hour symbol is "h" or "K" (of various lengths) then the format is 12 hour; otherwise it is 24 hour.
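This rule can be sketched as a scan over the pattern. The function name is hypothetical, and skipping text inside apostrophes assumes the Java/ICU quoting convention for literal text in patterns.

```python
def uses_12_hour_clock(pattern):
    """Return True if the pattern contains an unquoted 'h' or 'K' hour symbol."""
    in_quote = False
    for ch in pattern:
        if ch == "'":
            # Apostrophes delimit literal text; a doubled apostrophe toggles twice.
            in_quote = not in_quote
        elif not in_quote and ch in "hK":
            return True
    return False
```

For example, "h:mm a" is detected as 12 hour, while "HH:mm" and "H 'h' mm" (where the h is literal text) are treated as 24 hour.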
Date/Time formats have the following form:
<dateTimeFormats>
<default type="medium"/>
<dateTimeFormatLength type="full">
<dateTimeFormat>
<pattern>{0} {1}</pattern>
</dateTimeFormat>
</dateTimeFormatLength>
<availableFormats>
<dateFormatItem>d. MMM yy</dateFormatItem>
<dateFormatItem>hh:mm:ss a</dateFormatItem>
<dateFormatItem>MMMM yyyy</dateFormatItem>
<dateFormatItem>MMM yy</dateFormatItem>
. . .
</availableFormats>
<appendItems>
<appendItem request="G">{0} {1}</appendItem>
<appendItem request="w">{0} ({2}: {1})</appendItem>
. . .
</appendItems>
</dateTimeFormats>
</calendar> <calendar type="buddhist"> <eras> <eraAbbr> <era type="0">BE</era> </eraAbbr> </eras> </calendar>
<!ELEMENT dateTimeFormats (alias | (default?, dateTimeFormatLength*, availableFormats*, appendItems*, special*)) >
<!ELEMENT dateTimeFormatLength (alias | (dateTimeFormat*, special*))>
<!ELEMENT dateTimeFormat (alias | (pattern*, special*))>
<!ELEMENT availableFormats (alias | (dateFormatItem*, special*))>
<!ELEMENT appendItems (alias | (appendItem*, special*))>
<!ATTLIST appendItem request CDATA >
These formats allow for date and time formats to be composed in various ways. The dateTimeFormat element works like the dateFormats and timeFormats, except that the pattern is of the form "{1} {0}", where {0} is replaced by the time format, and {1} is replaced by the date format, with results such as "8/27/06 7:31 AM".
The availableFormats element and its subelements provide a more flexible formatting mechanism than the predefined list of patterns represented by dateFormatLength, timeFormatLength, and dateTimeFormatLength. Instead, there is an open-ended list of patterns (represented by dateFormatItem elements as well as the predefined patterns mentioned above) that can be matched against a requested set of calendar fields and field lengths. Software can look through the list and find the pattern that best matches the original request, based on the desired calendar fields and lengths. For example, the full month and year may be needed for a calendar application; the request is MMMMyyyy, but the best match may be "yyyy MMMM" or even "G yy MMMM", depending on the locale and calendar.
The id attribute is a so-called "skeleton", containing only field information, and in a canonical order. Examples are "yyyyMMMM" for year + full month, or "MMMd" for abbreviated month + day.
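One way to picture matching a request against available skeletons is a simple distance over field letters. The scoring below is invented for illustration and is not the matching algorithm defined by this specification; the function names are hypothetical.

```python
from itertools import groupby

def fields(skeleton):
    """Map each pattern letter to its run length, e.g. 'yyyyMMMM' -> {'y': 4, 'M': 4}."""
    return {letter: len(list(run)) for letter, run in groupby(skeleton)}

def distance(requested, candidate):
    """Score a candidate: missing fields first, then extra fields, then width differences."""
    req, cand = fields(requested), fields(candidate)
    missing = len(set(req) - set(cand))
    extra = len(set(cand) - set(req))
    width = sum(abs(req[f] - cand[f]) for f in set(req) & set(cand))
    return (missing, extra, width)

def best_match(requested, available):
    """Return the available skeleton with the smallest distance to the request."""
    return min(available, key=lambda skeleton: distance(requested, skeleton))
```

Because the comparison is made on field letters rather than on string order, a request of "MMMMyyyy" matches the skeleton "yyyyMMMM" exactly.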
In case the best match does not include all the requested calendar fields, the appendItems element describes how to append needed fields to one of the existing formats. Each appendItem element covers a single calendar field. In the pattern, {0} represents the format string, {1} the data content of the field, and {2} the display name of the field (see Calendar Fields).
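The placeholder substitution can be sketched with Python's `str.format`, which happens to use the same `{0}`, `{1}`, `{2}` syntax. The function name and sample values are hypothetical, and the pattern is applied here to already-formatted text for simplicity; an implementation might instead apply it at the pattern level.

```python
def apply_append_item(append_pattern, base, field_value, field_name):
    """Fill {0} (existing format), {1} (field content), and {2} (field display name)."""
    return append_pattern.format(base, field_value, field_name)

# The two appendItem patterns from the example above, with made-up values.
with_week = apply_append_item("{0} ({2}: {1})", "27. Aug 06", "35", "Week")
with_era = apply_append_item("{0} {1}", "27. Aug 06", "AD", "Era")
```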
<!ELEMENT week (alias | (minDays?, firstDay?, weekendStart?, weekendEnd?, special*))>
The week element is deprecated in the main LDML files, because the data is more appropriately organized as connected to territories, not to linguistic data. Instead, the similar element in the supplemental data file should be used.
<!ELEMENT fields ( alias | (field*, special*)) >
<!ELEMENT field ( alias | (displayName?, relative*, special*)) >
Translations may be supplied for names of calendar fields (elements of a calendar, such as Day, Month, Year, Hour, etc.), and for relative values for those fields (for example, the day with relative value -1 is "Yesterday"). Where there is not a convenient, customary word or phrase in a particular language for a relative value, it should be omitted.
Here are examples for English and German. Notice that the German has more fields than the English does.
<calendar>
<fields>
...
<field type='day'>
<displayName>Day</displayName>
<relative type='-1'>Yesterday</relative>
<relative type='0'>Today</relative>
<relative type='1'>Tomorrow</relative>
</field>
...
</fields>
</calendar>
<calendar>
<fields>
...
<field type='day'>
<displayName>Tag</displayName>
<relative type='-2'>Vorgestern</relative>
<relative type='-1'>Gestern</relative>
<relative type='0'>Heute</relative>
<relative type='1'>Morgen</relative>
<relative type='2'>Übermorgen</relative>
</field>
...
</fields>
</calendar>
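Because relative values may be omitted, a consumer should treat a missing entry as normal rather than as an error. The sketch below reads the German day field with Python's standard `xml.etree.ElementTree`; the function name is hypothetical, and only the fields fragment is parsed, not a complete LDML file.

```python
import xml.etree.ElementTree as ET

GERMAN_FIELDS = """
<fields>
  <field type='day'>
    <displayName>Tag</displayName>
    <relative type='-2'>Vorgestern</relative>
    <relative type='-1'>Gestern</relative>
    <relative type='0'>Heute</relative>
    <relative type='1'>Morgen</relative>
    <relative type='2'>Übermorgen</relative>
  </field>
</fields>
"""

def relative_names(fields_xml, field_type):
    """Return a mapping from relative offset to name for one calendar field."""
    root = ET.fromstring(fields_xml)
    field = root.find(f"field[@type='{field_type}']")
    if field is None:
        return {}
    return {int(item.get("type")): item.text for item in field.findall("relative")}

names = relative_names(GERMAN_FIELDS, "day")
```

A caller can use `names.get(offset)` and fall back to an ordinary date format when the result is `None`, as happens for an offset of 3 here.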
<!ELEMENT timeZoneNames (alias | (hourFormat*, hoursFormat*, gmtFormat*,
regionFormat*, fallbackFormat*, abbreviationFallback*, preferenceOrdering*, singleCountries*,
default*, zone*, special*)) >
<!ELEMENT zone (alias | ( long*, short*, exemplarCity*, special*)) >
The timezone IDs (tzid) are language-independent, and follow the TZ timezone database [Olson]. However, the display names for those IDs can vary by locale. The generic time is so-called wall-time; what clocks use when they are correctly switched from standard to daylight time at the mandated time of the year.
Unfortunately, the canonical tzids (those in zone.tab) are not stable: they may change in each release of the TZ Timezone database. In CLDR, however, stability of identifiers is very important. So the canonical IDs in CLDR are kept stable as described in Appendix L: Canonical Form.
The following is an example of timezone data. Although this is an example of possible data, in most cases only the exemplarCity needs translation. Even that does not need to be present if a country only has a single timezone. As always, the type field for each zone is the identification of that zone. It is not to be translated.
<zone type="America/Los_Angeles" > <long> <generic>Pacific Time</generic> <standard>Pacific Standard Time</standard> <daylight>Pacific Daylight Time</daylight> </long> <short> <generic>PT</generic> <standard>PST</standard> <daylight>PDT</daylight> </short> <exemplarCity>San Francisco</exemplarCity> </zone> <zone type="Europe/London"> <long> <generic>British Time</generic> <standard>British Standard Time</standard> <daylight>British Daylight Time</daylight> </long> <exemplarCity>York</exemplarCity> </zone>
Note: Transmitting "14:30" with no other context is incomplete unless it contains information about the time zone. Ideally one would transmit neutral-format date/time information, commonly in UTC, and localize as close to the user as possible. (For more about UTC, see [UTCInfo].)
The conversion from local time into UTC depends on the particular time zone rules, which will vary by location. The standard data used for converting local time (sometimes called wall time) to UTC and back is the TZ Data [Olson], used by Linux, UNIX, Java, ICU, and others. The data includes rules for matching the laws for time changes in different countries. For example, for the US it is:
"During the period commencing at 2 o'clock antemeridian on the first Sunday of April of each year and ending at 2 o'clock antemeridian on the last Sunday of October of each year, the standard time of each zone established by sections 261 to 264 of this title, as modified by section 265 of this title, shall be advanced one hour..." (United States Law - 15 U.S.C. §6(IX)(260-7)).
Each region that has different timezone or daylight savings time rules, either now or at any time back to 1970, is given a unique internal ID, such as Europe/Paris. (Some IDs are also distinguished on the basis of differences before 1970.) As with currency codes, these are internal codes. A localized string associated with these is provided for users (such as in the Windows Control Panel > Date/Time > Time Zone).
Unfortunately, laws change over time, and will continue to change in the future, both for the boundaries of timezone regions and the rules for daylight savings. Thus the TZ data is continually being augmented. Any two implementations using the same version of the TZ data will get the same results for the same IDs (assuming a correct implementation). However, if implementations use different versions of the data they may get different results. So if precise results are required then both the TZ ID and the TZ data version must be transmitted between the different implementations.
For more information, see [Data Formats].
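The recommendation above (transmit a neutral UTC instant together with the TZ ID and the TZ data version) can be sketched as a small message type. The class name ZonedInstant, the JSON field names, and the version string "2006n" are illustrative assumptions, not part of LDML.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ZonedInstant:
    utc: datetime        # neutral instant, expressed in UTC
    tzid: str            # TZ database ID, for example "Europe/Paris"
    tzdata_version: str  # version of the TZ data the sender used

    def to_json(self) -> str:
        return json.dumps({
            "utc": self.utc.astimezone(timezone.utc).isoformat(),
            "tzid": self.tzid,
            "tzdata": self.tzdata_version,
        })

    @staticmethod
    def from_json(text: str) -> "ZonedInstant":
        raw = json.loads(text)
        return ZonedInstant(
            datetime.fromisoformat(raw["utc"]), raw["tzid"], raw["tzdata"]
        )


def same_rules(a: ZonedInstant, b: ZonedInstant) -> bool:
    """Two parties are only guaranteed identical results on the same data version."""
    return a.tzdata_version == b.tzdata_version


sent = ZonedInstant(
    datetime(2006, 11, 3, 14, 30, tzinfo=timezone.utc), "Europe/Paris", "2006n"
)
received = ZonedInstant.from_json(sent.to_json())
print(received.tzid, received.utc.isoformat(), same_rules(sent, received))
```

The receiver localizes the instant for display only at the last step, using its own copy of the TZ rules for the given ID; if the data versions differ, it can at least detect that the results may not match.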
The following subelements of timezoneNames are used to control the fallback process described in Appendix J: Time Zone Display Names.
| Element Name | Data Examples | Results/Comment |
|---|---|---|
| hourFormat | "+HHmm;-HHmm" | "+1200" |
| | | "-1200" |
| hoursFormat | "{0}/{1}" | "-0800/-0700" |
| gmtFormat | "GMT{0}" | "GMT-0800" |
| | "{0}ВпГ" | "-0800ВпГ" |
| regionFormat | "{0} Time" | "Japan Time" |
| | "Tiempo de {0}" | "Tiempo de Japón" |
| fallbackFormat | "Tiempo de «{0}»" | "Tiempo de «Tokyo»" |
| abbreviationFallback | type="GMT" | causes any "long" match to be skipped in Timezone fallbacks |
| preferenceOrdering | type="America/Mexico_City America/Chihuahua America/New_York" | a preference ordering among modern zones |
| singleCountries | list="America/Godthab America/Santiago America/Guayaquil Europe/Madrid Pacific/Auckland Pacific/Tahiti Europe/Lisbon..." | uses country name alone |
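The way the format elements in the table combine can be sketched as follows. This is a simplified illustration that handles only the HH and mm fields shown in the table; the function names are hypothetical, and the full fallback logic of Appendix J is not reproduced.

```python
def format_hours(offset_minutes: int, hour_format: str = "+HHmm;-HHmm") -> str:
    """Apply hourFormat: the first subpattern is for positive offsets, the second for negative."""
    positive, negative = hour_format.split(";")
    pattern = negative if offset_minutes < 0 else positive
    hours, minutes = divmod(abs(offset_minutes), 60)
    return pattern.replace("HH", f"{hours:02d}").replace("mm", f"{minutes:02d}")


def format_hours_range(first: int, second: int, hours_format: str = "{0}/{1}") -> str:
    """Apply hoursFormat to a pair of offsets, such as standard and daylight."""
    return (hours_format
            .replace("{0}", format_hours(first))
            .replace("{1}", format_hours(second)))


def format_gmt(offset_minutes: int, gmt_format: str = "GMT{0}") -> str:
    """Apply gmtFormat around the formatted hours."""
    return gmt_format.replace("{0}", format_hours(offset_minutes))


def format_region(name: str, region_format: str = "{0} Time") -> str:
    """Apply regionFormat to a localized country name."""
    return region_format.replace("{0}", name)


print(format_hours(720), format_hours(-720))
print(format_hours_range(-480, -420))
print(format_gmt(-480), format_gmt(-480, "{0}ВпГ"))
print(format_region("Japan"), format_region("Japón", "Tiempo de {0}"))
```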
<!ELEMENT numbers (alias | (symbols?, decimalFormats?, scientificFormats?, percentFormats?, currencyFormats?, currencies?, special*)) >
The numbers element supplies information for formatting and parsing numbers and currencies. It has the following sub-elements: <symbols>, <decimalFormats>, <scientificFormats>, <percentFormats>, <currencyFormats>, and <currencies>. The currency IDs are from [ISO4217] (plus some additional common-use codes). For more information, including the pattern structure, see Appendix G: Number Pattern Format.
<!ELEMENT symbols (alias | (decimal?, group?, list?, percentSign?, nativeZeroDigit?, patternDigit?, plusSign?, minusSign?, exponential?, perMille?, infinity?, nan?, special*)) >
<symbols>
<decimal>.</decimal>
<group>,</group>
<list>;</list>
<percentSign>%</percentSign>
<nativeZeroDigit>0</nativeZeroDigit>
<patternDigit>#</patternDigit>
<plusSign>+</plusSign>
<minusSign>-</minusSign>
<exponential>E</exponential>
<perMille>‰</perMille>
<infinity>∞</infinity>
<nan>☹</nan>
</symbols>
<!ELEMENT decimalFormats (alias | (default?, decimalFormatLength*, special*))>
<!ELEMENT decimalFormatLength (alias | (default?, decimalFormat*, special*))>
<!ELEMENT decimalFormat (alias | (pattern*, special*)) >
(scientificFormats, percentFormats, and currencyFormats have the same structure)
<decimalFormats>
  <decimalFormatLength type="long">
    <decimalFormat>
      <pattern>#,##0.###</pattern>
    </decimalFormat>
  </decimalFormatLength>
</decimalFormats>
<scientificFormats>
  <default type="long"/>
  <scientificFormatLength type="long">
    <scientificFormat>
      <pattern>0.000###E+00</pattern>
    </scientificFormat>
  </scientificFormatLength>
  <scientificFormatLength type="medium">
    <scientificFormat>
      <pattern>0.00##E+00</pattern>
    </scientificFormat>
  </scientificFormatLength>
</scientificFormats>
<percentFormats>
  <percentFormatLength type="long">
    <percentFormat>
      <pattern>#,##0%</pattern>
    </percentFormat>
  </percentFormatLength>
</percentFormats>
<currencyFormats>
  <currencyFormatLength type="long">
    <currencyFormat>
      <pattern>¤ #,##0.00;(¤ #,##0.00)</pattern>
    </currencyFormat>
  </currencyFormatLength>
</currencyFormats>
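To show how the symbols and a pattern such as "#,##0.###" work together, here is a deliberately simplified sketch. It does not parse patterns; instead it assumes grouping by three and takes the minimum and maximum fraction digits as parameters (0 and 3 for "#,##0.###"). The function name and parameters are assumptions for illustration; the real pattern syntax is described in Appendix G.

```python
def format_decimal(value: float, decimal: str = ".", group: str = ",",
                   min_fraction: int = 0, max_fraction: int = 3) -> str:
    """Format value with grouping by three and locale-specific separators."""
    # Start from an ASCII rendering such as "1,234.570".
    text = f"{abs(value):,.{max_fraction}f}"
    integer, _, fraction = text.partition(".")
    # Drop optional trailing zeros (the '#' positions), keep required ones (the '0' positions).
    fraction = fraction.rstrip("0").ljust(min_fraction, "0")
    integer = integer.replace(",", group)
    sign = "-" if value < 0 else ""
    return sign + integer + (decimal + fraction if fraction else "")


print(format_decimal(1234.57))                               # symbols from the sample above
print(format_decimal(1234.57, decimal=",", group="."))       # a locale that swaps the two
print(format_decimal(1234.5, min_fraction=2, max_fraction=2))
```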
<!ELEMENT currency (alias | (pattern*, displayName*, symbol*, pattern*, decimal*, group*, special*)) >
Note: pattern appears twice in the above. The first is for consistency with all other cases of pattern + displayName; the second is for backwards compatibility.
<currencies>
<currency type="USD">
<displayName>Dollar</displayName>
<symbol>$</symbol>
</currency>
<currency type="JPY">
<displayName>Yen</displayName>
<symbol>¥</symbol>
</currency>
<currency type="INR">
<displayName>Rupee</displayName>
<symbol choice="true">0≤Rf|1≤Ru|1&lt;Rf</symbol>
</currency>
<currency type="PTE">
<displayName>Escudo</displayName>
<symbol>$</symbol>
</currency>
</currencies>
In formatting currencies, the currency number format is used with the appropriate symbol from <currencies>, according to the currency code. The <currencies> list can contain codes that are no longer in current use, such as PTE. The choice attribute can be used to indicate that the value uses a pattern interpreted as in Appendix H: Choice Patterns.
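A choice value such as the INR symbol above selects a string according to the numeric amount. The sketch below is one plausible reading, modeled on the behavior of Java's ChoiceFormat (the last condition that holds wins, and the first choice is used for values below every limit). That model and the function name choose are assumptions; Appendix H remains the authoritative description.

```python
import re

# One segment: a numeric limit, a relation, then the text to use.
SEGMENT = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*([≤#<])(.*)", re.S)


def choose(pattern: str, value: float) -> str:
    """Pick the text of the last segment whose condition holds for value."""
    result = None
    for segment in pattern.split("|"):
        match = SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"malformed choice segment: {segment!r}")
        limit, relation, text = float(match[1]), match[2], match[3]
        # '≤' (or '#') means limit <= value; '<' means limit < value.
        applies = value > limit if relation == "<" else value >= limit
        if applies or result is None:
            result = text
    return result


pattern = "0≤Rf|1≤Ru|1<Rf"
for amount in (0.5, 1, 2):
    print(amount, choose(pattern, amount))
```

With this reading, an amount of exactly 1 yields "Ru" and every other non-negative amount yields "Rf".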
When the currency symbol is substituted into a pattern, there may be some further modifications, according to the following.
<currencySpacing>
<beforeCurrency>
<currencyMatch>[:letter:]</currencyMatch>
<surroundingMatch>[:digit:]</surroundingMatch>
<insertBetween> </insertBetween>
</beforeCurrency>
<afterCurrency>
<currencyMatch>[:letter:]</currencyMatch>
<surroundingMatch>[:digit:]</surroundingMatch>
<insertBetween> </insertBetween>
</afterCurrency>
</currencySpacing>
This element controls whether additional characters are inserted on the boundary between the symbol and the pattern. For example, in the above, inserting the symbol "US$" into the pattern "#,##0.00¤" would result in an extra no-break space inserted before the symbol, e.g. "#,##0.00 US$", while inserting into the pattern "¤#,##0.00" would not, e.g. "US$#,##0.00". That is because the afterCurrency condition matches and the beforeCurrency condition doesn't. For more information on the matching used in the currencyMatch and surroundingMatch elements, see Appendix E: Unicode Sets.
Currencies can also contain optional grouping, decimal data, and pattern elements. This data is inherited from the <symbols> in the same locale data (if not present in the chain up to root), so only the differing data will be present. See Section 4.1 Multiple Inheritance.
Note: Currency values should never be interchanged without a known currency code. You never want the number 3.5 interpreted as $3.5 by one user and ¥3.5 by another. Locale data contains localization information for currencies, not a currency value for a country. A currency amount logically consists of a numeric value, plus an accompanying currency code (or equivalent). The currency code may be implicit in a protocol, such as where USD is implicit. But if the raw numeric value is transmitted without any context, then it has no definitive interpretation.
Notice that the currency code is completely independent of the end-user's language or locale. For example, RUR is the code for Russian Rubles. A currency amount of <RUR, 1.23457×10³> would be localized for a Russian user into "1 234,57р." (using U+0440 (р) CYRILLIC SMALL LETTER ER). For an English user it would be localized into the string "Rub 1,234.57".
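The separation between a currency amount (code plus numeric value) and its localized presentation can be sketched as below. The Money type, the LOCALES table, and its layout strings are hypothetical stand-ins for real locale data, and the Russian grouping separator is assumed to be a no-break space.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    code: str        # ISO 4217 currency code, independent of any locale
    amount: Decimal  # numeric value


# Hypothetical, hand-written locale data for this example only.
LOCALES = {
    "ru": {"decimal": ",", "group": "\u00a0",
           "symbols": {"RUR": "р."}, "layout": "{number}{symbol}"},
    "en": {"decimal": ".", "group": ",",
           "symbols": {"RUR": "Rub"}, "layout": "{symbol} {number}"},
}


def localize(money: Money, locale: str) -> str:
    """Render the same <code, value> pair according to the reader's locale."""
    data = LOCALES[locale]
    ascii_number = f"{money.amount:,.2f}"  # e.g. "1,234.57"
    # translate() maps both separators in one pass, so they cannot clobber each other.
    number = ascii_number.translate(
        str.maketrans({",": data["group"], ".": data["decimal"]})
    )
    symbol = data["symbols"].get(money.code, money.code)  # fall back to the code itself
    return data["layout"].format(number=number, symbol=symbol)


price = Money("RUR", Decimal("1234.57"))
print(localize(price, "ru"))
print(localize(price, "en"))
```

Only the presentation changes between the two calls; the transmitted value and currency code stay the same.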