LDML Element Validity

Mark Davis, 2004-07-25

Latest: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/cldr/ldmlValidity.htm

The following describes in more detail how to determine the validity of the data for a given element in LDML. Part of this depends on the use of the draft attribute. If an element has draft="true", then the data is explicitly not known to be valid. ("Not known to be" because it may actually be valid, but it has not been vetted.) It is a bit more complicated than that, however, since draft="true" is inherited by sub-elements and by child locales. So this document defines precisely what it means to be implicitly draft.

In addition, we are proposing a new attribute that allows separate data trees to be merged. (Working out how that attribute would work precisely was actually the impetus for this document!)

Definitions

For any element in an XML file, an element chain is a resolved XPath leading from the root to an element, with attributes on each element in alphabetical order. So in, say, http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/main/el.xml we have:

<ldml version="1.1">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="el" />
 </identity>
 <localeDisplayNames>
  <languages>
   <language type="ar">Αραβικά</language>

...

Which gives the following element chains (among others):

  1. //ldml/localeDisplayNames
  2. //ldml/localeDisplayNames/languages/language@type="ar"

An element chain A is an extension of an element chain B if the initial portion of A is equivalent to B. For example, #2 above is an extension of #1. (Equivalent, depending on the tree, may not be "identical to". See below for the definition.)

An LDML file can be thought of as an ordered list of element pairs: <element chain, data>, where the element chains are all the chains for the end-nodes. (This works because LDML doesn't allow mixed content.) The ordering is the order in which the element chains occur in the file, and is thus determined by the DTD.

For example, some of those pairs would be the following. Notice that the first has the null string as element contents.
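One way to see such pairs concretely is the following Python sketch, which walks a simplified excerpt of the file above and emits an element pair for each end-node. The names `step` and `element_pairs`, the chain notation for multiple attributes, and the trimmed sample (the root element is written without its version attribute to keep the chains short) are assumptions made for illustration, not part of LDML.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical excerpt modeled on the el.xml sample.
SAMPLE = """<ldml>
  <identity>
    <version number="1.1"/>
    <generation date="2004-06-04"/>
    <language type="el"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="ar">Αραβικά</language>
    </languages>
  </localeDisplayNames>
</ldml>"""


def step(elem):
    """Render one element of a chain, attributes in alphabetical order."""
    attrs = "".join(f'@{k}="{v}"' for k, v in sorted(elem.attrib.items()))
    return elem.tag + attrs


def element_pairs(elem, prefix="/"):
    """Yield (element chain, data) for every end-node, in document order."""
    chain = f"{prefix}/{step(elem)}"
    children = list(elem)
    if not children:
        # End-node: its text (possibly the empty string) is the data.
        yield chain, (elem.text or "").strip()
    for child in children:
        yield from element_pairs(child, chain)


pairs = list(element_pairs(ET.fromstring(SAMPLE)))
for chain, data in pairs:
    print(repr(chain), repr(data))
```

In this sketch the first pair is the version element, whose data is the empty string, and the last pair carries the Greek name for Arabic.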

Note: Some elements are not inherited from their parent locales. For example, all of the elements in a <collation> element are part of the structure of the collation data itself. So everything in a <collation> element is treated as a single lump of data, as far as inheritance is concerned.

Two LDML element chains are equivalent when they would be identical if all attributes except the following list were removed: type, width, context. Thus the following are equivalent:
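As a rough illustration of the equivalence and extension tests, here is a Python sketch. It assumes a chain is represented as a tuple of (element name, attribute dictionary) steps. That representation and the names `normalize`, `equivalent`, and `is_extension` are hypothetical. The sketch also assumes that an extension must be strictly longer than the chain it extends, which the text does not state.

```python
# Attributes that are kept when comparing chains; all others are ignored.
DISTINGUISHING = ("type", "width", "context")


def normalize(chain):
    """Drop every attribute except type, width, and context."""
    return tuple(
        (name, tuple((a, attrs[a]) for a in DISTINGUISHING if a in attrs))
        for name, attrs in chain
    )


def equivalent(a, b):
    """Two chains are equivalent if they match after normalization."""
    return normalize(a) == normalize(b)


def is_extension(a, b):
    """True if the initial portion of a is equivalent to b (a strictly longer)."""
    na, nb = normalize(a), normalize(b)
    return len(na) > len(nb) and na[:len(nb)] == nb


plain = (("ldml", {"version": "1.1"}), ("localeDisplayNames", {}),
         ("languages", {}), ("language", {"type": "ar"}))
drafted = (("ldml", {"draft": "true"}), ("localeDisplayNames", {}),
           ("languages", {}), ("language", {"type": "ar", "draft": "true"}))
other = plain[:3] + (("language", {"type": "fr"}),)

print(equivalent(plain, drafted))     # True: version and draft are ignored
print(equivalent(plain, other))       # False: type differs
print(is_extension(plain, plain[:2])) # True
```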

For any locale ID, a locale chain is an ordered list starting with the root and leading down to the ID. For example:

<root, de, de-DE, de-DE-xxx>
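A locale chain can be computed with a few lines of Python. This sketch assumes that a parent locale is obtained by dropping the last hyphen-separated subtag, which matches the example above but is not spelled out in the text. The function name `locale_chain` is hypothetical.

```python
def locale_chain(locale_id):
    """Return the chain from root down to locale_id, e.g. root, de, de-DE."""
    if locale_id == "root":
        return ["root"]
    parts = locale_id.split("-")
    # Build each prefix in turn: de, de-DE, de-DE-xxx, ...
    return ["root"] + ["-".join(parts[:i + 1]) for i in range(len(parts))]


print(locale_chain("de-DE-xxx"))  # ['root', 'de', 'de-DE', 'de-DE-xxx']
```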

Resolved Data File

To produce a fully resolved locale data file from CLDR for a locale ID L, you start with root, and replace/add items from the child locales until you get down to L. More formally, this can be expressed as the following procedure.

  1. Let Result be an empty LDML file.
  2. For each Li in the locale chain for L
    1. For each element pair P in the LDML file for Li:
      1. If Result has an element pair Q with an equivalent element chain, remove Q.
      2. Add P to Result.

Note: when adding an element pair to a result, it has to go in the right order for it to be valid according to the DTD.
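The procedure above can be sketched in Python as follows. The data model is an assumption made for illustration: a tree is a dictionary from locale ID to a list of (chain, data) pairs, and a chain is a tuple of (element name, attribute dictionary) steps. The helper names `key`, `locale_chain`, `resolve`, and `lang` are hypothetical. The sketch does not reorder according to the DTD. It only keeps a replaced pair in the position of the pair it replaces and appends new pairs at the end. A missing file is treated as contributing no pairs.

```python
DISTINGUISHING = ("type", "width", "context")


def key(chain):
    """Hashable form of a chain with only the distinguishing attributes."""
    return tuple(
        (name, tuple((a, attrs[a]) for a in DISTINGUISHING if a in attrs))
        for name, attrs in chain
    )


def locale_chain(locale_id):
    """Chain from root down to locale_id (assumes truncation at hyphens)."""
    if locale_id == "root":
        return ["root"]
    parts = locale_id.split("-")
    return ["root"] + ["-".join(parts[:i + 1]) for i in range(len(parts))]


def resolve(tree, locale_id):
    """Merge the files along the locale chain; more specific data wins."""
    result = {}  # dicts keep insertion order, so replaced pairs keep their slot
    for loc in locale_chain(locale_id):
        for chain, data in tree.get(loc, []):
            # Replaces any pair Q with an equivalent chain, otherwise adds P.
            result[key(chain)] = (chain, data)
    return list(result.values())


def lang(code):
    """Build the chain for a language display name (illustrative helper)."""
    return (("ldml", {}), ("localeDisplayNames", {}),
            ("languages", {}), ("language", {"type": code}))


tree = {
    "root": [(lang("ar"), "Arabic"), (lang("fr"), "French")],
    "el": [(lang("ar"), "Αραβικά")],
}
resolved = resolve(tree, "el")
print([data for _, data in resolved])  # ['Αραβικά', 'French']
```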

Valid Data

The attribute draft='true' in LDML means that the data is not known to be valid. However, some data that is not explicitly marked as draft may be implicitly draft, because it inherits that status either from a parent locale or from an enclosing element.

Example 2. Suppose that a new locale is added for af (Afrikaans). To indicate that all of the data is draft, that attribute can be added to the top level.

<ldml version="1.1" draft="true">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="af" />
 </identity>
</ldml>

Any data can be added to that file, and all of it will have draft status. Once an item is vetted -- whether it is inherited or explicitly in the file -- then its status can be changed to non-draft. This can be done either by leaving draft="true" on the enclosing element and marking the child with draft="false", such as:

<ldml version="1.1" draft="true">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="af" />
 </identity>
 <characters draft="false"/>
 <localeDisplayNames/>
 <dates/>
 <numbers/>
 <collations/>
</ldml>

Or it can be done by removing the draft="true" from the enclosing element, and marking the other children as draft.

<ldml version="1.1">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="af" />
 </identity>
 <characters/>
 <localeDisplayNames/>
 <dates draft="true"/>
 <numbers draft="true"/>
 <collations draft="true"/>
</ldml>

Note: A missing draft attribute is not the same as either a true or false value. A missing attribute means instead: inherit the draft status from enclosing elements and parent locales.

This section also contains the proposed new attribute, legitimateChildren. (This is a provisional name: we'll come up with a more serious one!) This attribute allows us to mark children in a given tree that are valid, even though there is no file present. It only has an effect for locales that inherit from the current file where a file is missing, and the elements wouldn't otherwise be draft.

Example 1. Suppose that in a particular LDML tree, there are no region locales for German, e.g. there is a de.xml file, but no files for de-AT.xml, de-CH.xml, or de-DE.xml. Then no elements are valid for any of those region locales. If we want to mark one of those files as having valid elements, then we introduce an empty file, such as the following.

<ldml version="1.1">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="de" />
  <territory type="AT" />
 </identity>
</ldml>

With the legitimateChildren attribute, instead of adding the empty files de-AT.xml, de-CH.xml, and de-DE.xml, we can add to the parent locale file (de.xml) a list of the child locales that should behave as if files were present.

<ldml version="1.1" legitimateChildren="de-AT de-CH de-DE">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="de" />
 </identity>
...
</ldml>

More formally, here is how to determine whether data for an element chain E is implicitly or explicitly draft, given a locale L. Steps 1, 2, and 4 are simply formalizations of what is in LDML already. Step 3 adds the new attribute.

Checking for Draft Status:

  1. Parent Locale Inheritance
    1. Walk back through the locale chain, from L toward root, until you find a locale ID L' with a data file D. (L' may equal L).
    2. Produce the fully resolved data file D' for D.
    3. In D', find the first element pair whose element chain E' is either equivalent to or an extension of E.
    4. If there is no such E', return true
    5. If E' is not equivalent to E, truncate E' to the length of E.
  2. Enclosing Element Inheritance
    1. Walk through the elements in E', from back to front.
      1. If you ever encounter draft=x, return x
    2. If L' = L, return false
  3. Missing File Inheritance
    1. Otherwise, walk again through the elements in E', from back to front.
      1. If you encounter a legitimateChildren attribute:
        1. If L is in the attribute value, return false
        2. Otherwise return true
  4. Otherwise
    1. Return true

Because of the full resolution step (data from more specific files replaces data from less specific ones), the legitimateChildren value in the most specific locale file (the one farthest from the root file) wins.
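The draft check can be sketched in Python under some stated assumptions. A tree is modeled as a dictionary from locale ID to a list of (chain, data) pairs, a chain is a tuple of (element name, attribute dictionary) steps, and each pair carries the attributes of the file it came from. All function names and the sample data are hypothetical. The sketch walks from L back toward root in step 1 and returns true if no file is found at all. Both choices are assumptions rather than something the text states.

```python
DISTINGUISHING = ("type", "width", "context")


def key(chain):
    """Hashable form of a chain with only the distinguishing attributes."""
    return tuple(
        (name, tuple((a, attrs[a]) for a in DISTINGUISHING if a in attrs))
        for name, attrs in chain
    )


def locale_chain(locale_id):
    if locale_id == "root":
        return ["root"]
    parts = locale_id.split("-")
    return ["root"] + ["-".join(parts[:i + 1]) for i in range(len(parts))]


def resolve(tree, locale_id):
    result = {}
    for loc in locale_chain(locale_id):
        for chain, data in tree.get(loc, []):
            result[key(chain)] = (chain, data)
    return list(result.values())


def is_draft(tree, e, locale_id):
    # Step 1: nearest locale with a file, then the first matching chain E'.
    found = next((loc for loc in reversed(locale_chain(locale_id))
                  if loc in tree), None)
    if found is None:
        return True  # assumption: no file anywhere in the chain
    e_key = key(e)
    match = next((c for c, _ in resolve(tree, found)
                  if key(c)[:len(e_key)] == e_key), None)
    if match is None:
        return True
    e_prime = match[:len(e)]  # truncate an extension to the length of E
    # Step 2: the innermost explicit draft attribute decides.
    for _, attrs in reversed(e_prime):
        if "draft" in attrs:
            return attrs["draft"] == "true"
    if found == locale_id:
        return False
    # Step 3: the file for L is missing; look for legitimateChildren.
    for _, attrs in reversed(e_prime):
        if "legitimateChildren" in attrs:
            return locale_id not in attrs["legitimateChildren"].split()
    # Step 4
    return True


def lang(code, **ldml_attrs):
    """Illustrative helper: chain for a language name, attributes on <ldml>."""
    return (("ldml", ldml_attrs), ("localeDisplayNames", {}),
            ("languages", {}), ("language", {"type": code}))


tree = {
    "root": [(lang("ar"), "Arabic")],
    "de": [(lang("ar", legitimateChildren="de-AT de-CH de-DE"), "Arabisch")],
    "af": [(lang("ar", draft="true"), "Arabies")],
}
query = lang("ar")
print(is_draft(tree, query, "de"))     # False: file present, nothing marked draft
print(is_draft(tree, query, "de-AT"))  # False: no file, but listed in legitimateChildren
print(is_draft(tree, query, "de-LU"))  # True: no file and not listed
print(is_draft(tree, query, "af"))     # True: draft="true" on the enclosing element
```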

Separating Trees

To separate trees, add legitimateChildren or empty files wherever necessary. That is, wherever a file is removed that had existed in the main tree, any non-draft, validly inherited elements will need to have a legitimateChildren value added.

Merging Trees

When merging trees, draft='true' attributes need to be added wherever merging the trees would otherwise cause elements to be considered non-draft that were not non-draft in the original trees. In addition, legitimateChildren attributes can be removed wherever they are no longer necessary (although leaving them in is merely redundant; they have no harmful effects).