LDML Canonical Form
MED, 2004-12-21
To allow simple comparison of LDML files, especially when vetting mechanical changes, we
should have a canonical form for those files. XML files
can vary widely in textual form while representing precisely the same data. Putting
the LDML files in the repository into a canonical form lets us use the simple diff tools
that are widely available (including the one in CVS) to detect differences when vetting changes, without those tools being
confused. This is not a requirement on other uses of LDML; it is simply a way to manage repository
data more easily.
See http://www.unicode.org/reports/tr35/
Here is a proposal.
Textual Content
- All start elements are on their own line, indented by depth tabs.
- All end elements (except for leaf nodes) are on their own line, indented by depth tabs.
- Any leaf node with empty content is in the form <foo/>.
- There are no blank lines except within comments or content.
- Spaces are used within a start element. There are no extra spaces within elements.
<version number="1.2"/>, not <version number = "1.2" />
</identity>, not </identity >
- All attribute values use double quote ("), not single (').
- There are no CDATA sections, and no escapes except those absolutely required.
- no &apos; since it is not necessary
- no &#x61;, it would be just 'a'
- All attributes with defaulted values are suppressed. See the Defaulted Attributes Table.
Example:
<ldml draft="true">
	<identity>
		<version number="1.2"/>
		<generation date="2004-06-04"/>
		<language type="en"/>
		<territory type="AS"/>
	</identity>
	<numbers>
		<currencyFormats>
			<currencyFormatLength>
				<currencyFormat>
					<pattern>¤#,##0.00;(¤#,##0.00)</pattern>
				</currencyFormat>
			</currencyFormatLength>
		</currencyFormats>
	</numbers>
</ldml>
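The textual rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (escape_text, escape_attr, serialize, canonical_text) are hypothetical, attribute and element ordering and defaulted-attribute suppression are not applied, and comments are ignored because ElementTree drops them when parsing by default.

```python
import xml.etree.ElementTree as ET


def escape_text(s):
    # Only the escapes XML requires in character data: & and <,
    # plus > when it would otherwise close a "]]>" sequence.
    return s.replace("&", "&amp;").replace("<", "&lt;").replace("]]>", "]]&gt;")


def escape_attr(s):
    # Attribute values are always double-quoted, so " must be escaped too.
    return escape_text(s).replace('"', "&quot;")


def serialize(elem, depth=0):
    """Return the lines for elem, indented by depth tabs."""
    indent = "\t" * depth
    attrs = "".join(f' {k}="{escape_attr(v)}"' for k, v in elem.attrib.items())
    children = list(elem)
    text = elem.text or ""
    if not children:
        if not text:
            # Leaf node with empty content: <foo/>
            return [f"{indent}<{elem.tag}{attrs}/>"]
        # Leaf node with content: start tag, content, end tag on one line.
        return [f"{indent}<{elem.tag}{attrs}>{escape_text(text)}</{elem.tag}>"]
    # Non-leaf node: start and end tags each on their own line.
    lines = [f"{indent}<{elem.tag}{attrs}>"]
    for child in children:
        lines.extend(serialize(child, depth + 1))
    lines.append(f"{indent}</{elem.tag}>")
    return lines


def canonical_text(xml_source):
    return "\n".join(serialize(ET.fromstring(xml_source))) + "\n"
```

Because the parser resolves character references and entities before the text is written back out, an unnecessary escape such as &#x61; or &apos; comes out as the plain character.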
Ordering
- Element names are ordered by the Element Order Table
- Attribute names are ordered by the Attribute Order Table
- Attribute value comparison is a bit more complicated, and may depend on the attribute and
type. Compare two values by using the following steps:
- If two values are in the Value Order Table, compare
according to the order in the table. Otherwise if just one is, it goes first.
- If two values are numeric [0-9], compare numerically (2 < 12). Otherwise if just one is
numeric, it goes first.
- Otherwise values are ordered alphabetically
- An attribute-value pair is ordered first by attribute name, and then if the attribute names
are identical, by the value.
- An element is ordered first by the element name, and then if the element names are identical,
by the sorted set of attribute-value pairs (sorted as described in the preceding item). For the latter, compare the first pair
in each (in sorted order by attribute pair). If they are identical, go to the second pair, etc.
- Any future additions to the DTD must be structured so as to allow compatibility with this
ordering.
- See also Appendix K: Valid Attribute Values
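The comparison steps above map naturally onto sort keys. The following Python sketch is one possible reading, not a reference implementation: ELEMENT_ORDER and VALUE_ORDER hold only small excerpts of the tables, "alphabetically" is assumed to mean plain string comparison, names missing from a table are assumed to sort last, and all function names are hypothetical.

```python
import re

# Excerpts only; the full tables appear elsewhere in this document.
ELEMENT_ORDER = ["eraNames", "eraAbbr", "era", "pattern"]
ATTRIBUTE_ORDER = [
    "type", "key", "registry", "alt", "source", "path", "day", "date",
    "version", "count", "lines", "characters", "before", "number", "time",
    "validSubLocales", "standard", "references", "draft",
]
VALUE_ORDER = {
    ("dateFormatLength", "type"): ["full", "long", "medium", "short"],
    ("monthWidth", "type"): ["wide", "abbreviated", "narrow"],
}


def rank(table, name):
    # Assumption: names not listed in a table sort after all listed names.
    return table.index(name) if name in table else len(table)


def value_key(element, attribute, value):
    table = VALUE_ORDER.get((element, attribute), [])
    if value in table:
        return (0, table.index(value), "")   # table values go first
    if re.fullmatch(r"[0-9]+", value):
        return (1, int(value), "")           # then numeric values, 2 < 12
    return (2, 0, value)                     # then everything else, by string


def pair_key(element, name, value):
    # Attribute name first, then value.
    return (rank(ATTRIBUTE_ORDER, name), name, value_key(element, name, value))


def element_key(name, attributes):
    # Element name first, then the sorted attribute-value pairs, pair by pair.
    pairs = sorted(pair_key(name, k, v) for k, v in attributes.items())
    return (rank(ELEMENT_ORDER, name), pairs)
```

For example, sorted(elements, key=lambda e: element_key(*e)) orders a list of (name, attributes) tuples. Python compares the pair lists element by element, so an element whose pairs are a prefix of another's sorts first; the rules above do not address that case.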
Comments
- Comments are of the form <!-- stuff -->.
- They are logically attached to a node. There are 4 kinds:
- Inline comments always appear after a leaf node, at the end of the same line. They are a single
line.
- Preblock comments always precede the attachment node, and are indented on the same level.
- Postblock comments always follow the attachment node, and are indented on the same level.
- The final comment appears after </ldml>.
- Multiline comments (except the final comment) have each line after the first indented to one
deeper level.
Examples:
<eraAbbr>
	<era type="0">BC</era> <!-- might add alternate BDE in the future -->
...
<timeZoneNames>
	<!-- Note: zones that don't use daylight time need further work -->
	<zone type="America/Los_Angeles">
	...
	<!-- Note: the following is known to be sparse,
		and needs to be improved in the future -->
	<zone type="Asia/Jerusalem">
Canonicalization
The process of canonicalization is fairly straightforward, except for comments. Inline comments
will have any linebreaks replaced by a space. There may be cases where the attachment node is not
permitted, such as the following.
		</dayWidth>
		<!-- some comment -->
	</dayContext>
</days>
In those cases, the comment will be made into a block comment on the nearest preceding leaf node, if
it is at that level or deeper. (If that node already has one, the new comment will be appended to it, with a line break
between.) If there is no place to attach the comment (for example, as a result of processing that
removes the attachment node), the comment and its node's xpath will be appended to the final comment
in the document.
Multiline comments will have leading tabs stripped, so any indentation should be done with
spaces.
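The two whitespace rules for comments can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the text argument is assumed to be the comment body between <!-- and -->, and collapsing the whitespace around a removed line break into a single space is an assumption beyond what the rule states.

```python
import re


def normalize_inline_comment(text):
    # Inline comments are a single line: each line break (together with
    # the whitespace around it, by assumption) becomes one space.
    return re.sub(r"\s*\n\s*", " ", text)


def format_block_comment(text, depth):
    # Strip leading tabs from every line, then re-indent: the first line
    # at the node's depth, continuation lines one level deeper.
    lines = [line.lstrip("\t") for line in text.split("\n")]
    indent = "\t" * depth
    out = [indent + "<!--" + lines[0]]
    out += [indent + "\t" + line for line in lines[1:]]
    out[-1] += "-->"
    return "\n".join(out)
```

Only tabs are stripped, so indentation written with spaces inside a multiline comment survives canonicalization.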
Element Order Table
The organization into bullets is purely for clarity; the ordering is established by which comes
first in the overall list. Note that most combinations of pairs of items will never be peer
elements, and thus never be compared.
- ldml, identity, alias, localeDisplayNames, layout, characters, delimiters, measurement, dates,
numbers, collations, posix,
- version, generation, language, script, territory, variant,
- languages, scripts, territories, variants, keys, types,
- key, type,
- orientation, exemplarCharacters, mapping, cp,
- quotationStart, quotationEnd, alternateQuotationStart, alternateQuotationEnd,
- measurementSystem, paperSize, height, width,
- localizedPatternChars, calendars, timeZoneNames,
- months, monthNames, monthAbbr, days, dayNames, dayAbbr, week, am, pm, eras, dateFormats,
timeFormats, dateTimeFormats, fields, month, day, minDays, firstDay, weekendStart, weekendEnd,
eraNames, eraAbbr, era, pattern, displayName, hourFormat, hoursFormat, gmtFormat, regionFormat,
fallbackFormat, abbreviationFallback, preferenceOrdering, default, calendar, monthContext,
monthWidth, dayContext, dayWidth, dateFormatLength, dateFormat, timeFormatLength, timeFormat,
dateTimeFormatLength, dateTimeFormat, zone, long, short, exemplarCity, generic, standard,
daylight, field, relative,
- symbols, decimalFormats, scientificFormats, percentFormats, currencyFormats, currencies,
- decimalFormatLength, decimalFormat, scientificFormatLength, scientificFormat,
percentFormatLength, percentFormat, currencyFormatLength, currencyFormat, currency, symbol,
decimal, group, list, percentSign, nativeZeroDigit, patternDigit, plusSign, minusSign,
exponential, perMille, infinity, nan,
- collation,
- messages, yesstr, nostr, yesexpr, noexpr,
- special (always last)
Attribute Order Table
The organization into bullets is purely for clarity; the ordering is established by which comes
first in the overall list. Note that most combinations of pairs of items will never be peer
elements, and thus never be compared.
- type, key, registry, alt (distinguishing types)
- source, path,
- day, date,
- version, count,
- lines, characters,
- before,
- number, time,
- validSubLocales, standard, references,
- draft
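Value Order Table
| Element | Attribute | Value order |
|---|---|---|
| weekendStart | day | sun, mon, tue, wed, thu, fri, sat |
| weekendEnd | day | sun, mon, tue, wed, thu, fri, sat |
| day | type | sun, mon, tue, wed, thu, fri, sat |
| dateFormatLength | type | full, long, medium, short |
| timeFormatLength | type | full, long, medium, short |
| dateTimeFormatLength | type | full, long, medium, short |
| decimalFormatLength | type | full, long, medium, short |
| scientificFormatLength | type | full, long, medium, short |
| percentFormatLength | type | full, long, medium, short |
| currencyFormatLength | type | full, long, medium, short |
| monthWidth | type | wide, abbreviated, narrow |
| dayWidth | type | wide, abbreviated, narrow |
| field | type | era, year, month, week, day, weekday, dayperiod, hour, minute, second, zone |
| zone | type | The order of prefixes is: America, Atlantic, Europe, Africa, Asia, Indian, Australia, Pacific, Arctic, Antarctica, Etc. Within the same prefix, sort first by longitude, then latitude (both given by the zone.tab file in the Olson database), then by full tzid. |
| | | numeric order |
| | | alphabetic order |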
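The zone ordering in the table above (prefix, then longitude, then latitude, then full tzid) can be expressed as a sort key. In this Python sketch the coordinates are rough, hand-entered decimal degrees for illustration only and are not read from zone.tab; ascending numeric order (west to east, then south to north) is an assumption, since the rule does not state a direction; zone_key is a hypothetical name.

```python
PREFIX_ORDER = [
    "America", "Atlantic", "Europe", "Africa", "Asia", "Indian",
    "Australia", "Pacific", "Arctic", "Antarctica", "Etc",
]

# Illustrative (latitude, longitude) pairs; approximate values only.
SAMPLE_COORDINATES = {
    "America/Los_Angeles": (34.05, -118.24),
    "America/New_York": (40.71, -74.01),
    "Europe/London": (51.51, -0.13),
    "Asia/Jerusalem": (31.78, 35.22),
}


def zone_key(tzid, coordinates=SAMPLE_COORDINATES):
    prefix = tzid.split("/", 1)[0]
    latitude, longitude = coordinates[tzid]
    # Prefix rank first, then longitude, then latitude, then the full tzid.
    # PREFIX_ORDER.index raises ValueError for a prefix outside the list.
    return (PREFIX_ORDER.index(prefix), longitude, latitude, tzid)
```

With this key, sorted(SAMPLE_COORDINATES, key=zone_key) lists the two America zones first (Los_Angeles before New_York because it lies further west), followed by Europe/London and Asia/Jerusalem.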
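Defaulted Attributes Table
| Element | Attribute | Default value |
|---|---|---|
| ldml | version | "1.2" |
| orientation | characters | "left-to-right" |
| orientation | lines | "top-to-bottom" |
| weekendStart | time | "00:00" |
| weekendEnd | time | "24:00" |
| dateFormat | type | "standard" |
| timeFormat | type | "standard" |
| dateTimeFormat | type | "standard" |
| decimalFormat | type | "standard" |
| scientificFormat | type | "standard" |
| percentFormat | type | "standard" |
| currencyFormat | type | "standard" |
| pattern | type | "standard" |
| currency | type | "standard" |
| collation | type | "standard" |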
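Suppressing defaulted attributes amounts to a table lookup. The sketch below assumes the table above is read as (element, attribute, default value) rows; DEFAULTS and suppress_defaults are hypothetical names.

```python
DEFAULTS = {
    ("ldml", "version"): "1.2",
    ("orientation", "characters"): "left-to-right",
    ("orientation", "lines"): "top-to-bottom",
    ("weekendStart", "time"): "00:00",
    ("weekendEnd", "time"): "24:00",
}
# Elements whose type attribute defaults to "standard".
for _name in (
    "dateFormat", "timeFormat", "dateTimeFormat", "decimalFormat",
    "scientificFormat", "percentFormat", "currencyFormat",
    "pattern", "currency", "collation",
):
    DEFAULTS[(_name, "type")] = "standard"


def suppress_defaults(element, attributes):
    # Keep only the attributes whose value differs from the element's default.
    return {
        name: value
        for name, value in attributes.items()
        if DEFAULTS.get((element, name)) != value
    }
```

An attribute is dropped only when both the element and the value match the table, so type="standard" on an element that is not listed is kept.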