LDML Canonical Form

MED, 2004-12-21

To allow simple comparison of LDML files, especially when vetting mechanical changes, we should have a canonical form for those files. XML files can vary widely in textual form while representing precisely the same data. Putting the LDML files in the repository into a canonical form lets us use the widely available simple diff tools (including those in CVS) to detect differences when vetting changes, without those tools being confused by insignificant variation. This is not a requirement on other uses of LDML; it is simply a way to manage repository data more easily.

See http://www.unicode.org/reports/tr35/

Here is a proposal.

Textual Content

  1. Every start tag begins a new line, indented with one tab per level of depth.
  2. Every end tag (except that of a leaf node) is on its own line, indented with one tab per level of depth.
  3. Any leaf node with empty content is in the form <foo/>.
  4. There are no blank lines except within comments or content.
  5. Within a start tag, items are separated by single spaces. There are no extra spaces within tags.
  6. All attribute values use double quote ("), not single (').
  7. There are no CDATA sections, and no escapes except those absolutely required.
  8. All attributes with defaulted values are suppressed. See the Defaulted Values Table.

Example:

<ldml draft="true">
	<identity>
		<version number="1.2"/>
		<generation date="2004-06-04"/>
		<language type="en"/>
		<territory type="AS"/>
	</identity>
	<numbers>
		<currencyFormats>
			<currencyFormatLength>
				<currencyFormat>
					<pattern>¤#,##0.00;(¤#,##0.00)</pattern>
				</currencyFormat>
			</currencyFormatLength>
		</currencyFormats>
	</numbers>
</ldml>
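
As a rough illustration of rules 1 through 7, the following Python sketch writes a parsed document back out in this layout. It is not the repository tool: the function names are made up for this example, comments are dropped by the parser used here, and ordering and defaulted attributes are not handled.

```python
import xml.etree.ElementTree as ET

def escape_text(s):
    # Only the escapes that element content strictly needs (rule 7).
    return s.replace("&", "&amp;").replace("<", "&lt;")

def escape_attr(s):
    # Attribute values are double-quoted (rule 6), so '"' must be escaped too.
    return escape_text(s).replace('"', "&quot;")

def tag(elem, suffix=""):
    # One space before each attribute, no extra spaces (rule 5).
    attrs = "".join(' {}="{}"'.format(k, escape_attr(v)) for k, v in elem.attrib.items())
    return "<{}{}{}>".format(elem.tag, attrs, suffix)

def serialize(elem, depth=0):
    """Return the lines for elem; assumes no mixed content."""
    indent = "\t" * depth
    children = list(elem)
    if children:
        lines = [indent + tag(elem)]                      # rule 1
        for child in children:
            lines.extend(serialize(child, depth + 1))
        lines.append("{}</{}>".format(indent, elem.tag))  # rule 2
        return lines
    if not elem.text:
        return [indent + tag(elem, "/")]                  # rule 3
    return ["{}{}{}</{}>".format(indent, tag(elem), escape_text(elem.text), elem.tag)]

source = "<ldml  draft='true' ><identity><language type='en'></language></identity></ldml>"
print("\n".join(serialize(ET.fromstring(source))))
```

Joining the lines with a single newline also satisfies rule 4, since no blank lines are produced between elements.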

Ordering

  1. Element names are ordered by the Element Order Table.
  2. Attribute names are ordered by the Attribute Order Table.
  3. Attribute value comparison is a bit more complicated, and may depend on the attribute and type. Compare two values by using the following steps:
    1. If both values are in the Value Order Table, compare according to the order in the table. Otherwise, if just one is, it goes first.
    2. If both values are numeric (digits [0-9] only), compare numerically (2 < 12). Otherwise, if just one is numeric, it goes first.
    3. Otherwise, values are ordered alphabetically.
  4. An attribute-value pair is ordered first by attribute name, and then if the attribute names are identical, by the value.
  5. An element is ordered first by the element name, and then if the element names are identical, by the sorted set of attribute-value pairs (sorted according to rule 4). For the latter, compare the first pair of each element (in sorted order). If those are identical, go to the second pair, etc.
  6. Any future additions to the DTD must be structured so as to allow compatibility with this ordering.
  7. See also Appendix K: Valid Attribute Values
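
The comparison steps above can be sketched as Python sort keys. This is only an illustration under stated assumptions: ELEMENT_ORDER and ATTRIBUTE_ORDER are short placeholder lists rather than the real order tables, VALUE_ORDER holds just two sample entries, and all function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Placeholder tables for illustration only; the real ones are much longer.
ELEMENT_ORDER = ["weekendStart", "weekendEnd", "dateFormatLength", "era"]
ATTRIBUTE_ORDER = ["type", "day", "time"]
VALUE_ORDER = {
    ("weekendStart", "day"): ["sun", "mon", "tue", "wed", "thu", "fri", "sat"],
    ("dateFormatLength", "type"): ["full", "long", "medium", "short"],
}

def value_key(value, table=()):
    """Rule 3: table values first, then numeric values, then alphabetic."""
    if value in table:
        return (0, table.index(value), "")
    if re.fullmatch(r"[0-9]+", value):
        return (1, int(value), "")
    return (2, 0, value)

def pair_key(tag, name, value):
    """Rule 4: attribute name first, then value."""
    table = VALUE_ORDER.get((tag, name), ())
    return (ATTRIBUTE_ORDER.index(name), value_key(value, table))

def element_key(elem):
    """Rule 5: element name first, then the sorted attribute-value pairs."""
    pairs = sorted(pair_key(elem.tag, n, v) for n, v in elem.attrib.items())
    return (ELEMENT_ORDER.index(elem.tag), pairs)

print(sorted(["12", "2", "b", "a"], key=value_key))  # ['2', '12', 'a', 'b']
```

Python compares the lists of pairs element by element, which matches the "first pair, then second pair" procedure. When one list is a prefix of the other, the shorter one sorts first; that tie-break comes from Python, not from the rules above.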

Comments

  1. Comments are of the form <!-- stuff -->.
  2. They are logically attached to a node. There are 4 kinds:
    1. Inline comments always appear after a leaf node, at the end of the same line. They are a single line.
    2. Preblock comments always precede the attachment node, and are indented on the same level.
    3. Postblock comments always follow the attachment node, and are indented on the same level.
    4. The final comment appears after </ldml>.
  3. Multiline comments (except the final comment) have each line after the first indented to one deeper level.

Examples:

<eraAbbr>
	<era type="0">BC</era> <!-- might add alternate BDE in the future -->
...
<timeZoneNames>
	<!-- Note: zones that don't use daylight time need further work --> 
	<zone type="America/Los_Angeles">
	...
	<!-- Note: the following is known to be sparse,
		and needs to be improved in the future -->
	<zone type="Asia/Jerusalem">

Canonicalization

The process of canonicalization is fairly straightforward, except for comments. Inline comments will have any line breaks replaced by a space. There may be cases where a comment has no permitted attachment node, such as the following.

		</dayWidth>
		<!-- some comment -->
	</dayContext>
</days>

In those cases, the comment will be made into a block comment on the nearest preceding leaf node, if that node is at the same level or deeper. (If that node already has a block comment, the new one will be appended to it, with a line break between.) If there is no place to attach the comment (for example, as a result of processing that removes the attachment node), the comment and the XPath of its node will be appended to the final comment in the document.

Multiline comments will have leading tabs stripped, so any indentation should be done with spaces.
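
The two text-level comment rules (line breaks in inline comments, tab stripping in multiline comments) might look like this in Python. The function names are hypothetical, and collapsing the whitespace around a line break into one space is an assumption; the text above only says the line break itself becomes a space.

```python
import re

def normalize_inline(text):
    """Inline comment: replace each line break with a single space."""
    return re.sub(r"\s*[\r\n]+\s*", " ", text.strip())

def format_block(text, depth):
    """Block comment: strip leading tabs from each line, then indent the
    first line by depth tabs and every later line one level deeper."""
    lines = [line.lstrip("\t") for line in text.strip().split("\n")]
    out = ["\t" * depth + "<!-- " + lines[0]]
    out += ["\t" * (depth + 1) + line for line in lines[1:]]
    out[-1] += " -->"
    return "\n".join(out)

comment = "Note: the following is known to be sparse,\n\t\t\tand needs to be improved in the future"
print(format_block(comment, 1))
```

Because only tabs are stripped, indentation written with spaces inside a comment survives canonicalization, which is the reason for the advice above.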


Element Order Table

The organization into bullets is purely for clarity; the ordering is established by which comes first in the overall list. Note that most combinations of pairs of items will never be peer elements, and thus never be compared.

Attribute Order Table

The organization into bullets is purely for clarity; the ordering is established by which comes first in the overall list. Note that most combinations of pairs of items will never appear on the same element, and thus never be compared.

Value Order Table

weekendStart, weekendEnd (attribute day); day (attribute type)
    sun, mon, tue, wed, thu, fri, sat
dateFormatLength, timeFormatLength, dateTimeFormatLength, decimalFormatLength, scientificFormatLength, percentFormatLength, currencyFormatLength
    full, long, medium, short
monthWidth, dayWidth
    wide, abbreviated, narrow
field
    era, year, month, week, day, weekday, dayperiod, hour, minute, second, zone
zone
    The order of prefixes is: America, Atlantic, Europe, Africa, Asia, Indian, Australia, Pacific, Arctic, Antarctica, Etc. Within the same prefix, sort first by longitude, then by latitude (both given by the zone.tab file in the Olson database), then by full tzid.
All other values
    numeric order, then alphabetic order
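
The zone ordering can be sketched as a Python sort key. The coordinates below are illustrative values written in the style of zone.tab and should be checked against the actual file. Sorting longitude and latitude in ascending order is an assumption, since the table does not state a direction. The sketch handles only zones listed in ZONE_TAB, and all names are hypothetical.

```python
import re

PREFIX_ORDER = ["America", "Atlantic", "Europe", "Africa", "Asia", "Indian",
                "Australia", "Pacific", "Arctic", "Antarctica", "Etc"]

# Illustrative entries: latitude then longitude, as +-DDMM[SS]+-DDDMM[SS].
ZONE_TAB = {
    "America/Los_Angeles": "+340308-1181434",
    "America/New_York": "+404251-0740023",
    "Europe/London": "+513030-0000731",
    "Asia/Tokyo": "+353916+1394441",
}

def to_degrees(field, degree_digits):
    """Convert a signed DDMM[SS] or DDDMM[SS] field to decimal degrees."""
    sign = -1 if field[0] == "-" else 1
    digits = field[1:]
    degrees = int(digits[:degree_digits])
    minutes = int(digits[degree_digits:degree_digits + 2])
    seconds = int(digits[degree_digits + 2:] or 0)
    return sign * (degrees + minutes / 60 + seconds / 3600)

def parse_coordinates(text):
    """Return (latitude, longitude) in decimal degrees."""
    lat, lon = re.fullmatch(r"([+-]\d{4,6})([+-]\d{5,7})", text).groups()
    return to_degrees(lat, 2), to_degrees(lon, 3)

def zone_key(tzid):
    """Prefix order, then longitude, then latitude, then the full tzid."""
    lat, lon = parse_coordinates(ZONE_TAB[tzid])
    return (PREFIX_ORDER.index(tzid.split("/")[0]), lon, lat, tzid)

print(sorted(ZONE_TAB, key=zone_key))
```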

Defaulted Values Table

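Element             Attribute   Default value
ldml                version     "1.2"
orientation         characters  "left-to-right"
orientation         lines       "top-to-bottom"
weekendStart        time        "00:00"
weekendEnd          time        "24:00"
dateFormat          type        "standard"
timeFormat          type        "standard"
dateTimeFormat      type        "standard"
decimalFormat       type        "standard"
scientificFormat    type        "standard"
percentFormat       type        "standard"
currencyFormat      type        "standard"
pattern             type        "standard"
currency            type        "standard"
collation           type        "standard"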
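
Suppressing defaulted attributes (rule 8 under Textual Content) can be sketched in Python as follows. The DEFAULTS mapping is one reading of the table above, assuming that rows listing only an element name share the attribute and value of the nearest preceding row that states them. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# (element, attribute) -> default value, as read from the table above.
DEFAULTS = {
    ("ldml", "version"): "1.2",
    ("orientation", "characters"): "left-to-right",
    ("orientation", "lines"): "top-to-bottom",
    ("weekendStart", "time"): "00:00",
    ("weekendEnd", "time"): "24:00",
}
for _tag in ("dateFormat", "timeFormat", "dateTimeFormat", "decimalFormat",
             "scientificFormat", "percentFormat", "currencyFormat",
             "pattern", "currency", "collation"):
    DEFAULTS[(_tag, "type")] = "standard"

def strip_defaults(root):
    """Remove every attribute whose value equals its default; edits in place."""
    for elem in root.iter():
        for name in list(elem.attrib):
            if DEFAULTS.get((elem.tag, name)) == elem.attrib[name]:
                del elem.attrib[name]
    return root

doc = ET.fromstring('<ldml version="1.2"><weekendStart day="sat" time="00:00"/></ldml>')
strip_defaults(doc)
print(ET.tostring(doc, encoding="unicode"))
```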