Draft, MED 2004-07-10: Incorporated feedback from last CLDR meeting. Changes marked in yellow.
This is a document for consideration by the CLDR Technical Committee of the Unicode Consortium. It is also being circulated to the tz mailing group for comment, since the Olson time zone database is used as the source for time zone identifiers and computation rules.
LDML currently provides a mechanism for localizing Olson Time Zone Identifiers (Olson TZIDs) in CLDR. People can supply 6 different translations per language per ID, plus an exemplar city. For example, in English, for America/Los_Angeles these can be translated as "Pacific Time" ("PT"), "Pacific Standard Time" ("PST"), "Pacific Daylight Time" ("PDT") and "Los Angeles". These translations mark a difference between "generic" usage (aka "wall time") like "Pacific Time" and a fixed offset from UTC like "Pacific Standard Time" or "Pacific Daylight Time", and also allow for both abbreviated and full names.
Here is an example for one of these translations:
<ldml><dates><timeZoneNames><zone type="Europe/Bucharest"><long><daylight>
"Eastern European Daylight Time": ·en· ·nb· ·no·
"Heure Avancée de l’Europe de l’Est": ·fr·
"Hora de verano de Europa del Este": ·es·
"Horário de Verão da Europa Oriental": ·pt·
"Itä-Euroopan kesäaika": ·fi·
"Oost-Europese zomertijd": ·nl·
"Ora Legale Europa Orientale": ·it·
"Östeuropa, sommartid": ·sv·
"Østeuropæisk sommertid": ·da·
"东欧夏令时间": ·zh·
"東欧夏時間": ·ja·
"東歐日光節約時間": ·zh_Hant·
"동부유럽 기준시": ·ko·
...
Note: the above are simply examples. The translation for a time zone identifier does not have to follow this pattern -- it can translate the city name or provide a more general description; it should be whatever is most customary and understandable for the target language in question.
The purpose of having translated time zone identifiers is to allow people with different languages to recognize and distinguish the zones, to:
Why not use just GMT-0800 format? Very briefly, it's because that format does not accurately represent the situation. America/Los_Angeles, for example, is most of the year on GMT-0700, and part of the year on GMT-0800. If you pick one or the other alone, you have the wrong result. (These days it's more technically accurate to write "UTC" instead of "GMT"; however, for translation purposes the term GMT may be more familiar to people, and we won't distinguish between the two in this document.) Note: the Olson TZIDs use the opposite sign from RFC 822 in their GMT formats: Etc/GMT+8 = GMT-0800.
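The sign inversion noted above is easy to get wrong, so here is a minimal sketch of the conversion. The function name `etc_gmt_to_rfc822` is hypothetical and is not part of the Olson database or CLDR; it only handles whole-hour IDs of the form Etc/GMT+N or Etc/GMT-N.

```python
import re


def etc_gmt_to_rfc822(tzid: str) -> str:
    """Convert an Olson 'Etc/GMT+N' style ID to an RFC 822 style 'GMT-NN00' string."""
    match = re.fullmatch(r"Etc/GMT([+-])(\d{1,2})", tzid)
    if match is None:
        raise ValueError(f"not an Etc/GMT offset ID: {tzid!r}")
    # The Olson sign is the opposite of the RFC 822 sign, so flip it.
    sign = "-" if match.group(1) == "+" else "+"
    hours = int(match.group(2))
    return f"GMT{sign}{hours:02d}00"


print(etc_gmt_to_rfc822("Etc/GMT+8"))
```

For the ID quoted above, this prints GMT-0800.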
There are 558 Olson TZIDs as used in CLDR. They are organized by cities -- roughly; there are a number of old TZIDs retained for compatibility (and Java adds some old, mistaken TZIDs of its own). Each country maps to a set of one or more zones, unique to that country. The database has alias links between compatibility TZIDs and "real" TZIDs. The TZIDs are slightly changed from Olson in having "_" substituted for spaces.
There are a few flaws in this structure: some ISO country IDs have no associated Olson TZIDs (YU, BV, HM), and one cannot resolve all the aliases without leaving still more countries with no Olson TZIDs. For the purposes of this document, we will assume that a "repaired" database is being used, and that all zones with no country map to the "ZZ" code (a private use code in ISO 3166).
A cleaned-up list amounts to 407 entries. Of these, there are still many "perpetual" equivalents: TZIDs that always produce the same result over all time (e.g. Europe/Rome = Europe/Vatican). If we picked one exemplar from each of these sets of equivalents, we would end up with 385 different exemplars.
We can also distinguish "modern" equivalents: those that produce the same result for the current year and the foreseeable near future. If we picked one exemplar from each of these sets of equivalents, there would be 88 exemplars. There are also some "suspicious" TZIDs like WET, CET, MET, EET, Asia/Riyadh87, Asia/Riyadh88, Asia/Riyadh89, which may lower the number slightly if removed. For example, in Canada there are the following TZIDs. The items separated by commas are modern equivalents of one another, and all are within the same country (Canada). Thus America/Dawson, America/Whitehorse, America/Vancouver are not distinguished by country, and all behave the same nowadays.
The zone_log.html provides a breakdown of various types of information from the Olson time zone database. It is based on the current Java data, so it may be slightly out of date, and does contain some older Java aliases. This is purely informational, to give a view of the timezone data, and in no way is meant to replace that data.
There are 90 different languages represented in CLDR 1.1 (not counting RFC 3066 variants, such as Hant vs. Hans). There are 487 different ISO 639 codes, currently (again, not counting some important variants, such as Hant vs. Hans). Clearly, we don't want to be in the business of representing all of the possible combinations: about 8,000 strings for modern zones with current CLDR languages; over 100,000 strings for the combinations of all Olson IDs with all ISO 639 languages.
The CLDR does provide the ability to have exemplar cities that can be used in translation, although we currently do not have much data for those at all. For example, here are translations for London:
<ldml><dates><timeZoneNames><zone type="GMT"><exemplarCity>
"Londain": ·ga·
"Londen": ·nl·
"London": ·da· ·en· ·fr· ·sv·
"Londra": ·it·
"Londres": ·es· ·pt·
"Lontoo": ·fi·
"ロンドン": ·ja·
"伦敦": ·zh·
"倫敦": ·zh_Hant·
"런던": ·ko·
...
A great part of the motivation for this proposal is to cut down on the amount of data required, just from the sheer magnitude of the problem when you multiply the figures by the 90 languages currently in CLDR, plus the many more languages to come. Depending on city data alone would be very painful. We already have in CLDR a lot of country data, so if we can leverage that it really helps. Let's look at the figures. There are 239 countries. Of them, 210 have a single zone. Using a country name for each of them is essentially free. Of the remainder, 8 only have multiple zones historically. So the modern ones are again essentially free. Of the rest, cities might be the best way to go. We would need 99 cities for modern zone distinctions, 140 if we added historic also. If you multiply that by 90 languages it is still a lot of data, but far better than the 558 x 90 we are faced with now.
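To make the scale of the saving concrete, the figures quoted above can be multiplied out. This is only arithmetic on the numbers given in this document; the variable names are illustrative.

```python
# Figures quoted in the text above.
countries = 239
single_zone_countries = 210
historic_only_multi_zone = 8
languages = 90

# Countries that still need city names to distinguish their modern zones.
countries_needing_cities = countries - single_zone_countries - historic_only_multi_zone

modern_cities = 99   # cities needed for modern zone distinctions
all_cities = 140     # cities needed if historic distinctions are added
all_tzids = 558      # every Olson TZID used in CLDR

modern_strings = modern_cities * languages
historic_strings = all_cities * languages
current_strings = all_tzids * languages

print(countries_needing_cities)  # 21
print(modern_strings)            # 8910
print(historic_strings)          # 12600
print(current_strings)           # 50220
```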
Moreover, we don't currently provide a good fallback mechanism in case it is not worth the time to translate particular zones. We want to be able to make sure that even if a translation is missing, the fallback can be expressed in a form understandable to the end user. For example, we don't currently provide for the "GMT" format itself to be localized.
It is hard for us to judge exactly the priority that people in a given country will give to particular zones; our goal is to make it possible to have reasonable fallback behavior for zones that they don't want to translate, and give them guidance as to the effects of their choices. They may have different priorities, once they understand that. For example, it may be that in Ukraine a lot of business is done with Russia, so it is worthwhile to translate all the Russian zones in detail; but for Australia they may depend on the fallback policy.
Here is a proposal for how to deal with these problems. The goal is to leverage three facts: that CLDR already has country translations; that there is a high degree of correlation between countries and timezones; and that denoting timezones by reference to countries is fairly common.
We'll start with a definition.
Offset-List for a TimeZone (modern). Start with an empty list. Add the GMT offset on December 31 at 23:59:59. Walk backwards through the rest of the year, and if the offset ever changes, and the offset is not in the list, then add that offset to the list. Thus for America/Los_Angeles, the list is <-8, -7>. For Australia/Melbourne it is <11, 10>, and for Africa/Algiers it is the one-element list <1>.
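The definition above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `offset_list` and `zone_offset_at` are hypothetical, the walk is done in one-hour steps in UTC (an assumption, since the definition does not say how finely to sample or whether December 31 is local or UTC), and offsets are expressed in hours. The optional helper relies on Python's standard `zoneinfo` module, which needs time zone data to be installed.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, List


def offset_list(offset_at: Callable[[datetime], float], year: int) -> List[float]:
    """Return the distinct GMT offsets (in hours) seen in `year`, latest first.

    `offset_at` maps an aware UTC datetime to the zone's offset from GMT.
    """
    t = datetime(year, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    offsets: List[float] = []
    # Walk backwards through the year; record each offset the first time it appears.
    while t >= start:
        offset = offset_at(t)
        if offset not in offsets:
            offsets.append(offset)
        t -= timedelta(hours=1)  # assumed sampling step
    return offsets


def zone_offset_at(tzid: str) -> Callable[[datetime], float]:
    """Build an offset function for an Olson TZID (assumes tz data is available)."""
    from zoneinfo import ZoneInfo  # imported lazily so the core function has no dependency

    tz = ZoneInfo(tzid)
    return lambda t: t.astimezone(tz).utcoffset().total_seconds() / 3600
```

With time zone data installed, `offset_list(zone_offset_at("America/Los_Angeles"), 2004)` should reproduce the <-8, -7> list given in the definition, and a zone with no daylight time yields a one-element list.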
Here is a proposed lookup mechanism for wall time. It uses formats that are explained below. The examples are purely for illustration, and don't represent any particular language. Examples are in italic following the steps.
There will be a setting in the locale data that controls the above process. It is an element, with a type attribute, in:
<timeZoneNames>
<abbreviationFallback type="GMT"> // causes any "long" match (* above) to be skipped
or
<timeZoneNames>
<abbreviationFallback type="long"> // includes all steps above
Once this is put into place the translators have a clear strategy. They always need to translate the new format strings. Wherever the results of the above algorithm are inadequate, they can translate the precise Olson ID, any of the 407 "real" IDs, to get an exact result. If exact translations are needed, they can be prioritized: first the modern aliases, then the "perpetual" aliases, then the rest.
The above introduces some new format strings, which would be added to CLDR. All but the first are regular MessageFormat strings. This set would be added to the strings that would be localized for a locale. Note that it would be one set per locale, not one set per zone. Note also that for some languages, some of these choices would not be suitable; there may be grammatical interaction between the substituted elements and the rest of the pattern. To handle that case, it may be useful to have some special pattern (like "") to indicate that the choice should be skipped.
| Element Name | Pattern Examples | Example Results |
|---|---|---|
| hour-format | "+HHmm;-HHmm" | "+1200" |
| | | "-1200" |
| hours-format | "{0}/{1}" | "-0800/-0700" |
| gmt-format | "GMT{0}" | "GMT-0800" |
| | "{0}ВпГ" | "-0800ВпГ" |
| region-format | "{0} Time" | "Japan Time" |
| | "Tiempo de {0}" | "Tiempo de Japón" |
| fallback-format | "Tiempo de «{0}»" | "Tiempo de «Tokyo»" |
The hours formats are used to compose what goes into the gmt and region-gmt formats.
Note that the results are semi-reversible; one cannot necessarily recover the exact time zone that one started with, but can recover a modern equivalent.
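To illustrate how the format strings in the table compose, here is a minimal sketch. The helper names `format_hour` and `apply_pattern` are hypothetical, and the `{0}`/`{1}` substitution is a simplified stand-in for real MessageFormat processing; only the pattern strings themselves come from the table.

```python
def format_hour(offset_minutes: int, hour_format: str = "+HHmm;-HHmm") -> str:
    """Apply an hour-format pattern (positive;negative) to an offset in minutes."""
    positive, negative = hour_format.split(";")
    pattern = positive if offset_minutes >= 0 else negative
    hours, minutes = divmod(abs(offset_minutes), 60)
    return pattern.replace("HH", f"{hours:02d}").replace("mm", f"{minutes:02d}")


def apply_pattern(pattern: str, *args: str) -> str:
    """Substitute {0}, {1}, ... in a pattern (simplified MessageFormat stand-in)."""
    for index, value in enumerate(args):
        pattern = pattern.replace("{%d}" % index, value)
    return pattern


standard = format_hour(-8 * 60)   # "-0800"
daylight = format_hour(-7 * 60)   # "-0700"

print(apply_pattern("{0}/{1}", standard, daylight))  # hours-format
print(apply_pattern("GMT{0}", standard))             # gmt-format
print(apply_pattern("Tiempo de {0}", "Japón"))       # region-format
```

The three printed lines match the corresponding example results in the table: -0800/-0700, GMT-0800, and Tiempo de Japón.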
In addition, we need some new syntax characters for output (for input, all should be accepted). Currently we have the following (See Dates).
| Symbol | Meaning | Example |
|---|---|---|
| zzz | General time zone | PST/PDT |
| zzzz | | Pacific Standard Time/Pacific Daylight Time |
| Z | RFC 822 time zone | -0800 |
To this, we need to add a way to get short/full wall time, and to force GMT+/-0800 format. Here is a proposal:
| Symbol | Meaning | Example |
|---|---|---|
| z | General time zone | PT |
| zz | | Pacific Time |
| ZZ | GMT Format | GMT-0800 |
A lot of people just don't care about historic differences. To account for that in UIs, we could allow translators to pick out a smaller set of "preferred" modern exemplar timezones. One can then show a list of these timezones (and optionally with an advanced button or some other device, show the larger list). The set of modern equivalents for any point in time can be calculated, but the issue is that the choice of exemplars for those modern equivalents may differ according to locale.
Note: This is not merely an issue of ease-of-use: it is also important when one wants to specify the desired behavior in the presence of likely future changes.
The goal is to satisfy the above requirement, to require little data to be added for each locale, and to be relatively robust in the face of future changes to the Olson database. The proposal is to add an (optional) preference list to each locale. This preference list is a list of Olson IDs. It would work as follows, in seeing whether Olson ID x is preferred to Olson ID y, where they are both modern equivalents:
Thus a list for Mexico could simply consist of one item: "America/Mexico_City". This would have the following results (for the modern-equivalent IDs in Mexico):
| Exemplar | Modern Equivalents |
|---|---|
| America/Mexico_City | America/Merida, America/Monterrey, America/Cancun |
| America/Chihuahua | America/Mazatlan |
| America/Hermosillo | |
| America/Tijuana | |
Note that Chihuahua is preferred over Mazatlan because it is alphabetically prior. The suggested Root preferences are listed in boldface in the equivalent modern zones table.
The proposed XML representation of this in LDML is simply an attribute value with items separated by spaces, e.g.
<timeZoneNames preferenceOrdering="America/Mexico_City America/Chihuahua America/New_York">
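The preference rule and the space-separated attribute value can be sketched as follows. The exact comparison rules are not reproduced in this text, so this sketch assumes the behavior implied by the Mexico example: an ID in the preference list beats one that is not, an earlier list entry beats a later one, and otherwise the alphabetically prior ID wins. The function names are hypothetical.

```python
from typing import List


def parse_preference_ordering(attribute_value: str) -> List[str]:
    """Split a preferenceOrdering attribute value into Olson IDs."""
    return attribute_value.split()


def pick_exemplar(equivalents: List[str], preferences: List[str]) -> str:
    """Choose the preferred ID from a set of modern equivalents.

    Assumed rule: listed IDs rank by list position; unlisted IDs rank after
    all listed ones and are ordered alphabetically among themselves.
    """
    rank = {tzid: position for position, tzid in enumerate(preferences)}
    unlisted = len(preferences)
    return min(equivalents, key=lambda tzid: (rank.get(tzid, unlisted), tzid))


preferences = parse_preference_ordering("America/Mexico_City")

central = ["America/Cancun", "America/Merida", "America/Mexico_City", "America/Monterrey"]
mountain = ["America/Mazatlan", "America/Chihuahua"]

print(pick_exemplar(central, preferences))   # America/Mexico_City
print(pick_exemplar(mountain, preferences))  # America/Chihuahua
```

Under these assumptions the sketch reproduces the exemplars in the Mexico table: Mexico_City wins because it is listed, and Chihuahua wins over Mazatlan on alphabetical order alone.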
The current CLDR has translations for a number of locales. However, the IDs that are used as keys may not be "real" IDs, so we would need to convert them to real IDs. We may also want to do a slight modification of the IDs to use only invariant characters in them.
It would assist the CLDR effort if there were minor modifications to the explicit data in the Olson time zone database.
Issues: