Time Zone Localization

Draft, MED 2004-07-10: Incorporated feedback from last CLDR meeting. Changes marked in yellow.

This is a document for consideration by the CLDR Technical Committee of the Unicode Consortium. It is also being circulated to the tz mailing group for comment, since the Olson time zone database is used as the source for time zone identifiers and computation rules.

LDML currently provides a mechanism for localizing Olson Time Zone Identifiers (Olson TZIDs) in CLDR. People can supply 6 different translations per language per ID, plus an exemplar city. For example, in English, for America/Los_Angeles these can be translated as "Pacific Time" ("PT"), "Pacific Standard Time" ("PST"), "Pacific Daylight Time" ("PDT") and "Los Angeles". These translations mark a difference between "generic" usage (aka "wall time") like "Pacific Time" and a fixed offset from UTC like "Pacific Standard Time" or "Pacific Daylight Time", and also allow for both abbreviated and full names.

Here is an example for one of these translations:

<ldml><dates><timeZoneNames><zone type="Europe/Bucharest"><long><daylight>
"Eastern European Daylight Time": ·en· ·nb· ·no·
"Heure Avancée de l’Europe de l’Est": ·fr·
"Hora de verano de Europa del Este": ·es·
"Horário de Verão da Europa Oriental": ·pt·
"Itä-Euroopan kesäaika": ·fi·
"Oost-Europese zomertijd": ·nl·
"Ora Legale Europa Orientale": ·it·
"Östeuropa, sommartid": ·sv·
"Østeuropæisk sommertid": ·da·
"东欧夏令时间": ·zh·
"東欧夏時間": ·ja·
"東歐日光節約時間": ·zh_Hant·
"동부유럽 기준시": ·ko·
...

Note: the above are simply examples. The translation for a time zone identifier does not have to follow this pattern -- it can translate the city name or provide a more general description; it should be whatever is most customary and understandable for the target language in question.

The purpose of having translated timezone identifiers is to allow people using different languages to recognize and distinguish the zones.

Why not just use the GMT-0800 format? Very briefly, it's because that format does not accurately represent the situation. America/Los_Angeles, for example, is on GMT-0700 for most of the year and on GMT-0800 for the rest. If you pick one or the other alone, you get the wrong result. (These days it's more technically accurate to write "UTC" instead of "GMT"; however, for translation purposes the term GMT may be more familiar to people, and we won't distinguish between the two in this document.) Note: the Olson TZIDs use the opposite sign from RFC 822 in their GMT formats: Etc/GMT+8 = GMT-0800.
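The sign reversal is easy to get wrong, so here is a minimal Python sketch of the conversion. The function name etc_gmt_to_rfc822 is hypothetical, and the sketch only handles the whole-hour Etc/GMT+N and Etc/GMT-N forms.

```python
import re


def etc_gmt_to_rfc822(tzid: str) -> str:
    """Convert an Olson Etc/GMT ID to the RFC 822 style, flipping the sign."""
    match = re.fullmatch(r"Etc/GMT([+-])(\d{1,2})", tzid)
    if match is None:
        raise ValueError(f"not an Etc/GMT+N or Etc/GMT-N identifier: {tzid}")
    # Olson "+" means west of Greenwich, which RFC 822 writes as "-".
    sign = "-" if match.group(1) == "+" else "+"
    return f"GMT{sign}{int(match.group(2)):02d}00"


print(etc_gmt_to_rfc822("Etc/GMT+8"))  # GMT-0800
```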

The Problems

There are 558 Olson TZIDs as used in CLDR. They are organized by cities -- roughly; there are a number of old TZIDs retained for compatibility (and Java adds some old, mistaken TZIDs of its own). Each country maps to a set of one or more zones, unique to that country. The database has alias links between compatibility TZIDs and "real" TZIDs. The TZIDs are slightly changed from Olson in having "_" substituted for spaces.

There are a few flaws in this structure: there are ISO country IDs that don't have any associated Olson TZIDs (YU, BV, HM), and you can't resolve away all the aliases, or still more countries would lack Olson TZIDs. For the purposes of this document, we will assume that a "repaired" database is being used, and that all zones with no country map to the "ZZ" code (a private use code in ISO 3166).

Anyway, a cleaned up list amounts to 407 entries. Of these, there are still many "perpetual" equivalents: TZIDs that always produce the same result over all time (e.g. Europe/Rome = Europe/Vatican). If we picked one exemplar from each of these sets of equivalents, we end up with 385 different exemplars.

We can also distinguish "modern" equivalents: those that produce the same result for the current year and the foreseeable near future. If we picked one exemplar from each of these sets of equivalents, there would be 88 exemplars. There are also some "suspicious" TZIDs like WET, CET, MET, EET, Asia/Riyadh87, Asia/Riyadh88, Asia/Riyadh89, which may lower the number slightly if removed. For example, in Canada there are the following TZIDs. The items separated by commas are modern equivalents, all within the same country (Canada). Thus America/Dawson, America/Whitehorse, America/Vancouver are not distinguished by country, and all behave the same nowadays.
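Grouping into modern equivalents can be sketched in Python as follows. This assumes each TZID has already been reduced to a "signature" that captures its country and current behavior; the signature values here (standard offset in hours plus a rule label) are hypothetical placeholders, not data taken from the Olson database.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def modern_equivalents(signatures: Dict[str, Hashable]) -> List[List[str]]:
    """Group TZIDs that share the same (country, current rules) signature."""
    groups = defaultdict(list)
    for tzid in sorted(signatures):
        groups[signatures[tzid]].append(tzid)
    return list(groups.values())


# Hypothetical signatures: (country, standard offset, daylight rule label).
canada = {
    "America/Dawson": ("CA", -8, "Canada"),
    "America/Whitehorse": ("CA", -8, "Canada"),
    "America/Vancouver": ("CA", -8, "Canada"),
    "America/Edmonton": ("CA", -7, "Canada"),
    "America/Regina": ("CA", -6, None),
}

for group in modern_equivalents(canada):
    print(", ".join(group))
```

Picking one exemplar per group then yields the reduced set of zones discussed above.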

The zone_log.html provides a breakdown of various types of information from the Olson time zone database. It is based off of the current Java data, so it may be slightly out of date, and does contain some older Java aliases. This is purely informational, to give a view of the timezone data, and in no way is meant to replace that data.

There are 90 different languages represented in CLDR 1.1 (not counting RFC 3066 variants, such as Hant vs. Hans). There are 487 different ISO 639 codes, currently (again, not counting some important variants, such as Hant vs. Hans). Clearly, we don't want to be in the business of representing all of the possible combinations: about 8,000 strings for modern zones with current CLDR languages; over 100,000 strings for the combinations of all Olson IDs with all ISO 639 languages.

The CLDR does provide the ability to have exemplar cities that can be used in translation, although we currently do not have much data for those at all. For example, here are translations for London:

<ldml><dates><timeZoneNames><zone type="GMT"><exemplarCity>
"Londain": ·ga·
"Londen": ·nl·
"London": ·da· ·en· ·fr· ·sv·
"Londra": ·it·
"Londres": ·es· ·pt·
"Lontoo": ·fi·
"ロンドン": ·ja·
"伦敦": ·zh·
"倫敦": ·zh_Hant·
"런던": ·ko·
...

A great part of the motivation for this proposal is to cut down on the amount of data required, just from the sheer magnitude of the problem when you multiply the figures by the 90 languages currently in CLDR, plus the many more languages to come. Depending on city data alone would be very painful. We already have a lot of country data in CLDR, so if we can leverage that it really helps. Let's look at the figures. There are 239 countries. Of them, 210 have a single zone. Using a country name for each of them is essentially free. Of the remainder, 8 have multiple zones only historically, so the modern ones are again essentially free. For the rest, cities might be the best way to go. We would need 99 cities for modern zone distinctions, 140 if we added historic ones also. If you multiply that by 90 languages it is still a lot of data, but far better than the 558 x 90 we are faced with now.
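The arithmetic behind these figures can be checked with a few lines of Python. The variable names are ours; the input numbers are the ones quoted above.

```python
# Figures quoted in the text above.
countries = 239
single_zone = 210
historic_only_multi = 8
languages = 90

# Countries that still need city-level distinctions today.
multi_zone_modern = countries - single_zone - historic_only_multi

strings_modern_cities = 99 * languages   # cities for modern distinctions
strings_all_cities = 140 * languages     # adding historic distinctions
strings_all_tzids = 558 * languages      # translating every TZID

print(multi_zone_modern)       # 21
print(strings_modern_cities)   # 8910
print(strings_all_cities)      # 12600
print(strings_all_tzids)       # 50220
```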

Moreover, we don't currently provide a good fallback mechanism in case it is not worth the time to translate particular zones. We want to be able to make sure that even if a translation is missing, the fallback can be expressed in a form understandable to the end user. For example, we don't currently provide for the "GMT" format itself to be localized.

It is hard for us to judge exactly the priority that people in a given country will give to particular zones; our goal is to make it possible to have reasonable fallback behavior for zones that they don't want to translate, and give them guidance as to the effects of their choices. They may have different priorities, once they understand that. For example, it may be that in the Ukraine a lot of business is done with Russia, so it is worthwhile to translate all the Russian zones in detail; but for Australia they may depend on the fallback policy.

Proposal

Here is a proposal for how to deal with these problems. The goal is to leverage the fact that CLDR already has country translations; that there is a high degree of correlation between countries and timezones; that denoting timezones by reference to countries is fairly common.

A. Fallback procedure

We'll start with a definition.

Offset-List for a TimeZone (modern). Start with an empty list. Add the GMT offset on December 31 at 23:59:59. Walk backwards through the rest of the year, and if the offset ever changes, and the offset is not in the list, then add that offset to the list. Thus for America/Los_Angeles, the list is <-8, -7>. For Australia/Melbourne it is <11, 10>, and for Africa/Algiers it is the one-element list <1>.
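As a rough illustration of this definition, here is a Python sketch. It assumes the zone is available as a function from a UTC instant to an offset in hours; the three sample functions are simplified stand-ins that hard-code approximate 2004 transition instants, not real Olson data, and the one-hour step is our choice.

```python
from datetime import datetime, timedelta
from typing import Callable, List


def offset_list(offset_at: Callable[[datetime], float], year: int,
                step: timedelta = timedelta(hours=1)) -> List[float]:
    """Start at December 31, 23:59:59 and walk backwards, collecting new offsets."""
    moment = datetime(year, 12, 31, 23, 59, 59)
    start = datetime(year, 1, 1)
    offsets: List[float] = []
    while moment >= start:
        offset = offset_at(moment)
        if offset not in offsets:
            offsets.append(offset)
        moment -= step
    return offsets


# Simplified stand-ins for 2004 (UTC instants; illustrative only).
def los_angeles_2004(utc: datetime) -> float:
    daylight = datetime(2004, 4, 4, 10) <= utc < datetime(2004, 10, 31, 9)
    return -7 if daylight else -8


def melbourne_2004(utc: datetime) -> float:
    standard = datetime(2004, 3, 27, 16) <= utc < datetime(2004, 10, 30, 16)
    return 10 if standard else 11


def algiers(utc: datetime) -> float:
    return 1


print(offset_list(los_angeles_2004, 2004))  # [-8, -7]
print(offset_list(melbourne_2004, 2004))    # [11, 10]
print(offset_list(algiers, 2004))           # [1]
```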

Here is a proposed lookup mechanism for wall time. It uses formats that are explained below. The examples are purely for illustration, and don't represent any particular language. Examples are in italic following the steps.

  1. Canonicalize the Olson ID, mapping to one of the 407 "real" IDs.
  2. If there is an exact translation, use it. Note that this translation need not be literal at all; it should be whatever is most recognizable to people using the target language.
  3. Else for non-wall-time, use GMT format.
    1. America/Los_Angeles => "GMT-08:00"
  4. * Else if there is an exemplar city, use it with the region format. The exemplar city may not be the same as the Olson ID city, if another city is much more recognizable for whatever reason. However, it is very strongly recommended that the same city be used.
  5. * Else if there is a country for the time zone, and a translation in the locale for the country name, and the country only has one (modern) timezone, use it with the region format.
  6. Else if it is a perpetual alias for a "real" ID, and if there is an exact translation for that, try #1..#4 with that alias.
  7. * Else fall back to the raw Olson ID (stripping off the prefix, and turning _ into space), using the fallback format.
  8. Else use the (possibly multi-offset) GMT format
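The steps above can be sketched in Python roughly as follows. This is only an illustration under several assumptions: the Zone and Locale classes, their field names, and the function names are hypothetical; Python's str.format stands in for MessageFormat; step 3 uses the first Offset-List entry as a stand-in for the offset in effect; step 6 is simplified to using the alias's exact translation; and "stripping off the prefix" is taken to mean keeping the last path component. The long_fallback flag models the setting that skips the starred steps.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Zone:
    offsets: List[float]                    # Offset-List in hours, e.g. [-8, -7]
    country: Optional[str] = None           # ISO 3166 code, if any
    perpetual_alias: Optional[str] = None   # a "real" ID that always behaves the same


@dataclass
class Locale:
    zone_names: Dict[str, str] = field(default_factory=dict)
    exemplar_cities: Dict[str, str] = field(default_factory=dict)
    country_names: Dict[str, str] = field(default_factory=dict)
    region_format: str = "{0} Time"
    fallback_format: str = "{0} Time"       # assumed English pattern
    long_fallback: bool = True              # False skips the starred steps


def gmt_name(offsets: List[float]) -> str:
    """Format one or more offsets as e.g. "GMT-0800" or "GMT-0800/-0700"."""
    parts = []
    for hours in offsets:
        minutes = round(abs(hours) * 60)
        sign = "-" if hours < 0 else "+"
        parts.append(f"{sign}{minutes // 60:02d}{minutes % 60:02d}")
    return "GMT" + "/".join(parts)


def zone_display_name(tzid: str, zones: Dict[str, Zone], links: Dict[str, str],
                      single_zone_countries: Set[str], locale: Locale,
                      wall_time: bool = True) -> str:
    tzid = links.get(tzid, tzid)                                  # step 1
    zone = zones[tzid]
    if tzid in locale.zone_names:                                 # step 2
        return locale.zone_names[tzid]
    if not wall_time:                                             # step 3
        return gmt_name(zone.offsets[:1])
    if locale.long_fallback:
        if tzid in locale.exemplar_cities:                        # step 4 (*)
            return locale.region_format.format(locale.exemplar_cities[tzid])
        if (zone.country in single_zone_countries
                and zone.country in locale.country_names):        # step 5 (*)
            return locale.region_format.format(locale.country_names[zone.country])
    if zone.perpetual_alias in locale.zone_names:                 # step 6
        return locale.zone_names[zone.perpetual_alias]
    if locale.long_fallback:                                      # step 7 (*)
        city = tzid.split("/")[-1].replace("_", " ")
        return locale.fallback_format.format(city)
    return gmt_name(zone.offsets)                                 # step 8


zones = {
    "America/Los_Angeles": Zone([-8, -7], "US"),
    "Asia/Tokyo": Zone([9], "JP"),
    "Europe/Vatican": Zone([1, 2], "VA", perpetual_alias="Europe/Rome"),
}
links = {"US/Pacific": "America/Los_Angeles"}
single = {"JP", "VA"}
english = Locale(zone_names={"Europe/Rome": "Central European Time"},
                 country_names={"JP": "Japan"})

print(zone_display_name("Asia/Tokyo", zones, links, single, english))
print(zone_display_name("US/Pacific", zones, links, single, english))
```

Note that step 7 always succeeds when it is enabled, so in this sketch step 8 is only reached when the starred steps are skipped.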

There will be a setting in the locale data that controls the above process. It is expressed as an element in:

<timeZoneNames>
  <abbreviationFallback type="GMT"> // causes any "long" match (* above) to be skipped

or

<timeZoneNames>
  <abbreviationFallback type="long"> // includes all steps above

Once this is put into place the translators have a clear strategy. They always need to translate the new format strings. Wherever the results of the above algorithm are inadequate, they can translate the precise Olson ID, any of the 407 "real" IDs, to get an exact result. If exact translations are needed, they can be prioritized: first the modern aliases, then the "perpetual" aliases, then the rest.

B. Format Strings

The above introduces some new format strings, which would be added to CLDR. All but the first are regular MessageFormat strings. This set would be added to the strings that are localized for a locale. Note that it would be one set per locale, not one set per zone. Note also that for some languages, some of the above choices would not be suitable; there may be grammatical interaction between the substituted elements and the rest of the pattern. To handle that case, it may be useful to have some special pattern (like "") to indicate that the choice should be skipped.

Element Name     Pattern Examples     Example Results
hour-format      "+HHmm;-HHmm"        "+1200", "-1200"
hours-format     "{0}/{1}"            "-0800/-0700"
gmt-format       "GMT{0}"             "GMT-0800"
                 "{0}ВпГ"             "-0800ВпГ"
region-format    "{0} Time"           "Japan Time"
                 "Tiempo de {0}"      "Tiempo de Japón"
fallback-format  "Tiempo de «{0}»"    "Tiempo de «Tokyo»"

The hour-format and hours-format are used to compose what goes into the gmt and region-gmt formats.
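To show how these strings compose, here is a small Python sketch. The function name apply_hour_format is hypothetical, and Python's str.format is used as a stand-in for MessageFormat, which is adequate only for simple "{0}"-style patterns like the ones in the table.

```python
def apply_hour_format(pattern: str, offset_minutes: int) -> str:
    """Apply a "positive;negative" hour-format such as "+HHmm;-HHmm"."""
    positive, negative = pattern.split(";")
    template = negative if offset_minutes < 0 else positive
    hours, minutes = divmod(abs(offset_minutes), 60)
    return template.replace("HH", f"{hours:02d}").replace("mm", f"{minutes:02d}")


hour_format = "+HHmm;-HHmm"
winter = apply_hour_format(hour_format, -8 * 60)   # "-0800"
summer = apply_hour_format(hour_format, -7 * 60)   # "-0700"

# hours-format joins two offsets; gmt-format wraps the result.
both = "{0}/{1}".format(winter, summer)            # "-0800/-0700"
print("GMT{0}".format(winter))                     # GMT-0800
print("GMT{0}".format(both))                       # GMT-0800/-0700

# region-format and fallback-format substitute a name.
print("Tiempo de {0}".format("Japón"))             # Tiempo de Japón
print("Tiempo de «{0}»".format("Tokyo"))           # Tiempo de «Tokyo»
```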

Note that the results are semi-reversible; one cannot necessarily recover the exact time zone that one started with, but can recover a modern equivalent.

C. Syntax Characters

In addition, we need some new syntax characters for output (for input, all should be accepted). Currently we have the following (See Dates).

zzz   General time zone   PST/PDT
zzzz                      Pacific Standard Time/Pacific Daylight Time
Z     RFC 822 time zone   -0800

To this, we need to add a way to get short/full wall time, and to force GMT+/-0800 format. Here is a proposal:

z     General time zone   PT
zz                        Pacific Time
ZZ    GMT Format          GMT-0800
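Taken together, the existing and proposed pattern characters could be dispatched as in the following Python sketch. The function name, the dictionary keys ("short-generic" and so on), and the hard-coded "GMT" prefix are assumptions made for illustration; a real implementation would take the prefix from the localized gmt-format.

```python
from typing import Dict


def format_zone(pattern: str, names: Dict[str, str],
                daylight: bool, offset_minutes: int) -> str:
    """Pick the zone text for one of the pattern strings z, zz, zzz, zzzz, Z, ZZ."""
    sign = "-" if offset_minutes < 0 else "+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    rfc822 = f"{sign}{hours:02d}{minutes:02d}"
    specific = "daylight" if daylight else "standard"
    table = {
        "z": names["short-generic"],         # proposed: short wall time
        "zz": names["long-generic"],         # proposed: full wall time
        "zzz": names[f"short-{specific}"],   # existing: PST/PDT
        "zzzz": names[f"long-{specific}"],   # existing: full standard/daylight
        "Z": rfc822,                         # existing: RFC 822
        "ZZ": f"GMT{rfc822}",                # proposed: forced GMT format
    }
    return table[pattern]


pacific = {
    "short-generic": "PT", "long-generic": "Pacific Time",
    "short-standard": "PST", "long-standard": "Pacific Standard Time",
    "short-daylight": "PDT", "long-daylight": "Pacific Daylight Time",
}

for pattern in ("z", "zz", "zzz", "zzzz", "Z", "ZZ"):
    print(pattern, format_zone(pattern, pacific, False, -480))
```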

D. Preferred Modern IDs.

A lot of people just don't care about historic differences. To account for that in UIs, we could allow translators to pick out a smaller set of "preferred" modern exemplar timezones. One can then show a list of these timezones (and optionally with an advanced button or some other device, show the larger list). The set of modern equivalents for any point in time can be calculated, but the issue is that the choice of exemplars for those modern equivalents may differ according to locale.

Note: This is not merely an issue of ease-of-use: it is also important when one wants to specify the desired behavior in the presence of likely future changes.

The goal is to satisfy the above requirement, require little data to be added for each locale, and be relatively robust in the face of future changes to the Olson database. The proposal is to add an (optional) preference list to each locale. This preference list is a list of Olson IDs. It would be consulted when deciding whether Olson ID x is preferred to Olson ID y, where both are modern equivalents.

Thus a list for Mexico could simply consist of a single item: "America/Mexico_City". This would have the following results (for the modern-equivalent IDs in Mexico):

Exemplar             Modern Equivalents
America/Mexico_City  America/Merida, America/Monterrey, America/Cancun
America/Chihuahua    America/Mazatlan
America/Hermosillo
America/Tijuana

Note that Chihuahua is preferred over Mazatlan because it is alphabetically prior. The suggested Root preferences are listed in boldface in the equivalent modern zones table.

The proposed XML representation of this in LDML is simply an attribute value with items separated by spaces, e.g.

<timeZoneNames preferenceOrdering="America/Mexico_City America/Chihuahua America/New_York">
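Here is a Python sketch of how such a preference list might be applied. The comparison rule used here is an assumption inferred from the Mexico example: an ID in the list beats one that is not, two listed IDs are ranked by list position, and otherwise the alphabetically earlier ID wins. The function name pick_exemplar is hypothetical.

```python
from typing import Iterable


def pick_exemplar(equivalents: Iterable[str], preference_ordering: str) -> str:
    """Choose the preferred ID from one set of modern equivalents."""
    preferences = preference_ordering.split()   # space-separated attribute value
    rank = {tzid: position for position, tzid in enumerate(preferences)}
    unlisted = len(preferences)
    # Sort key: list position first (unlisted IDs last), then alphabetical order.
    return min(equivalents, key=lambda tzid: (rank.get(tzid, unlisted), tzid))


ordering = "America/Mexico_City"
print(pick_exemplar(["America/Merida", "America/Monterrey",
                     "America/Cancun", "America/Mexico_City"], ordering))
print(pick_exemplar(["America/Mazatlan", "America/Chihuahua"], ordering))
```

With the one-item list, the first group resolves to America/Mexico_City (listed) and the second to America/Chihuahua (alphabetically prior), matching the table above.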

E. Conversion

The current CLDR has translations for a number of locales. However, the IDs that are used as keys may not be "real" IDs, so we would need to convert them to real IDs. We may also want to do a slight modification of the IDs to use only invariant characters in them.

F. Requests for the Olson Timezone Database

It would assist the CLDR efforts if there were minor modifications to the explicit data in the Olson time zone database.

  1. A list of the set of links to not skip (because then there would be country IDs with no TZID). See skipped aliases.
    > Atlantic/Jan_Mayen
    > Europe/Bratislava
    > Europe/Ljubljana
    > Europe/San_Marino
    > Europe/Sarajevo
    > Europe/Skopje
    > Europe/Vatican
    > Europe/Zagreb
  2. The addition of unique TZIDs corresponding to the 'missing' ISO country codes BV, HM (so that every ISO country code, no matter how obscure, maps to at least one TZID).
  3. A mapping from some private-use ISO country code to the Etc/GMT* TZIDs (the above suggested ZZ, but any one of the following is available: AA, QM-QZ, XA-XZ and ZZ).
  4. A document in the Olson time zone database that describes the types of IDs, and how to determine the set of IDs and their status (e.g. compatibility TZIDs, other linked TZIDs, 'regular' TZIDs). The CLDR group should be prepared to help contribute to this.

Issues:

  1. We could also introduce "area" IDs, like "Central America", or "West Africa". We could then add a step above:
  2. We could restrict the rules that call for "modern" equivalence to be perpetual equivalence. That would increase the roundtrip accuracy for IDs with translations, but reduce it for the fallback GMT-style.
  3. If we need to fall back, and we are asking for an abbreviated timezone, we have two choices: fall back in the same way as for the full timezone (e.g. to "United States Time, GMT-0600/-0500") or go right to GMT format. Or, we could add abbreviated country names, and an abbreviated format, so that one could have e.g. "JPT" for "Japan Time" (English example).
  4. It may be better to nuke #4.2 and #5, e.g. dropping the GMT formatting, especially where there is more than one offset. GMT format, when there is no daylight savings, does not lose any information (nowadays). Where there is daylight, it does lose information -- although actually not much -- but avoids the problem of using cities that may either be unknown to the user or not in a script s/he can read. The only place where it is ambiguous (within a country) is if you have two zones that have the same summer & winter offsets, but start at different times. That is pretty rare. (Across countries, or historically, it is not quite so rare.)
  5. Others?