Flexible Datetime

MED, 2005-08-09

[From email of 2005-08-02]

The basic design is that a localizer can put whatever formats they want into a list in CLDR.
The thing I like about this method over anything else we discussed is that
the burden on the localizer is small, since all they have to do is provide
lists of formats; there is no extra fancy-dancy scripting, variables,
whatever else that they would need to learn.

For example, here is a list of formats for a particular locale from
OpenOffice.

d. MMM yy
d. MMM yyyy
d. MMMM yyyy
EEE, d. MMM yy
EEE, d. MMMM yyyy
EEEE, d. MMMM yyyy
d. MMM. yyyy
d. MMMM yyyy
EEEE, d. MMMM yyyy
yy-MM-dd
EEE dd.MMM yy
yyyy-MM-dd
ww
MM.yy
dd.MMM
MMMM
'QQ' yy // Note: this is an artifact of my translation from OO, since CLDR doesn't yet have quarters.
dd.MM.yyyy
dd.MM.yy
dd.MM.yy
MM-dd
HH:mm:ss
hh:mm:ss a
HH:mm
hh:mm a
mm:ss,SS

This could go into a new subelement of <dateTimeFormats>, such as

<availableFormats>
  <dateFormatItem>d. MMM yy</dateFormatItem>
   ...
  <dateFormatItem>hh:mm:ss a</dateFormatItem>
    ...
</availableFormats>

We can omit any that are already listed as patterns in dateFormats or
timeFormats. Notice that this is specific to a calendar, although one
calendar can alias this to another.
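
To make the data shape concrete, here is a minimal Python sketch of reading
such a list. The XML fragment is a made-up sample that follows the proposed
layout, and load_available_formats is a hypothetical helper name, not part
of any CLDR tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the proposed <availableFormats> layout.
SAMPLE = """
<dateTimeFormats>
  <availableFormats>
    <dateFormatItem>d. MMM yy</dateFormatItem>
    <dateFormatItem>yyyy-MM-dd</dateFormatItem>
    <dateFormatItem>hh:mm:ss a</dateFormatItem>
  </availableFormats>
</dateTimeFormats>
"""

def load_available_formats(xml_text):
    """Return the pattern strings listed under <availableFormats>, in order."""
    root = ET.fromstring(xml_text)
    return [item.text for item in root.iter("dateFormatItem")]

formats = load_available_formats(SAMPLE)
```

The result is just a flat list of pattern strings, which is all the matching
step needs as input.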

With the API, the user passes in a request string, like "ymd". The request
contains just the field letters, each repeated to the requested length;
order is irrelevant. The list is searched, and the best match is adjusted to
have the same field widths as the request. If there is no good match (e.g.
missing or extra fields), the request is broken into date and time segments
and we try again on each, joining the two results with the date-time pattern
(already in CLDR). For each segment, if that also fails, we take the best
match that covers the most fields and try again with the remaining fields.
We then append the result for the remainder onto that best match, again
using a message format pattern that can be localized.
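
As a rough illustration of the width-adjustment step only (not the fallback
logic), here is a minimal Python sketch. The names request_fields and
adjust_widths are hypothetical and are not taken from FlexibleDateTime.java.
It assumes exact letter matches, so variants such as u vs y are left alone,
and it treats text inside single quotes as literal.

```python
import re

# A quoted literal ('...'), or a run of one repeated ASCII letter (a field).
_TOKEN = re.compile(r"'(?:[^']|'')*'|([A-Za-z])\1*")

def request_fields(request):
    """Map each field letter in a request like 'yyyyMMMd' to its width."""
    return {m.group(1): len(m.group(0))
            for m in _TOKEN.finditer(request) if m.group(1)}

def adjust_widths(pattern, request):
    """Rewrite each field in `pattern` to the width asked for in `request`."""
    wanted = request_fields(request)

    def fix(m):
        letter = m.group(1)
        if letter is None or letter not in wanted:
            return m.group(0)      # quoted literal, or a field not requested
        return letter * wanted[letter]

    return _TOKEN.sub(fix, pattern)
```

For instance, adjust_widths("d. MMM yy", "yyyyMMMd") gives "d. MMM yyyy":
the punctuation and field order come from the localizer's pattern, and only
the widths come from the request.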

What I do for matching is compute a 'distance' to each pattern, and pick the
smallest one. The total distance is the sum of the distances between the
individual fields. Currently I have a difference of 1 between different
lengths (e.g. MMM vs MMMMM), but a difference of 256 between numeric and
string (e.g. MM vs MMM), and a difference of 16 (or multiples) between
variants (like u vs y). If a field is missing or extra, a huge weight is
added. All these weights are tunable per field, so we can modify them to
suit. (I also mechanically add all the single fields, since those aren't in
the OO data. For the single fields, the fields themselves in a 'default'
length can be used as defaults; if those don't work, a localizer can always
supply explicit ones.)
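
The distance idea can be sketched in a few lines of Python. This is not the
getDistance code; it is a simplified illustration under stated assumptions:
the field table is deliberately tiny, the width penalty is a flat 1, the
variant penalty is a flat 16, only month is treated as switching between
numeric and text, and the value 4096 merely stands in for "a huge weight".
All names are hypothetical.

```python
import re

_TOKEN = re.compile(r"'(?:[^']|'')*'|([A-Za-z])\1*")

# Pattern letter -> logical field. Letters sharing a field are variants.
FIELD_OF = {"G": "era", "y": "year", "u": "year", "M": "month", "d": "day",
            "E": "weekday", "a": "dayperiod", "H": "hour", "h": "hour",
            "k": "hour", "K": "hour", "m": "minute", "s": "second"}

WIDTH_DIFF = 1            # e.g. MMM vs MMMMM
VARIANT = 16              # e.g. u vs y
NUMERIC_VS_TEXT = 256     # e.g. MM vs MMM
MISSING_OR_EXTRA = 4096   # assumed value for the "huge weight"

def is_text(letter, width):
    # Assumption: era, weekday and dayperiod are always text; month is text
    # from width 3 up; everything else is numeric.
    return letter in "GEa" or (letter == "M" and width >= 3)

def fields(pattern):
    """Map logical field -> (letter, width), skipping quoted literals."""
    out = {}
    for m in _TOKEN.finditer(pattern):
        letter = m.group(1)
        if letter in FIELD_OF:
            out[FIELD_OF[letter]] = (letter, len(m.group(0)))
    return out

def distance(request, pattern):
    want, have = fields(request), fields(pattern)
    total = 0
    for field in want.keys() | have.keys():
        if field not in want or field not in have:
            total += MISSING_OR_EXTRA
            continue
        (wl, ww), (hl, hw) = want[field], have[field]
        if is_text(wl, ww) != is_text(hl, hw):
            total += NUMERIC_VS_TEXT
        elif ww != hw:
            total += WIDTH_DIFF
        if wl != hl:
            total += VARIANT
    return total

def best_match(request, patterns):
    """Return the pattern with the smallest distance to the request."""
    return min(patterns, key=lambda p: distance(request, p))
```

Because the numeric-vs-text weight dominates the width weight, a request for
MMM prefers a pattern with MMMM over one with MM, which is the behavior the
weights above are meant to produce.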

The code was pretty straightforward; if you are interested you can look at
org.unicode.cldr.test.FlexibleDateTime.java (not production level code, of
course!) or
http://unicode.org/cldr/data/tools/java/org/unicode/cldr/test/FlexibleDateTime.java.
The distance calculation is in getDistance, so you can search for that to
see the code and usage (ignore the first instance).

The production code would want to cache the results, so that the next match
of the same input request (for the same locale!) would be fast.
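
One simple way to get that behavior, sketched in Python with the standard
functools.lru_cache: the cache key is the (locale, request) pair.
expensive_match is a hypothetical stand-in for the real list search, and
CALLS exists only so the example can show which lookups missed the cache.

```python
from functools import lru_cache

CALLS = []  # records cache misses, for demonstration only

def expensive_match(locale, request):
    """Stand-in for the real search over the locale's pattern list."""
    CALLS.append((locale, request))
    return f"<best pattern for {request} in {locale}>"

@lru_cache(maxsize=1024)
def cached_match(locale, request):
    # Keyed on both arguments, so the same request in another locale
    # is a separate cache entry.
    return expensive_match(locale, request)

cached_match("de_DE", "yMd")
cached_match("de_DE", "yMd")   # served from the cache
cached_match("fr_FR", "yMd")   # different locale, so a new search
```

Since field order in a request is irrelevant, a real implementation might
also normalize the request before using it as a key; that is left out here.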

Anyway, here is a test run (look at this with a wide window: the fields are
tab separated).

German (Germany) (de_DE)
Sample Input: Thu Dec 23 01:02:03 PST 1999
Input request:         dMMy
         Fields:         {Day:N}{Month:N}{Year:N}
         Localized Pattern:         y-MM-d
         Sample Results:         1999-12-23
Input request:         kh
         Fields:         {Hour:N}{Hour:N}
         Conflicting fields: k, h
Input request:         GHHmm
         Fields:         {Era:N}{Hour:N}{Minute:N}
         Localized Pattern:         G HH:mm
         Sample Results:         n. Chr. 01:02
Input request:         yyyyHHmm
         Fields:         {Year:N}{Hour:N}{Minute:N}
         Localized Pattern:         'QQ' yyyy HH:mm
         Sample Results:         QQ 1999 01:02
Input request:         Kmm
         Fields:         {Hour:N}{Minute:N}
         Localized Pattern:         h:mm a
         Sample Results:         1:02 vorm.
Input request:         kmm
         Fields:         {Hour:N}{Minute:N}
         Localized Pattern:         H:mm
         Sample Results:         1:02
Input request:         MMdd
         Fields:         {Month:N}{Day:N}
         Localized Pattern:         MM-dd
         Sample Results:         12-23
Input request:         ddHH
         Fields:         {Day:N}{Hour:N}
         Localized Pattern:         dd HH
         Sample Results:         23 01
Input request:         yyyyMMMd
         Fields:         {Year:N}{Month:N}{Day:N}
         Localized Pattern:         d. MMM yyyy
         Sample Results:         23. Dez 1999
Input request:         yyyyMMddHHmmss
         Fields:         {Year:N}{Month:N}{Day:N}{Hour:N}{Minute:N}{Second:N}
         Localized Pattern:         yyyy-MM-dd HH:mm:ss
         Sample Results:         1999-12-23 01:02:03
Input request:         GEEEEyyyyMMddHHmmss
         Fields:         {Era:N}{Weekday:N}{Year:N}{Month:N}{Day:N}{Hour:N}{Minute:N}{Second:N}
         Localized Pattern:         EEEE, dd. MM yyyy [G] HH:mm:ss
         Sample Results:         Donnerstag, 23. 12 1999 [n. Chr.] 01:02:03
Input request:         GuuuuMMMMwwWddDDDFEEEEaHHmmssSSSvvvv
         Fields:         {Era:N}{Year:N}{Month:N}{Week_in_Year:N}{Week_in_Month:N}{Day:N}{Day_Of_Year:N}{Day_of_Week_on_Month:N}{Weekday:N}{Dayperiod:N}{Hour:N}{Minute:N}{Second:N}{Fractional_Second:N}{Zone:N}
         Localized Pattern:         EEEE, dd. MMMM uuuu [ww] [G] [W] [F] [DDD] HH:mm:ss [vvvv] [SSS]
         Sample Results:         Donnerstag, 23. Dezember 1999 [51] [n. Chr.] [4] [4] [357] 01:02:03 [Los Angeles (Vereinigte Staaten)] [000]

In these, the message format for appending leftover fields is just
"{0} [{1}]" -- for testing. This shows up in the last couple of items
because the German data has no patterns with odd combinations like
day-of-week-in-month, and also none with era. One possibility I thought of
is that, for appended fields, we have the option of a localized, ordered
mapping from matched fields (when appending) to the message formats to use
for the appending, e.g.

G => "{0} {1}"
w => "{0} (Woche: {1})"
...

This could be in XML as:
<availableFormats>
  <appendItems request="G">{0} {1}</appendItems>
  <appendItems request="w">{0} (Woche: {1})</appendItems>
...
</availableFormats>

The request could be more than one letter. That would let us append missing
fields onto the best match, and give as good a result as we need (assuming
that the localizers have covered the common cases -- we can start with the
data from OpenOffice).
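
A small Python sketch of how such a mapping might be applied. The table
mirrors the two example entries above, the fallback is the "{0} [{1}]"
format used in the test run, and append_missing is a hypothetical helper
name. Python's str.format happens to accept the same {0}/{1} placeholders,
which is why it is used here.

```python
# Hypothetical per-locale table, mirroring the appendItems example above.
APPEND_ITEMS = {"G": "{0} {1}", "w": "{0} (Woche: {1})"}
DEFAULT_APPEND = "{0} [{1}]"   # the fallback format used in the test run

def append_missing(base, missing, table=APPEND_ITEMS):
    """Append each leftover (request, text) pair onto `base`, in order."""
    result = base
    for request, text in missing:
        template = table.get(request, DEFAULT_APPEND)
        result = template.format(result, text)
    return result
```

For example, append_missing("d. MMM yyyy", [("G", "G")]) yields
"d. MMM yyyy G". Note that if the templates are applied to patterns rather
than to already-formatted text, literal words such as "Woche:" would need
quoting so they are not read as pattern letters; the sketch does not
handle that.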