Lenient Date/Time/Number Parsing

There is a certain amount of leniency built in for numbers and dates, but not as much as we'd like. For example, here is what happens with some patterns (locale="en"):

pattern="yyyy-MM-dd" (numeric month)

  Input                      Result
  "2004-01-01"               exact match
  "2004-1-1"                 missing zeros
  "00002004-000001-00001"    extra zeros
  "2004-1-1"                 wide characters
  "٢٠٠٤-١-١"                 arabic digits
  " 2004- 1- 1"              extra space before digits
  " 2004 - 1 - 1"            fails: extra space before separators
  "2004/1/1"                 fails: different separator
  "2004-Jan-1"               fails: abbreviated month
  "2004- january-1"          fails: full month, lowercase, extra space

pattern="yyyy-MMM-dd" (abbreviated month)

  Input                      Result
  "2004-Jan-1"               abbreviated month
  "2004- january-1"          full month, lowercase, extra space
  "2004-01-01"               fails: numeric

Some of the desired leniency can be achieved with no data changes, but some of it needs additions to CLDR. I thought I'd capture here some ideas that we have had in the past for dealing with this.

Non-CLDR

  1. We should be consistent about allowing extra or missing spaces both before and after fields.
  2. If we fail a parse, try other data for that field. E.g. when parsing the month, first try what is in the pattern; if that fails, try the others among: numeric; narrow, short, long; stand-alone narrow, short, long.
  3. Accept common alternatives. Internally have our own cross-language equivalency sets: {whitespaces}, {apostrophes}, {periods}, etc. If a parse fails, try the other items in the equivalency sets.
  4. If a datetime format fails, try the other complete formats to see if any of them work. E.g.
  5. To control this, add Format.set/getLeniency(). Probably 3 states:
    1. roundtrip (only parse exactly what formats)
    2. normal (as above: flexible on the characters, trying other formats)
    3. loose (allow out-of-range values: e.g. Jan 35 = Feb 4)

    We currently combine #2 and #3 together for dates (see setLenient), but I think we want to separate them, and we want to be able to control leniency on numbers too, so moving it up to the Format superclass makes sense (to me at least!)
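As a rough illustration of items 2 and 5, here is a minimal Python sketch of a three-state leniency setting driving a month fallback. Everything in it is an assumption made for illustration: the names `Leniency`, `parse_month` and `parse_date` are hypothetical and are not existing ICU API, only English month names are covered, and the separator is fixed to "-".

```python
import datetime
from enum import Enum


class Leniency(Enum):
    """Hypothetical three-state setting; not an existing ICU API."""
    ROUNDTRIP = 1  # only parse exactly what formatting produces
    NORMAL = 2     # also tolerate spaces, case, and other forms of a field
    LOOSE = 3      # NORMAL, plus out-of-range values roll over


WIDE = ["January", "February", "March", "April", "May", "June", "July",
        "August", "September", "October", "November", "December"]
ABBREVIATED = [name[:3] for name in WIDE]


def _match_month(token, form, exact):
    """Match token against one form of the month; return 1..12 or None."""
    if form == "numeric":
        if token.isdecimal() and 1 <= int(token) <= 12:
            return int(token)
        return None
    names = ABBREVIATED if form == "abbreviated" else WIDE
    for number, name in enumerate(names, start=1):
        if token == name or (not exact and token.casefold() == name.casefold()):
            return number
    return None


def parse_month(token, pattern_form, leniency):
    """Try the form named by the pattern first, then the other forms."""
    exact = leniency is Leniency.ROUNDTRIP
    forms = [pattern_form]
    if not exact:
        token = token.strip()
        forms += [f for f in ("numeric", "abbreviated", "wide")
                  if f != pattern_form]
    for form in forms:
        month = _match_month(token, form, exact)
        if month is not None:
            return month
    return None


def parse_date(text, month_form, leniency):
    """Parse 'year-month-day'; return a date, or None on failure."""
    parts = text.split("-")
    if len(parts) != 3:
        return None
    month = parse_month(parts[1], month_form, leniency)
    if month is None:
        return None
    fields = []
    for part in (parts[0], parts[2]):
        if leniency is not Leniency.ROUNDTRIP:
            part = part.strip()
        if not part.isdecimal():
            return None
        fields.append(int(part))
    year, day = fields
    try:
        if leniency is Leniency.LOOSE:
            # Roll over: day 35 of January becomes February 4.
            return (datetime.date(year, month, 1)
                    + datetime.timedelta(days=day - 1))
        return datetime.date(year, month, day)
    except (ValueError, OverflowError):
        return None
```

With this sketch, `parse_date("2004- january-1", "abbreviated", Leniency.NORMAL)` succeeds through the fallback to the wide form, while the same call with `Leniency.ROUNDTRIP` returns `None`. Only `Leniency.LOOSE` accepts "2004-Jan-35".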

Most of these have no real performance cost for input that is accepted today, since we only try alternatives if the parse fails.
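The "only on failure" point can be sketched for the equivalency sets of item 3. This is a hypothetical Python illustration, not ICU code: the function name `match_literal` and the members of each set are assumptions chosen for the example. Input that matches the pattern literal exactly takes the first branch and never touches the sets.

```python
# Hypothetical cross-language equivalency sets; the members are illustrative.
EQUIVALENCE_SETS = [
    frozenset(" \u00a0\u2009\u202f"),   # whitespaces
    frozenset("'\u2019\u02bc"),         # apostrophes
    frozenset(".\u3002\uff0e"),         # periods
]


def equivalent(a, b):
    """True if a and b are identical or share an equivalency set."""
    return a == b or any(a in s and b in s for s in EQUIVALENCE_SETS)


def match_literal(text, pos, literal):
    """Return the index just past `literal` at `pos`, or -1 if no match."""
    # Fast path: exact match, same cost as a strict parser.
    if text.startswith(literal, pos):
        return pos + len(literal)
    # Slow path: reached only after the exact match has failed.
    end = pos + len(literal)
    if end > len(text):
        return -1
    for actual, expected in zip(text[pos:end], literal):
        if not equivalent(actual, expected):
            return -1
    return end
```

For example, matching the literal " h " against text that uses no-break spaces succeeds on the slow path, while a hyphen in place of the space still fails because it is in none of the sets.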

CLDR

  1. For numbers, the idea was to have a UnicodeSet of alternatives that are accepted on input. E.g. for French numbers we currently have:

    <numbers>
    <symbols>
      <decimal>,</decimal>
      <group> </group>
      <list>;</list>
      <percentSign>%</percentSign>
      <nativeZeroDigit>0</nativeZeroDigit>
      <patternDigit>#</patternDigit>
      <plusSign>+</plusSign>
      <minusSign>-</minusSign>
      <exponential>E</exponential>
      <perMille>‰</perMille>
      <infinity>∞</infinity>
      <nan>�</nan>
    </symbols>

    We could allow any item to appear twice, the second time with alternates that are accepted only on parse. E.g.

      <group>\u00A0 </group>
      <group on="parse">[\u0020 ' ’ , ]</group>
     

  2. For dates, it's a bit more complicated, since the separators are in the pattern. This may also interplay with some of Deborah's ideas for generative dates. For string fields, we could use the above, e.g.

    <month type="9">sept.</month>
    <month type="9" on="parse">sep.</month>

    For separators, here are a couple of ideas. We might want to use the first in numbers, instead of what is above. Here is a current example (locale="fr"):

    <timeFormats>
      <default type="medium" />
      <timeFormatLength type="full">
        <timeFormat type="standard">
          <pattern type="standard">HH' h 'mm z</pattern>

    1. Common alternatives. We would add something like the following, which would allow any of the alternatives any time a ' appeared in any time pattern.

        <input><for>'</for><accept>[’ :]</accept></input>
       

    2. Modified patterns. Add some syntax that would be specific to each pattern, e.g.

      <pattern type="standard">HH'[’ :] h '[’ :]mm z</pattern>
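Returning to the on="parse" idea for number symbols in item 1 of this section, the following Python sketch shows how a parser might consume such data: one form per symbol for formatting, plus a set of extra characters accepted on input. The table layout and the names `FRENCH_SYMBOLS`, `accepted` and `parse_number` are hypothetical. The comma from the example set is left out of the group alternates here, because it already serves as the French decimal separator and the text above does not say how that overlap would be resolved.

```python
from decimal import Decimal

# Hypothetical in-memory form of the proposed data: "format" is the symbol
# used for output, "parse" holds extra characters accepted only on input.
FRENCH_SYMBOLS = {
    "decimal": {"format": ",", "parse": set()},
    "group": {"format": "\u00a0", "parse": {" ", "'", "\u2019"}},
}


def accepted(symbols, name):
    """All characters accepted on input for the named symbol."""
    entry = symbols[name]
    return {entry["format"]} | entry["parse"]


def parse_number(text, symbols=FRENCH_SYMBOLS):
    """Return a Decimal, or None if the text is not a plain number."""
    group = accepted(symbols, "group")
    decimal = accepted(symbols, "decimal")
    out = []
    seen_digit = seen_decimal = False
    for ch in text:
        if ch.isdecimal():
            out.append(str(int(ch)))  # also maps non-ASCII decimal digits
            seen_digit = True
        elif ch in decimal and not seen_decimal:
            out.append(".")
            seen_decimal = True
        elif ch in group and seen_digit and not seen_decimal:
            continue  # grouping separators carry no value
        else:
            return None
    return Decimal("".join(out)) if seen_digit else None
```

Under these assumptions, "1 234,5" typed with an ordinary space parses to the same value as the formatted form with U+00A0, while "1.234" is rejected because the period is in neither set.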