Lenient Date/Time/Number Parsing

There is a certain amount of leniency built in for numbers and dates, but not as much as we'd like. For example, here is what happens with some patterns (locale="en"):

pattern="yyyy-MM-dd"	numeric month
"2004-01-01"	exact match
"2004-1-1"	missing zeros
"00002004-000001-00001"	extra zeros
"２００４-１-１"	wide characters
"٢٠٠٤-١-١"	arabic digits
" 2004- 1- 1"	extra space before digits
" 2004 - 1 - 1"	fails: extra space before separators
"2004/1/1"	fails: different separator
"2004-Jan-1"	fails: abbreviated month
"2004- january-1"	fails: full month, lowercase, extra space

pattern="yyyy-MMM-dd"	abbreviated month
"2004-Jan-1"	abbreviated month
"2004- january-1"	full month, lowercase, extra space
"2004-01-01"	fails: numeric

Some of the desired leniency can be achieved with no data changes, but some of it requires additions to CLDR. I thought I'd capture here some ideas that we have had in the past for dealing with this.

Non-CLDR

We should be consistent about allowing extra or missing spaces both after fields and before.
If a field fails to parse, try other data for that field. E.g. when parsing the month, first try what is in the pattern; if that fails, try the others among: numeric; narrow, short, long; stand-alone narrow, short, long.
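The field-level fallback can be sketched roughly as follows. This is only an illustration using the JDK's java.time types rather than ICU; the class name MonthFieldFallback, the method parseMonth, and the convention that a null patternStyle means "the pattern field is numeric" are all assumptions made for the sketch. It only understands ASCII digits, and it rejects a name that matches more than one month (as narrow names often do).

```java
import java.time.Month;
import java.time.format.TextStyle;
import java.util.Locale;
import java.util.Optional;

final class MonthFieldFallback {

    /** Tries the width named by the pattern first, then numeric, then every other width. */
    static Optional<Month> parseMonth(String token, TextStyle patternStyle, Locale locale) {
        String t = token.trim(); // tolerate spaces around the field
        // patternStyle == null is this sketch's way of saying "the pattern field is numeric".
        Optional<Month> m = (patternStyle == null) ? matchNumeric(t) : matchText(t, patternStyle, locale);
        if (m.isPresent()) return m;
        m = matchNumeric(t);
        if (m.isPresent()) return m;
        // TextStyle.values() covers full, short and narrow, each with its stand-alone form.
        for (TextStyle style : TextStyle.values()) {
            if (style == patternStyle) continue;
            m = matchText(t, style, locale);
            if (m.isPresent()) return m;
        }
        return Optional.empty();
    }

    private static Optional<Month> matchNumeric(String t) {
        if (!t.matches("[0-9]{1,2}")) return Optional.empty();
        int value = Integer.parseInt(t);
        return (value >= 1 && value <= 12) ? Optional.of(Month.of(value)) : Optional.empty();
    }

    private static Optional<Month> matchText(String t, TextStyle style, Locale locale) {
        Month found = null;
        for (Month month : Month.values()) {
            if (month.getDisplayName(style, locale).equalsIgnoreCase(t)) {
                if (found != null) return Optional.empty(); // ambiguous name: refuse to guess
                found = month;
            }
        }
        return Optional.ofNullable(found);
    }
}
```

Because the pattern's own width is always tried first, input that parses today takes the same path as before; the extra loops only run after a failure.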
Accept common alternatives. Internally have our own cross-language equivalency sets: {whitespaces}, {apostrophes}, {periods}, etc. If a parse fails, try the other items in the equivalency sets.
- Note: this may also need to go into CLDR, if we don't want to duplicate the information across languages.
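A minimal sketch of the equivalency-set idea, again using JDK classes rather than ICU. The sets shown are tiny and purely illustrative, and the names EquivalenceSets, canonicalize and parseDottedDate are hypothetical. The sketch assumes the pattern itself uses the first (canonical) member of each set, so a failed parse is retried once on canonicalized input.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class EquivalenceSets {
    // Illustrative sets only; the first character of each string is the canonical form.
    private static final String[] SETS = {
        " \u00A0\u2009\u202F", // whitespace: space, no-break space, thin space, narrow no-break space
        "'\u2019\u02BC",       // apostrophes: ASCII, right single quote, modifier letter apostrophe
        ".\uFF0E\u2024"        // periods: full stop, fullwidth full stop, one dot leader
    };
    private static final Map<Character, Character> CANONICAL = new HashMap<>();
    static {
        for (String set : SETS) {
            for (int i = 1; i < set.length(); i++) {
                CANONICAL.put(set.charAt(i), set.charAt(0));
            }
        }
    }
    private static final DateTimeFormatter DOTTED = DateTimeFormatter.ofPattern("d.M.uuuu");

    /** Replaces every character that belongs to a set with that set's canonical member. */
    static String canonicalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = CANONICAL.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    /** Parses as-is first; only on failure retries with canonicalized input. */
    static Optional<LocalDate> parseDottedDate(String text) {
        Optional<LocalDate> first = tryParse(text);
        if (first.isPresent()) return first;
        String canonical = canonicalize(text);
        return canonical.equals(text) ? first : tryParse(canonical);
    }

    private static Optional<LocalDate> tryParse(String text) {
        try {
            return Optional.of(LocalDate.parse(text, DOTTED));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

With this, "1．1．2004" (fullwidth full stops) parses the same as "1.1.2004", while "1/1/2004" still fails because "/" is not in any set.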
If a datetime format fails, try the other complete formats to see if any of them work. E.g.
- For dates, try full, long, medium, short; date, time, and datetime.
- For numbers, try plain, number, integer, and scientific.
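For the date case, trying the other complete formats might look like the sketch below. It uses the JDK's localized java.time formatters as a stand-in for ICU's, covers only the date styles (not time or datetime), and the names FallbackDateParser and parse are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.FormatStyle;
import java.util.Locale;
import java.util.Optional;

final class FallbackDateParser {

    /** Tries the preferred style first, then the remaining styles in declaration order. */
    static Optional<LocalDate> parse(String text, FormatStyle preferred, Locale locale) {
        Optional<LocalDate> result = tryStyle(text, preferred, locale);
        if (result.isPresent()) return result;
        for (FormatStyle style : FormatStyle.values()) { // FULL, LONG, MEDIUM, SHORT
            if (style == preferred) continue;
            result = tryStyle(text, style, locale);
            if (result.isPresent()) return result;
        }
        return Optional.empty();
    }

    private static Optional<LocalDate> tryStyle(String text, FormatStyle style, Locale locale) {
        try {
            DateTimeFormatter formatter = DateTimeFormatter.ofLocalizedDate(style).withLocale(locale);
            return Optional.of(LocalDate.parse(text, formatter));
        } catch (DateTimeParseException e) {
            return Optional.empty(); // this style does not match; the caller moves on
        }
    }
}
```

For example, FallbackDateParser.parse("January 1, 2004", FormatStyle.SHORT, Locale.US) fails on the short format and then succeeds on the long one.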
To control this, add Format.set/getLeniency(). Probably 3 states:
1. roundtrip (only parse exactly what formats)
2. normal (as above: flexible on the characters, trying other formats)
3. loose (allow out-of-range values: e.g. Jan 35 = Feb 4)
We currently combine #2 and #3 for dates (see setLenient), but I think we want to separate them. We also want to be able to control leniency on numbers, so moving it up to the Format superclass makes sense (to me at least!).
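To make the three states concrete, here is a rough sketch for a single yyyy-MM-dd style date. It is not the proposed Format.set/getLeniency() API: the Leniency enum and the LeniencySketch class are hypothetical, and the JDK's java.time ResolverStyle is used only as an approximate analogue of the "normal" versus "loose" distinction.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Optional;

final class LeniencySketch {
    /** Hypothetical three-state setting mirroring the list above. */
    enum Leniency { ROUNDTRIP, NORMAL, LOOSE }

    private static final DateTimeFormatter EXACT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.STRICT);
    private static final DateTimeFormatter FLEXIBLE = DateTimeFormatter.ofPattern("uuuu-M-d");

    static Optional<LocalDate> parse(String text, Leniency leniency) {
        try {
            switch (leniency) {
                case ROUNDTRIP: {
                    // Accept only input that formatting would have produced.
                    LocalDate date = LocalDate.parse(text, EXACT);
                    return EXACT.format(date).equals(text) ? Optional.of(date) : Optional.empty();
                }
                case NORMAL:
                    // Flexible on field width and surrounding spaces, but values must be in range.
                    return Optional.of(LocalDate.parse(text.trim(),
                            FLEXIBLE.withResolverStyle(ResolverStyle.SMART)));
                default:
                    // LOOSE: out-of-range values roll over, e.g. Jan 35 becomes Feb 4.
                    return Optional.of(LocalDate.parse(text.trim(),
                            FLEXIBLE.withResolverStyle(ResolverStyle.LENIENT)));
            }
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

Under these assumptions, "2004-1-1" is rejected by ROUNDTRIP but accepted by NORMAL, and "2004-01-35" is rejected by NORMAL but parsed as 2004-02-04 by LOOSE.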

Most of these have no real performance cost for currently acceptable input, since we only try alternatives if the parse fails.

CLDR

For numbers, the idea was to have a UnicodeSet of alternatives that are accepted on input. E.g. for French numbers we currently have:

<numbers>
<symbols>
<decimal>,</decimal>
<group> </group>
<list>;</list>
<percentSign>%</percentSign>
<nativeZeroDigit>0</nativeZeroDigit>
<patternDigit>#</patternDigit>
<plusSign>+</plusSign>
<minusSign>-</minusSign>
<exponential>E</exponential>
<perMille>‰</perMille>
<infinity>∞</infinity>
<nan>�</nan>
</symbols>

We could allow any item to appear twice, with the second occurrence listing alternates that are accepted on parse. E.g.

<group>\u00A0 </group>
<group on="parse">[\u0020 ' ’ , ]</group>
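A parser might consume such a parse-only set roughly as sketched below. The sketch uses the JDK's DecimalFormat rather than ICU, and the class name FrenchGroupParser and the ACCEPTED_GROUP string are assumptions. ACCEPTED_GROUP is a hand-picked subset of the set above: the comma is left out because it is the French decimal separator and would make the retry ambiguous in this simple form.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.util.Locale;
import java.util.OptionalDouble;

final class FrenchGroupParser {
    // Assumed parse-only alternatives for the grouping separator:
    // space, no-break space, narrow no-break space, apostrophe, right single quote.
    private static final String ACCEPTED_GROUP = " \u00A0\u202F'\u2019";

    static OptionalDouble parse(String text) {
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.FRENCH);
        DecimalFormat format = new DecimalFormat("#,##0.###", symbols);

        // First try the input exactly as given.
        OptionalDouble direct = tryParse(format, text);
        if (direct.isPresent()) return direct;

        // On failure, map every accepted alternative to the locale's own grouping separator and retry.
        char group = symbols.getGroupingSeparator();
        StringBuilder normalized = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            normalized.append(ACCEPTED_GROUP.indexOf(c) >= 0 ? group : c);
        }
        return tryParse(format, normalized.toString());
    }

    /** Succeeds only if the whole string is consumed. */
    private static OptionalDouble tryParse(DecimalFormat format, String text) {
        ParsePosition pos = new ParsePosition(0);
        Number number = format.parse(text, pos);
        if (number == null || pos.getIndex() != text.length()) return OptionalDouble.empty();
        return OptionalDouble.of(number.doubleValue());
    }
}
```

Reading the grouping separator from the locale's symbols, instead of hard-coding it, keeps the sketch independent of which no-break space a given JDK's French data uses.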
For dates, it's a bit more complicated, since the separators are in the pattern. This may also interact with some of Deborah's ideas for generative dates. For string fields, we could use the above, e.g.
<month type="9">sept.</month>
<month type="9" on="parse">sep.</month>

For separators, here are a couple of ideas. We might want to use the first in numbers, instead of what is above. Here is a current example (locale="fr"):

<timeFormats>
<default type="medium" />
<timeFormatLength type="full">
<timeFormat type="standard">
<pattern type="standard">HH' h 'mm z</pattern>
1. Common alternatives. We would add something like the following, which would allow any of the alternatives any time a ' appeared in any time pattern.
  <input><for>'</for><accept>[’ :]</accept></input>
2. Modified patterns. Add some syntax that would be specific to each pattern, e.g.
  <pattern type="standard">HH'[’ :] h '[’ :]mm z</pattern>