There is a certain amount of leniency built in for numbers and dates, but not as much as we'd like. For example, here is what happens with some patterns (locale="en"):
pattern="yyyy-MM-dd" | numeric month |
"2004-01-01" | exact match |
"2004-1-1" | missing zeros |
"00002004-000001-00001" | extra zeros |
"2004-1-1" | wide characters |
"٢٠٠٤-١-١" | arabic digits |
" 2004- 1- 1" | extra space before digits |
" 2004 - 1 - 1" | fails: extra space before separators |
pattern="yyyy-MMM-dd" | abbreviated month |
"2004-Jan-1" | abbreviated month |
"2004- january-1" | full month, lowercase, extra space |
"2004-01-01" | fails: numeric |
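Several of the numeric cases in the first table (extra zeros, non-ASCII digits, stray spaces) can be handled by normalizing the input before parsing. The following is only a minimal sketch of that idea, not what ICU does: the class `DateInputNormalizer` and its methods are hypothetical names, and dropping all whitespace is an assumption made here so that the "space before separators" case is accepted too.

```java
import java.time.LocalDate;

class DateInputNormalizer {
    // Hypothetical helper: maps any Unicode decimal digit to its ASCII
    // equivalent and drops whitespace (an assumption of this sketch).
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
                return; // skip spaces wherever they occur
            }
            int digit = Character.digit(cp, 10);
            if (digit >= 0) {
                out.append((char) ('0' + digit)); // e.g. U+0662 becomes '2'
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Parses a year-month-day input with numeric fields. Integer.parseInt
    // ignores leading zeros, so "00002004" is read as 2004.
    static LocalDate parseLenient(String input) {
        String[] fields = normalize(input).split("-");
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected year-month-day: " + input);
        }
        return LocalDate.of(
                Integer.parseInt(fields[0]),
                Integer.parseInt(fields[1]),
                Integer.parseInt(fields[2]));
    }
}
```

With this sketch, `DateInputNormalizer.parseLenient(" 2004 - 1 - 1")` and `DateInputNormalizer.parseLenient("00002004-000001-00001")` both yield the same `LocalDate` as the exact match "2004-01-01".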
Some of the desired leniency can be achieved with no data changes, but the rest needs additions to CLDR. I thought I'd capture here some ideas that we have had in the past for dealing with this.
We currently combine #2 and #3 for dates (see setLenient), but I think we want to separate them, and we want to be able to control leniency on numbers too, so moving it up to the Format superclass makes sense (to me at least!).
Most of these have no real performance cost for currently acceptable input, since we only try alternatives if the parse fails.
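The "only try alternatives if the parse fails" approach can be sketched as follows. This is an illustration using `java.time`, not the ICU implementation; the class `FallbackDateParser` and the choice of alternate patterns are assumptions made for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class FallbackDateParser {
    // The exact pattern is always tried first, so input that is
    // acceptable today takes the same path it always did.
    private static final DateTimeFormatter STRICT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd");

    // Hypothetical alternates, consulted only after the strict parse fails.
    private static final DateTimeFormatter[] ALTERNATES = {
            DateTimeFormatter.ofPattern("uuuu-M-d")
    };

    static LocalDate parse(String input) {
        try {
            return LocalDate.parse(input, STRICT);
        } catch (DateTimeParseException strictFailure) {
            for (DateTimeFormatter alternate : ALTERNATES) {
                try {
                    return LocalDate.parse(input, alternate);
                } catch (DateTimeParseException ignored) {
                    // fall through to the next alternate
                }
            }
            // No alternate matched: report the original failure.
            throw strictFailure;
        }
    }
}
```

The extra work is paid only on the failure path, which is input that would have been rejected anyway.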
For numbers, the idea was to have a UnicodeSet of alternatives that are accepted on input. E.g. for French numbers we currently have:
<numbers>
<symbols>
<decimal>,</decimal>
<group> </group>
<list>;</list>
<percentSign>%</percentSign>
<nativeZeroDigit>0</nativeZeroDigit>
<patternDigit>#</patternDigit>
<plusSign>+</plusSign>
<minusSign>-</minusSign>
<exponential>E</exponential>
<perMille>‰</perMille>
<infinity>∞</infinity>
<nan>�</nan>
</symbols>
We could allow any item to appear twice, the second time with alternates that are accepted only on parsing. E.g.
<group>\u00A0 </group>
<group on="parse">[\u0020 ' ’ , ]</group>
<month type="9">sept.</month>
<month type="9" on="parse">sep.</month>
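To make the parse-only alternates concrete, here is a rough sketch of how a parser might consume such data. Everything in it is an assumption for illustration: the class `ParseAlternates`, the hard-coded alternates (which would come from the locale data), and the use of `java.text.DecimalFormat` with explicitly set symbols. The comma is left out of the group alternates here because this sketch uses it as the decimal separator.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

class ParseAlternates {
    // Canonical group separator, as used for formatting.
    static final char GROUP = '\u00A0';
    // Hypothetical parse-only alternates: space, apostrophe, right single quote.
    static final String GROUP_ALTERNATES = "\u0020'\u2019";
    // Hypothetical parse-only month alias mapped to the canonical form.
    static final Map<String, String> MONTH_ALTERNATES =
            Collections.singletonMap("sep.", "sept.");

    static String canonicalMonth(String token) {
        return MONTH_ALTERNATES.getOrDefault(token, token);
    }

    static Number parseNumber(String input) {
        // Rewrite every accepted alternate to the canonical separator.
        StringBuilder canonical = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            canonical.append(GROUP_ALTERNATES.indexOf(c) >= 0 ? GROUP : c);
        }
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(',');
        symbols.setGroupingSeparator(GROUP);
        DecimalFormat format = new DecimalFormat("#,##0.###", symbols);

        // Require the whole input to be consumed, not just a prefix.
        ParsePosition position = new ParsePosition(0);
        Number result = format.parse(canonical.toString(), position);
        if (result == null || position.getIndex() != canonical.length()) {
            throw new IllegalArgumentException("unparseable number: " + input);
        }
        return result;
    }
}
```

Formatting is unaffected in this sketch: only the canonical separator and the canonical month abbreviation are ever produced, while the alternates widen what is accepted on input.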
For separators, here are a couple of ideas. We might want to use the first for numbers as well, instead of what is above. Here is a current example (locale="fr"):
<timeFormats>
<default type="medium" />
<timeFormatLength type="full">
<timeFormat type="standard">
<pattern type="standard">HH' h 'mm z</pattern>
<input><for>'</for><accept>[’ :]</accept>
<pattern type="standard">HH'[’ :] h '[’ :]mm z</pattern>