Re: Processing Digit Variants

From: Steven R. Loomis <srl_at_icu-project.org>
Date: Tue, 19 Mar 2013 22:13:50 -0700

On Tuesday, March 19, 2013, Richard Wordingham wrote:

> On Mon, 18 Mar 2013 17:28:30 -0700
> "Steven R. Loomis" <srl_at_icu-project.org> wrote:
> > On Monday, March 18, 2013, Richard Wordingham wrote:
>
> > > The issue is rather with emphatically plain text <U+0031, U+FE0E,
> > > U+0032, U+FE0E>.
>
> > It's the same situation as in something like an implementation of LDML
> > number parsing. U+FE0E is not part of a number.
>
> I agree that the same arguments are applicable to both parsing and
> collating, though not necessarily with equal force.
>
> Formally, <U+0031, U+FE0E, U+0032, U+FE0E> seems to be just as much a
> number as <U+FF11 FULLWIDTH DIGIT ONE, U+FF12 FULLWIDTH DIGIT TWO>,
> which the current LDML semantics do treat on an even footing with
> "12". If the emoji digits had been encoded as new characters, ICU
> would support them without batting an eyelid. Because the difference
> does not merit full characterhood, they are encoded by a sequence
> rather than a single character. Remember, all that U+FE0E does is to
> request a particular glyph. In a sense, we have 20 new decimal digits,
> <U+0030, U+FE0E> to <U+0039, U+FE0E> and <U+0030, U+FE0F> to <U+0039,
> U+FE0F>.
>
> So, why do you consider <U+0031, U+FE0E, U+0032, U+FE0E> not to be
> a valid decimal number?
>

Richard,
 For parsing, it's pretty simple: U+0031 has a Unicode digit value; U+FE0E
does not. (Nor is it part of the defined numbering systems in LDML; see
http://unicode.org/reports/tr35/#Numbering
System Data.)
 So U+FE0E ends the digit sequence; it is not part of the number. End of
parsing.
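
To make the rule above concrete, here is a minimal Python sketch of "consume
characters while they have a decimal digit value, stop at the first one that
does not". It is only an illustration, not ICU's parser: the function name
parse_leading_decimal is hypothetical, and it leans on Python's
unicodedata.decimal() rather than on LDML numbering-system data, so unlike
LDML it does not restrict a number to a single numbering system.

```python
import unicodedata


def parse_leading_decimal(text):
    """Parse the longest leading run of decimal digits in `text`.

    Returns (value, chars_consumed); value is None when `text` does not
    start with a decimal digit. Hypothetical helper, not an ICU API.
    """
    value = None
    consumed = 0
    for ch in text:
        # unicodedata.decimal() returns the default when the character has
        # no decimal digit value, which is the case for U+FE0E.
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            break  # first non-digit ends the number
        value = digit if value is None else value * 10 + digit
        consumed += 1
    return value, consumed


plain = parse_leading_decimal("12")
fullwidth = parse_leading_decimal("\uFF11\uFF12")
with_selectors = parse_leading_decimal("1\uFE0E2\uFE0E")
print(plain, fullwidth, with_selectors)
```

Under these assumptions the fullwidth digits parse as a two-digit number just
like "12", while parsing of <U+0031, U+FE0E, U+0032, U+FE0E> stops after the
first character.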

> > > 10<ZWJ>0<ZWJ>0 would be perfectly reasonable for text
> > > likely to be rendered by a cursive Latin font
>

It's not reasonable for numeric parsing, however.

> > Identifying such an edge case does not prove that numeric tailoring is
> > broken.
>
> An 'edge case' is often just a case that shows that an algorithm that
> often works has not been thought through thoroughly. Now, as CLDR
> seems to value speed above perfect correctness, perhaps handling
> variation sequences will be rejected on that basis. All I was trying
> to find out on this list was whether <U+0031, U+FE0E, U+0032, U+FE0E>
> should be regarded as a proper number.
>

I would say that <U+0031, U+FE0E> is not a proper or correct number for
purposes of parsing. In the ICU implementation, numeric collation uses
numeric parsing to determine ordering.

In practice in ICU, for what it's worth, "U+0031, U+FE0E, U+0031" sorts
before "U+0031, U+0030".
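
The ordering above follows from the digit run being cut at U+FE0E. A toy
Python sketch shows the effect; it is not ICU's collation algorithm. The name
numeric_sort_key and the key layout are assumptions made for illustration:
digit runs become integers, every other character is kept as-is, and numbers
are arbitrarily placed before non-digits. Real collation uses collation
weights and may treat U+FE0E as ignorable at some levels.

```python
import unicodedata


def numeric_sort_key(text):
    """Build a toy sort key: digit runs compare by numeric value.

    Each key element is (0, int) for a digit run or (1, char) for any
    other character. Hypothetical helper, not an ICU API.
    """
    key = []
    i = 0
    while i < len(text):
        if unicodedata.decimal(text[i], None) is not None:
            value = 0
            # Accumulate the whole run of decimal digits into one integer.
            while i < len(text) and unicodedata.decimal(text[i], None) is not None:
                value = value * 10 + unicodedata.decimal(text[i])
                i += 1
            key.append((0, value))
        else:
            # Any non-digit, including U+FE0E, ends the current number.
            key.append((1, text[i]))
            i += 1
    return key


samples = ["10", "1\uFE0E1", "9", "11"]
ordered = sorted(samples, key=numeric_sort_key)
print([s.encode("unicode_escape").decode("ascii") for s in ordered])
```

In this sketch <U+0031, U+FE0E, U+0031> is keyed as the number 1, a
non-digit, and another 1, so it sorts ahead of "9", "10" and "11" rather than
alongside "11".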

> Special characters intended for just one aspect of text processing
> should not affect other aspects. Unfortunately, a parametric tailoring
> to ignore irrelevant characters while complying with the UCA is not
> quite as simple as just ignoring them. The issues arise with the
> blocking of discontiguous contractions and the possibility that, for
> example, one might wish to collate character variants differently. On
> the other hand, ignoring variation selectors by default might be
> excusable, for they should not occur where they might block canonical
> reordering (antepenultimate paragraph of TUS 6.2.0 Section 16.4).
>
> Richard.
>
Received on Wed Mar 20 2013 - 00:17:41 CDT
