Re: Processing Digit Variants

From: Steven R. Loomis <srl_at_icu-project.org>
Date: Mon, 18 Mar 2013 17:28:30 -0700

Richard,

On Monday, March 18, 2013, Richard Wordingham wrote:

> On Mon, 18 Mar 2013 21:07:27 +0000
> "Whistler, Ken" <ken.whistler_at_sap.com> wrote:
> > It seems to me that the more
> > significant issue here would be whether the enclosing combining marks
> > are present, whether or not any variation selectors are present. So:
> >
> > <U+0031, U+20E3, U+0032, U+20E3>
> >
> > Isn't much different, for this purpose, than:
> >
> > <U+0031, U+FE0F, U+20E3, U+0032, U+FE0F, U+20E3>
> >
> > I wouldn't really expect most processes to recognize either of those
> > sequences as "a number" for parsing purposes.
>
> Nor I, as they're not much closer than <U+2460 CIRCLED DIGIT ONE,
> U+2461 CIRCLED DIGIT TWO>. That is why I wondered if one could argue
> that their use in multi-digit numbers was not playing the game, and
> therefore one should not be surprised if things went wrong.
>

It's not much different from
<U+0031, U+26C4, U+0032>
or
<U+0031, U+0620, U+0032>

either. When parsed as a number, only the leading <U+0031> is a number;
parsing would stop after that code point.
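
To illustrate the "parsing would stop" behavior, here is a minimal Python
sketch. It is not ICU or an LDML implementation; parse_leading_number is a
hypothetical helper that only consumes characters with a Unicode decimal
digit value and ignores grouping separators, signs, and locale data.

```python
import unicodedata


def parse_leading_number(text):
    """Consume decimal digits from the start of text.

    Hypothetical helper for illustration only. Returns (value, consumed),
    where value is None if the text does not start with a decimal digit.
    """
    value = 0
    consumed = 0
    for ch in text:
        # unicodedata.decimal returns the default for non-decimal characters
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            break  # first non-digit ends the number
        value = value * 10 + digit
        consumed += 1
    return (value if consumed else None), consumed


# Sequences from this thread: each stops after the first code point.
keycaps = "\u0031\u20e3\u0032\u20e3"   # <U+0031, U+20E3, U+0032, U+20E3>
snowman = "\u0031\u26c4\u0032"         # <U+0031, U+26C4, U+0032>
letter = "\u0031\u0620\u0032"          # <U+0031, U+0620, U+0032>
for sample in (keycaps, snowman, letter):
    print(parse_leading_number(sample))
```

Under these assumptions, plain "12" is consumed whole, while every sequence
above yields only the first digit.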

> The issue is rather with emphatically plain text <U+0031, U+FE0E,
> U+0032, U+FE0E>.
>

The situation is the same for something like an implementation of LDML
number parsing: U+FE0E is not part of a number.
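
As a rough check of that claim, the character properties can be inspected
with Python's standard unicodedata module. This only shows what the Unicode
Character Database says about U+FE0E; describe is a hypothetical helper, and
nothing here models how an LDML parser is actually written.

```python
import unicodedata


def describe(ch):
    """Return a few UCD properties for one character (illustrative helper)."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        # None means the character has no decimal digit value
        "decimal": unicodedata.decimal(ch, None),
    }


print(describe("\u0031"))  # DIGIT ONE
print(describe("\ufe0e"))  # VARIATION SELECTOR-15
```

U+FE0E is a nonspacing mark (Mn) with no decimal digit value, so a parser
driven by digit properties has no reason to accept it.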

> > 123456 versus 123<ZWJ>456 versus 123<LRM>456
>
> LRM is misplaced if not totally pointless, but in general ZWJ is a fair
> point. So, numeric tailoring (one of the standard UCA parametric
> tailoring options, remember) was already potentially broken.
> 10<ZWJ>0<ZWJ>0 would be perfectly reasonable for text likely to be
> rendered by a cursive Latin font.
>

Identifying such an edge case does not prove that numeric tailoring is
broken.

-s
Received on Mon Mar 18 2013 - 19:30:06 CDT
