Re: Processing Digit Variants

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Mon, 18 Mar 2013 23:06:11 +0000

On Mon, 18 Mar 2013 21:07:27 +0000
"Whistler, Ken" <ken.whistler_at_sap.com> wrote:

> Richard Wordingham wrote:
>
> > European digits (U+0030 to U+0039) may, since Unicode 6.1.0, be used
> > with variation selectors. As their primary purpose is for use with
> > U+20E3 COMBINING ENCLOSING KEYCAP, is it legitimate to fail to
> > recognise strings of digits with variation selectors as representing
> > numbers?
 
> Legitimate for *what*?

A process claiming to handle numbers, without being more specific. The
primary purpose of variation selectors is to choose the right glyph, so
one might expect the variation selectors not to affect collation.

> Recognizing "numbers" is really outside the scope of the Unicode
> Standard, although admittedly it is not outside the scope of LDML,
> which does need to recognize numeric formats for localization.

> By default, at least, the presence of variation selectors shouldn't
> affect searching or collation.

They break contractions, whatever Philippe Verdy may think. Moreover,
that may not be true of compatibility ideographs encoded as unified
ideograph plus variation selector - one might want the variation
selector to affect searching and collation.

> It seems to me that the more
> significant issue here would be whether the enclosing combining marks
> are present, whether or not any variation selectors are present. So:
>
> <U+0031, U+20E3, U+0032, U+20E3>
>
> Isn't much different, for this purpose, than:
>
> <U+0031, U+FE0F, U+20E3, U+0032, U+FE0F, U+20E3>
>
> I wouldn't really expect most processes to recognize either of those
> sequences as "a number" for parsing purposes.

Nor I, as they're not much closer to a number than <U+2460 CIRCLED
DIGIT ONE, U+2461 CIRCLED DIGIT TWO>. That is why I wondered if one
could argue
that their use in multi-digit numbers was not playing the game, and
therefore one should not be surprised if things went wrong.

The issue is rather with emphatically plain text <U+0031, U+FE0E,
U+0032, U+FE0E>.
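
To make the plain-text case concrete, here is a minimal Python sketch
of how a digits-only parser reacts to that sequence. It is only an
illustration: Python's built-in int() stands in for "a process
claiming to handle numbers", and naive_parse is a hypothetical helper
name, not anything defined by Unicode or LDML.

```python
import unicodedata

plain = "12"
text_style = "1\uFE0E2\uFE0E"   # <U+0031, U+FE0E, U+0032, U+FE0E>

def naive_parse(s):
    """Return the integer value of s, or None if int() rejects it."""
    try:
        return int(s)
    except ValueError:
        return None

# VS15 has General_Category Mn (nonspacing mark), not Cf (format),
# so a filter that only drops Cf characters would not remove it.
vs15_category = unicodedata.category("\uFE0E")
```

Here naive_parse(plain) yields 12, while naive_parse(text_style)
yields None, although the two strings differ only in requested glyph
style.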

> But if your issue here is worrying about whether the presence of
> variation selectors would screw up collation with numeric tailoring,
> it seems to me that is really an extreme edge case of an edge case,
> anyway.
>
> My expectation would be, rather, that if you are planning to do
> anything really significant with numbers, you'd have to have a fully
> tokenizing parser, anyway, at which point you assign some appropriate
> numeric value to your token and do something significant with it
> thereafter. Modifying such a parser to either a) ignore the presence
> of variation selectors (or any other format control characters), or
> b) treat the presence of variation selectors (or any other format
> control characters) as an error, ought to be relatively routine.

So one might reasonably amend the parsing rules to ignore variation
selectors and other format control characters.
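
A rough Python sketch of the two options Ken describes, under stated
assumptions: the names is_ignorable and parse_number are hypothetical,
"ignorable" is taken to mean General_Category Cf plus the two
variation selector blocks (U+FE00..U+FE0F and U+E0100..U+E01EF), and
the Mongolian free variation selectors are deliberately left out. It
is not a description of how ICU or any LDML implementation parses
numbers.

```python
import unicodedata

def is_ignorable(ch):
    """True for format controls (Cf) and variation selectors.

    Variation selectors are General_Category Mn, so they need an
    explicit range check in addition to the Cf test.
    """
    cp = ord(ch)
    return (
        unicodedata.category(ch) == "Cf"
        or 0xFE00 <= cp <= 0xFE0F      # VS1..VS16
        or 0xE0100 <= cp <= 0xE01EF    # VS17..VS256
    )

def parse_number(token, strict=False):
    """Parse a run of decimal digits.

    strict=False: option (a), silently skip ignorable characters.
    strict=True:  option (b), treat them as an error.
    """
    digits = []
    for ch in token:
        if is_ignorable(ch):
            if strict:
                raise ValueError(f"unexpected U+{ord(ch):04X} in number")
            continue
        digits.append(ch)
    # int() rejects anything else, e.g. U+20E3 COMBINING ENCLOSING KEYCAP.
    return int("".join(digits))
```

With this, parse_number("1\uFE0E2\uFE0E") and
parse_number("123\u200D456") both succeed, while the keycap sequence
<U+0031, U+20E3, U+0032, U+20E3> is still rejected, since U+20E3 is
an enclosing mark and not covered by the ignorable test.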

> If, on the other hand, you are doing "numeric tailoring" for
> collation by using ICU-style tailoring rules and expecting the string
> comparison routine for collation to produce numerical ordering, I
> would think that is likely to not be very robust, and could be
> complicated by all sorts of format issues, not merely the presence or
> absence of variation selectors.

> I don't see this as fundamentally any different for the variation
> selectors than it would be for other format controls. So what would
> you be doing, for example, with "numbers" like the following:

> 123456 versus 123<ZWJ>456 versus 123<LRM>456

LRM is misplaced if not totally pointless, but in general ZWJ is a fair
point. So, numeric tailoring (one of the standard UCA parametric
tailoring options, remember) was already potentially broken.
10<ZWJ>0<ZWJ>0 would be perfectly reasonable for text likely to be
rendered by a cursive Latin font.
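
As an illustration of what would be needed for numeric ordering to
survive such characters, here is a small Python sketch of a numeric
sort key that drops them before looking for digit runs. It is plain
Python comparison, not the UCA or an ICU tailoring; strip_ignorables
and numeric_key are hypothetical names, and the set of characters
removed (General_Category Cf plus the variation selector blocks) is
my assumption.

```python
import re
import unicodedata

def strip_ignorables(s):
    """Remove format controls (Cf) and variation selectors from s."""
    return "".join(
        ch for ch in s
        if unicodedata.category(ch) != "Cf"
        and not 0xFE00 <= ord(ch) <= 0xFE0F
        and not 0xE0100 <= ord(ch) <= 0xE01EF
    )

def numeric_key(s):
    """Build a sort key in which digit runs compare by numeric value."""
    key = []
    # With one capture group, re.split alternates text and digit runs:
    # even indices are other text, odd indices are digit runs.
    for i, part in enumerate(re.split(r"(\d+)", strip_ignorables(s))):
        if not part:
            continue
        if i % 2:
            key.append((0, int(part), ""))
        else:
            key.append((1, 0, part))
    return key
```

For example, sorted(names, key=numeric_key) treats 10<ZWJ>0<ZWJ>0 as
the number 1000. Without the stripping step, each ZWJ would split the
digit run, and the string would be keyed as 10, then 0, then 0.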

Richard.
Received on Mon Mar 18 2013 - 18:09:53 CDT
