RE: Processing Digit Variants from Whistler, Ken on 2013-03-18 (Unicode Mail List Archive)

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Mon, 18 Mar 2013 21:07:27 +0000

Richard Wordingham wrote:

> European digits (U+0030 to U+0039) may, since Unicode 6.1.0, be used
> with variation selectors. As their primary purpose is for use with
> U+20E3 COMBINING ENCLOSING KEYCAP, is it legitimate to fail to
> recognise strings of digits with variation selectors as representing
> numbers?

Legitimate for *what*?

I suppose you could say that if a process claims to recognize strings of
digits with variation selectors as representing numbers, then it would
not be legitimate for that process to fail to do so.

Conversely, if a process does not claim to recognize strings of digits
with variation selectors as representing numbers, then it would be
legitimate (and expected) for that process to fail to do so.

Recognizing "numbers" is really outside the scope of the Unicode Standard,
although admittedly it is not outside the scope of LDML, which does
need to recognize numeric formats for localization.

>
> If not, it seems that I will have to raise this as an issue for LDML,
> as it affects parsing and collation with the numeric tailoring.

By default, at least, the presence of variation selectors shouldn't affect
searching or collation. It seems to me that the more significant issue
here would be whether the enclosing combining marks are present,
whether or not any variation selectors are present. So:

<U+0031, U+20E3, U+0032, U+20E3>

Isn't much different, for this purpose, than:

<U+0031, U+FE0F, U+20E3, U+0032, U+FE0F, U+20E3>

I wouldn't really expect most processes to recognize either of those
sequences as "a number" for parsing purposes.
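As a rough illustration of that expectation, here is a small Python
sketch that feeds both sequences to the built-in int() and lists the
General_Category of each code point. The names KEYCAPS_PLAIN,
KEYCAPS_EMOJI, parses_as_int and describe are made up for this example,
and Python's int() is only standing in for "some process"; other
parsers may behave differently.

```python
import unicodedata

# The two keycap sequences from the message, with and without U+FE0F.
KEYCAPS_PLAIN = "\u0031\u20E3\u0032\u20E3"
KEYCAPS_EMOJI = "\u0031\uFE0F\u20E3\u0032\uFE0F\u20E3"


def parses_as_int(s):
    """Return True if Python's built-in int() accepts the string."""
    try:
        int(s)
        return True
    except ValueError:
        return False


def describe(s):
    """List (code point, General_Category) pairs for a string."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c)) for c in s]


if __name__ == "__main__":
    for label, seq in (("plain", KEYCAPS_PLAIN), ("emoji", KEYCAPS_EMOJI)):
        print(label, parses_as_int(seq), describe(seq))
```

Either way it is the enclosing mark (category Me) that already stops the
parse; the variation selector (category Mn) only adds a second obstacle.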

But if your issue here is worrying about whether the presence of
variation selectors would screw up collation with numeric tailoring, it
seems to me that is really an extreme edge case of an edge case, anyway.

My expectation would be, rather, that if you are planning to do anything
really significant with numbers, you'd have to have a fully tokenizing
parser, anyway, at which point you assign some appropriate numeric
value to your token and do something significant with it thereafter.
Modifying such a parser to either a) ignore the presence of
variation selectors (or any other format control characters), or b) treat
the presence of variation selectors (or any other format control
characters) as an error, ought to be relatively routine.
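A minimal sketch of the two policies, in Python, might look like the
following. Everything here is an assumption made for illustration: the
function names, the "ignore"/"error" policy flag, and the simplified
test for what counts as ignorable (the two main variation selector
ranges plus General_Category Cf, not the full
Default_Ignorable_Code_Point property). A real tokenizer would also
deal with signs, separators and so on.

```python
import unicodedata


def is_ignorable(ch):
    """Simplified test: variation selectors or format controls (Cf)."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF:
        return True  # VS1..VS16 and VS17..VS256
    return unicodedata.category(ch) == "Cf"


def parse_number(text, policy="ignore"):
    """Parse a run of decimal digits into an int.

    policy="ignore": skip variation selectors and format controls.
    policy="error":  reject any string that contains them.
    """
    digits = []
    for ch in text:
        if is_ignorable(ch):
            if policy == "error":
                raise ValueError(f"unexpected U+{ord(ch):04X} in number")
            continue
        if unicodedata.category(ch) != "Nd":
            raise ValueError(f"U+{ord(ch):04X} is not a decimal digit")
        digits.append(str(unicodedata.decimal(ch)))
    if not digits:
        raise ValueError("no digits found")
    return int("".join(digits))
```

With policy="ignore", "1<VS16>2<VS16>" and "123<ZWJ>456" come out as 12
and 123456; with policy="error" both are rejected. The keycap sequences
are rejected under either policy, because U+20E3 is neither a digit nor
ignorable in this sketch.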

If, on the other hand, you are doing "numeric tailoring" for collation by
using ICU-style tailoring rules and expecting the string comparison
routine for collation to produce numerical ordering, I would think that
is likely to not be very robust, and could be complicated by all sorts
of format issues, not merely the presence or absence of variation
selectors.

I don't see this as fundamentally any different for the variation
selectors than it would be for other format controls. So what would you
be doing, for example, with "numbers" like the following:

123456 versus 123<ZWJ>456 versus 123<LRM>456
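To make the question concrete, here is a naive "numeric" sort key in
Python. It is not ICU's numeric tailoring and makes no attempt to model
it; naive_numeric_key is a hypothetical helper that simply treats each
maximal run of digits as one integer, which is enough to show how an
embedded format control can split a number in two.

```python
import re


def naive_numeric_key(s):
    """Sort key: digit runs compare as integers, other text as strings."""
    key = []
    # With a capturing group, re.split puts digit runs at odd indices.
    for i, part in enumerate(re.split(r"(\d+)", s)):
        if not part:
            continue
        if i % 2:
            key.append((0, int(part), ""))
        else:
            key.append((1, 0, part))
    return key


PLAIN = "123456"
WITH_ZWJ = "123\u200D456"  # 123<ZWJ>456
WITH_LRM = "123\u200E456"  # 123<LRM>456

if __name__ == "__main__":
    for s in sorted([PLAIN, WITH_ZWJ, "1000"], key=naive_numeric_key):
        print(ascii(s))
```

Under this key the ZWJ and LRM strings each start with the number 123,
so both sort ahead of "1000", while the plain "123456" sorts after it.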

--Ken
Received on Mon Mar 18 2013 - 16:10:24 CDT