Re: Processing Digit Variants

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Tue, 19 Mar 2013 16:30:26 -0700

On the basis of security considerations, it might be necessary to not
allow variation selectors to "salt" strings for parsing. If the string
cannot be rejected, then the proper thing might be to parse it as if the
variation selectors were not present (on the basis that they do not
affect semantics - by design - setting aside Han for the moment, where
that story isn't totally clear).

Similar considerations would apply to other invisible characters, like
redundant directional marks, as well as joiners and non-joiners. Again,
if their presence can't be used to reject a string, parsing needs to
handle them properly, so that what the user "sees" is what actually gets
parsed.

A./
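
The "parse as if the invisible characters were not present" option can be
sketched as follows. This is only an illustration under stated assumptions,
not how ICU or any LDML implementation behaves: the function names and the
set of characters treated as ignorable (variation selectors, ZWJ/ZWNJ,
LRM/RLM) are chosen here for the example and are not exhaustive.

```python
import unicodedata

# Assumption: an illustrative, non-exhaustive set of invisible characters.
# U+200C ZWNJ, U+200D ZWJ, U+200E LRM, U+200F RLM.
_IGNORABLE_SINGLES = {0x200C, 0x200D, 0x200E, 0x200F}


def is_ignorable(ch):
    """True for variation selectors and the invisible characters listed above."""
    cp = ord(ch)
    return (
        cp in _IGNORABLE_SINGLES
        or 0xFE00 <= cp <= 0xFE0F        # VS1..VS16 (includes U+FE0E, U+FE0F)
        or 0xE0100 <= cp <= 0xE01EF      # VS17..VS256
    )


def parse_decimal(text, strict=False):
    """Parse a string of Unicode decimal digits into an int.

    strict=False: skip ignorable characters, so the parsed value matches
                  what the user sees.
    strict=True:  reject any string that contains an ignorable character.
    """
    value = 0
    seen_digit = False
    for ch in text:
        if is_ignorable(ch):
            if strict:
                raise ValueError("invisible character U+%04X in number" % ord(ch))
            continue
        # unicodedata.decimal covers any character with a decimal digit
        # value, e.g. ASCII digits and FULLWIDTH DIGIT ONE (U+FF11).
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            raise ValueError("not a decimal digit: U+%04X" % ord(ch))
        value = value * 10 + digit
        seen_digit = True
    if not seen_digit:
        raise ValueError("no digits found")
    return value
```

With this sketch, parse_decimal("1\uFE0E2\uFE0E") and
parse_decimal("\uFF11\uFF12") give the same value as parse_decimal("12"),
while strict=True corresponds to the "reject the string" alternative.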

On 3/19/2013 1:45 PM, Richard Wordingham wrote:
> On Mon, 18 Mar 2013 17:28:30 -0700
> "Steven R. Loomis" <srl_at_icu-project.org> wrote:
>> On Monday, March 18, 2013, Richard Wordingham wrote:
>>> The issue is rather with emphatically plain text <U+0031, U+FE0E,
>>> U+0032, U+FE0E>.
>> It's the same situation as something like an implementation of LDML
>> number parsing. U+FE0E is not part of a number.
> I agree that the same arguments are applicable to both parsing and
> collating, though not necessarily with equal force.
>
> Formally, <U+0031, U+FE0E, U+0032, U+FE0E> seems to be just as much a
> number as <U+FF11 FULLWIDTH DIGIT ONE, U+FF12 FULLWIDTH DIGIT TWO>,
> which the current LDML semantics do treat on an even footing with
> "12". If the emoji digits had been encoded as new characters, ICU
> would support them without batting an eyelid. Because the difference
> does not merit full characterhood, they are encoded by a sequence
> rather than a single character. Remember, all that U+FE0E does is to
> request a particular glyph. In a sense, we have 20 new decimal digits,
> <U+0030, U+FE0E> to <U+0039, U+FE0E> and <U+0030, U+FE0F> to <U+0039,
> U+FE0F>.
>
> So, why do you consider <U+0031, U+FE0E, U+0032, U+FE0E> not to be
> a valid decimal number?
>
>>> 10<ZWJ>0<ZWJ>0 would be perfectly reasonable for text
>>> likely to be rendered by a cursive Latin font.
>> Identifying such an edge case does not prove that numeric tailoring is
>> broken.
> An 'edge case' is often just a case that shows that an algorithm that
> often works has not been thought through thoroughly. Now, as CLDR
> seems to value speed above perfect correctness, perhaps handling
> variation sequences will be rejected on that basis. All I was trying
> to find out on this list was whether <U+0031, U+FE0E, U+0032, U+FE0E>
> should be regarded as a proper number.
>
> Special characters intended for just one aspect of text processing
> should not affect other aspects. Unfortunately, a parametric tailoring
> to ignore irrelevant characters while complying with the UCA is not
> quite as simple as just ignoring them. The issues arise with the
> blocking of discontiguous contractions and the possibility that, for
> example, one might wish to collate character variants differently. On
> the other hand, ignoring variation selectors by default might be
> excusable, for they should not occur where they might block canonical
> reordering (antepenultimate paragraph of TUS 6.2.0 Section 16.4).
>
> Richard.
Received on Tue Mar 19 2013 - 18:33:41 CDT
