From: Philippe Verdy (verdy_p@wanadoo.fr)

Date: Sun Jul 25 2010 - 08:15:45 CDT

Scattering does not only affect decimal digits, but also mathematical

operators needed to represent:

- the numeric sign (« - » or « + »), with at least two variants for

the same system to represent the minus sign (either the ambiguous

hyphen-minus, the only one supported in many text-to-number

conversions, or the true mathematical minus sign U+2212 « − » that has

the same width as the plus sign), including some « alternating signs »

that exist in two opposite versions (« ± », « ∓ »);

- the characters that represent the decimal separator (« . » or « , »)

which is almost always needed but locale-specific (this is not just a

property of the script);

- the optional character used to note exponential notations and used

in text-to-number conversion (usually « e » or « E »);

- the optional characters used in the conventional formatting for

grouping digits (NNBSP alias « fine », with possible automatic

fallback to THINSP in font renderers and in rich-text documents

controlling the breaking property with separate style, or fallback to

NBSP in plain-text documents, or fallback to standard SPACE in

preformatted plain-text documents, « , », or « ' », and possibly other

punctuations in their « wide » form, for ideographic scripts).
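
As a rough illustration of the list above, here is a minimal Python sketch that normalizes the sign, the decimal separator and the grouping separators before conversion. The function name parse_decimal, its parameters and the separator sets are assumptions made for this example (they are not exhaustive and do not come from any library); the decimal separator is passed explicitly because it depends on the locale, not on the script.

```python
# Hypothetical helper: the separator sets below are illustrative, not exhaustive.
# NNBSP, THINSP, NBSP, SPACE and apostrophe as possible grouping separators.
GROUP_SEPARATORS = {"\u202f", "\u2009", "\u00a0", " ", "'"}
# HYPHEN-MINUS (U+002D) and the true MINUS SIGN (U+2212).
MINUS_SIGNS = {"-", "\u2212"}


def parse_decimal(text, decimal_sep=".", extra_group_seps=()):
    """Convert a formatted decimal string to a float.

    decimal_sep is the locale's decimal separator; extra_group_seps adds
    locale-specific grouping characters such as "," or ".".
    """
    # The decimal separator can never also act as a grouping separator.
    group = (GROUP_SEPARATORS | set(extra_group_seps)) - {decimal_sep}
    normalized = []
    for ch in text.strip():
        if ch in MINUS_SIGNS:
            normalized.append("-")   # map both minus variants to ASCII
        elif ch == decimal_sep:
            normalized.append(".")   # float() expects a full stop
        elif ch in group:
            continue                 # drop grouping separators
        else:
            normalized.append(ch)    # digits, "+", "e"/"E" pass through
    return float("".join(normalized))
```

For instance, parse_decimal("\u22121\u202f234,5", decimal_sep=",") handles a true minus sign, an NNBSP group separator and a comma as decimal separator in one pass.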

Some of them exist in exponent/superscript or index/subscript

versions (notably digits and decimal separators), but not all of them

(not all separators for grouping digits; using NNBSP may not be

appropriate as its width is not adjusted and it does not have the

semantics of a superscript or subscript).

For generality, it seems better to assume that digits and other

characters needed to note numbers in the positional decimal system may

be scattered (libraries may still avoid the small overhead of

performing table lookups, by just inspecting a property of the

character '0' or of the convention in use, that will either say that it

starts a contiguous range, or that the complete sequence is stored in

a lookup array for the 10 digits).
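
The optimization described above could be sketched as follows in Python. The name make_digit_lookup is hypothetical; the idea is only that a library checks once whether the ten digits form a contiguous range, and falls back to a lookup table when they are scattered (as the superscript digits are).

```python
def make_digit_lookup(digits):
    """Build a char -> value function for a string holding digits 0..9 in order."""
    if len(digits) != 10:
        raise ValueError("expected exactly 10 digit characters")
    zero = ord(digits[0])

    if all(ord(d) == zero + i for i, d in enumerate(digits)):
        # Contiguous range: a subtraction from the code point of '0' is enough.
        def value(ch):
            v = ord(ch) - zero
            if 0 <= v <= 9:
                return v
            raise ValueError(f"not a digit of this set: {ch!r}")
        return value

    # Scattered digits: fall back to an explicit lookup table.
    table = {d: i for i, d in enumerate(digits)}

    def value(ch):
        try:
            return table[ch]
        except KeyError:
            raise ValueError(f"not a digit of this set: {ch!r}") from None
    return value


ascii_value = make_digit_lookup("0123456789")
# Superscript digits: U+2070, U+00B9, U+00B2, U+00B3, then U+2074..U+2079.
superscript_value = make_digit_lookup(
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
)
```

Both returned functions have the same interface, so the caller does not need to know which strategy was selected.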

The general category "Nd" may not always be accurate to find all

digits usable in decimal notations of integers, because the sequence

may have been incomplete when it was first encoded, and completed

later in scattered positions.

In this case, the digits will often have a general property of "No"

(or even "Nl") that will remain stable. What should also be stable is

their numeric value property (but I'm not sure that this is the case

of "Nl" digits, notably for script systems using letters in a way

similar to Greek or Hebrew letters as digits, even if Greek and Hebrew

digits are not encoded separately from the letters that these number

notations are borrowing).

Also I'm not sure that scripts that define "half-digits", or digits

with higher numeric values than 9, permit the use of their

digits with a numeric value between 0 and 9, in a positional decimal

system. The Roman numeric system is such a numeric system (borrowing

some scattered Latin letters and adding a few other specific digits)

where this will be completely wrong.
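
The distinction made in the last paragraphs between the general category and the numeric properties can be inspected with Python's standard unicodedata module. This is only a small probe, and the helper names describe and usable_in_positional_decimal are made up for this example; the results depend on the Unicode version bundled with the interpreter, although the properties of the characters used here are long-standing.

```python
import unicodedata


def describe(ch):
    """Return (general category, decimal value, digit value, numeric value).

    Each numeric field is None when the character lacks that property.
    """
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


def usable_in_positional_decimal(ch):
    """Conservative test: only characters with a decimal value qualify."""
    return unicodedata.decimal(ch, None) is not None


samples = {
    "DIGIT SEVEN": "7",
    "SUPERSCRIPT TWO": "\u00b2",
    "ROMAN NUMERAL TWELVE": "\u216b",
    "VULGAR FRACTION ONE HALF": "\u00bd",
    "GREEK SMALL LETTER ALPHA": "\u03b1",
}
for name, ch in samples.items():
    print(name, describe(ch), usable_in_positional_decimal(ch))
```

The superscript two carries a digit value but no decimal value, the Roman numeral twelve and the fraction one half only have a numeric value, and the Greek letter has none at all, which matches the remark that letters borrowed as digits are not encoded separately.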

Or another base than 10 could be assumed by their positional system,

even if their digits are encoded in a contiguous range of characters

for the subset of values 0 to 9. This is probably no longer the case

with scripts that have modern use, but in historical scripts or in

historical texts using a modern script, the implied base may be

different and would have used more or less distinct digits. So instead

of guessing automatically from the encoded text, it may be preferable

to annotate the text (easy to insert if the conversion of the

historical text uses some rich-text format) to specify how to

interpret the numeric value of the original number.

And sometimes, the conversion to superscripts/subscripts compatibility

characters will not be possible even if some of them may be converted

safely to their numeric value, after detecting that they are in

superscript/subscript and that they don't behave the same as normal

digits (16²⁰ must NOT be interpreted as the numeric value 1620, but

must be parsed as two successive numbers 16 and 20, where the second

one has the semantics of an exponent, as if there was an exponentiation

operator between the two numbers).
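
To make the 16²⁰ case concrete, here is a minimal Python sketch that splits such a string into a base and an exponent instead of reading all digits as one number. The function name split_base_exponent and its restriction to ASCII base digits and unsigned superscript exponents are simplifying assumptions for this example. Note that a compatibility normalization (NFKC) folds the superscripts into plain digits and yields exactly the wrong reading "1620".

```python
import unicodedata

# Superscript digits 0..9 (scattered: U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079).
SUPERSCRIPT_DIGITS = {
    ch: value
    for value, ch in enumerate(
        "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
    )
}


def split_base_exponent(text):
    """Return (base, exponent); exponent is None when no superscript follows."""
    i = 0
    while i < len(text) and text[i] in "0123456789":
        i += 1
    base_part, exponent_part = text[:i], text[i:]
    if not base_part:
        raise ValueError("missing base digits")
    if not exponent_part:
        return int(base_part), None
    if not all(ch in SUPERSCRIPT_DIGITS for ch in exponent_part):
        raise ValueError(f"unexpected characters after base: {exponent_part!r}")
    exponent = 0
    for ch in exponent_part:
        exponent = exponent * 10 + SUPERSCRIPT_DIGITS[ch]
    return int(base_part), exponent


text = "16\u00b2\u2070"                           # 16 followed by superscript 2 and 0
folded = unicodedata.normalize("NFKC", text)      # loses the exponent semantics
base, exponent = split_base_exponent(text)
```

Here folded is the string "1620", while base and exponent keep the two numbers apart so that the caller can compute base ** exponent if that is what the notation means.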

It is also very frequent that only a few superscript digits will be

supported in one font, and other digits may be borrowed from another

font using a completely distinct style with distinct metrics or may

not be displayed at all (missing glyph). The result is then horrible

if you can't predict which font will be used that supports the 10

digits in a contiguous range of values (even if they are scattered in

the code space).

When converting numbers to text with exponential notations,

superscripts should only be used with care, knowing that this won't be

possible in all scripts, and that only integers without grouping

separators can be used.

Some writing systems (unified as « scripts » in Unicode) will still require you to:

- either use rich-text styling for superscripts used in the

conventional notation of exponents,

- or use an explicit exponentiation operator, such as the ASCII symbol

U+005E "^" (which is not the same as the modifier letter circumflex

U+02C6 "ˆ", and that many fonts render with a glyph size and position

different from those of the combining diacritic and implied by the modifier

letter), or a mathematical operator or modifier letter (like the

upward arrow head U+02C4 "˄" that some fonts render as the

mathematical wedge operator on the baseline U+2227 "∧", or the less

ambiguous upward arrow U+2191 "↑").
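
The formatting side of the two options above could look like the following Python sketch. The function format_power and its use_superscripts flag are hypothetical; it only accepts integer exponents without grouping separators, and falls back to the ASCII "^" operator when superscripts are not wanted (for example because font support cannot be predicted).

```python
# Map ASCII digits and hyphen-minus to their superscript counterparts
# (U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079, and SUPERSCRIPT MINUS U+207B).
TO_SUPERSCRIPT = str.maketrans(
    "0123456789-",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207b",
)


def format_power(base, exponent, use_superscripts=True):
    """Format base raised to an integer exponent as plain text."""
    if not isinstance(exponent, int):
        raise TypeError("only integer exponents can be written as superscripts")
    if use_superscripts:
        return f"{base}{str(exponent).translate(TO_SUPERSCRIPT)}"
    # Fallback: explicit exponentiation operator U+005E.
    return f"{base}^{exponent}"
```

A caller that cannot rely on rich-text styling or on a font covering all ten superscript digits would pass use_superscripts=False and get "16^20" rather than a mix of glyphs from different fonts.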

Philippe.

