**From:** CE Whitehead (*cewcathar@hotmail.com*)

**Date:** Sun Jul 25 2010 - 15:24:01 CDT

**Previous message:** Asmus Freytag: "Re: Reasonable to propose stability policy on numeric type = decimal"

**In reply to:** karl williamson: "Re: Reasonable to propose stability policy on numeric type = decimal"

**Next in thread:** CE Whitehead: "RE: Reasonable to propose stability policy on numeric type = decimal"

Damn this indecision; I don't know; shall I take an axe to it, or shall I let it grow. --from Judith Wright, "That Seed"

*> Date: Sun, 25 Jul 2010 10:43:11 -0600
*

*> From: public@khwilliamson.com
*

*> To: verdy_p@wanadoo.fr
*

*> CC: kent.karlsson14@telia.com; unicode@unicode.org
*

*> Subject: Re: Reasonable to propose stability policy on numeric type = decimal
*

*>
*

*> Philippe Verdy wrote:
*

*> > "Kent Karlsson" <kent.karlsson14@telia.com> wrote:
*

*> >> Den 2010-07-25 03.09, skrev "Michael Everson" <everson@evertype.com>:
*

*> >>> On 25 Jul 2010, at 02:02, Bill Poser wrote:
*

*> >>>> As I said, it isn't a huge issue, but scattering the digits makes the
*

*> >>>> programming a bit more complex and error-prone and the programs a little less
*

*> >>>> efficient.
*

*> >>> But it would still *work*. So my hyperbole was not outrageous. And nobody has
*

*> >>> actually scattered them. Though there are various types of "runs" in existing
*

*> >>> encoded digits and numbers.
*

*> >> While not formally of general category Nd (they are "No"), the superscript
*

*> >> digits are a bit scattered:
*

*> >>
*

*> >> 00B2;SUPERSCRIPT TWO
*

*> >> 00B3;SUPERSCRIPT THREE
*

*> >> 00B9;SUPERSCRIPT ONE
*

*> >> 2070;SUPERSCRIPT ZERO
*

*> >> 2074;SUPERSCRIPT FOUR
*

*> >> ...
*

*> >> 2079;SUPERSCRIPT NINE
*

*> >>
*

*> >> And there are situations where one wants to interpret them as in a
*

*> >> decimal-position system.
*

*> >
*

*> > Scattering does not only affect decimal digits, but also mathematical
*

*> > operators needed to represent:
*

*> >
*

*> > - the numeric sign (« - » or « + »), with at least two variants for
*

*> > the same system to represent the minus sign (either the ambiguous
*

*> > hyphen-minus, the only one supported in many text-to-number
*

*> > conversions, or the true mathematical minus sign U+2212 « − » that has
*

*> > the same width as the plus sign), including some « alternating signs »
*

*> > that exist in two opposite versions (« ± », « ∓ »);
*

*> >
*

*> > - the characters that represent the decimal separator (« . » or « , »)
*

*> > which is almost always needed but locale-specific (this is not just a
*

*> > property of the script);
*

*> >
*

*> > - the optional character used to note exponential notations and used
*

*> > in text-to-number conversion (usually « e » or « E »);
*

*> >
*

*> > - the optional characters used in the conventional formatting for
*

*> > grouping digits (NNBSP alias « fine », with possible automatic
*

*> > fallback to THINSP in font renderers and in rich-text documents
*

*> > controlling the breaking property with separate style, or fallback to
*

*> > NBSP in plain-text documents, or fallback to standard SPACE in
*

*> > preformatted plain-text documents, « , », or « ' », and possibly other
*

*> > punctuations in their « wide » form, for ideographic scripts).
*

*> >
*

*> > Some of them exist in exponent/superscript or index/subscript
*

*> > versions (notably digits and decimal separators), but not all of them
*

*> > (not all separators for grouping digits, using NNBSP may not be
*

*> > appropriate as its width is not adjusted and it does not have the
*

*> > semantic of a superscript or subscript).
*

*> >
*

*> > For generality, it seems better to assume that digits and other
*

*> > characters needed to note numbers in the positional decimal system may
*

*> > be scattered (libraries may still avoid the small overhead of
*

*> > performing table lookups, by just inspecting a property of the
*

*> > character '0' or of the convention used, that will either say that it
*

*> > starts a contiguous range, or that the complete sequence is stored in
*

*> > a lookup array for the 10 digits).
*

*> >
*

*> > The general category "Nd" may not always be accurate to find all
*

*> > digits usable in decimal notations of integers, because the sequence
*

*> > may have been incomplete when it was first encoded, and completed
*

*> > later in scattered positions.
*

*> >
*

*> > In this case, the digits will often have a general property of "No"
*

*> > (or even "Nl") that will remain stable. What should also be stable is
*

*> > their numeric value property (but I'm not sure that this is the case
*

*> > of "Nl" digits, notably for script systems using letters in a way
*

*> > similar to Greek or Hebrew letters as digits, even if Greek and Hebrew
*

*> > digits are not encoded separately from the letters that these number
*

*> > notations are borrowing).
*

*> >
*

*> > Also I'm not sure that scripts that define "half-digits", or digits
*

*> > with higher numeric values than 9, are permitting the use of their
*

*> > digits with a numeric value between 0 and 9, in a positional decimal
*

*> > system. The Roman numeric system is such a numeric system (borrowing
*

*> > some scattered Latin letters and adding a few other specific digits)
*

*> > where this will be completely wrong.
*

*> >
*

*> > Or another base than 10 could be assumed by their positional system,
*

*> > even if their digits are encoded in a contiguous range of characters
*

*> > for the subset of values 0 to 9. This is probably no longer the case
*

*> > with scripts that have modern use, but in historical scripts or in
*

*> > historical texts using a modern script, the implied base may be
*

*> > different and would have used more or less distinct digits. So instead
*

*> > of guessing automatically from the encoded text, it may be preferable
*

*> > to annotate the text (easy to insert if the conversion of the
*

*> > historical text uses some rich-text format) to specify how to
*

*> > interpret the numeric value of the original number.
*

*> >
*

*> > And sometimes, the conversion to superscripts/subscripts compatibility
*

*> > characters will not be possible even if some of them may be converted
*

*> > safely to their numeric value, after detecting that they are in
*

*> > superscript/subscript and that they don't behave the same as normal
*

*> > digits (16²⁰ must NOT be interpreted as the numeric value 1620, but
*

*> > must be parsed as two successive numbers 16 and 20, where the second
*

*> > one has the semantic of an exponent, as if there was an exponentiation
*

*> > operator between the two numbers).
*

*> >
*

*> > It is also very frequent that only a few superscript digits will be
*

*> > supported in one font, and other digits may be borrowed from another
*

*> > font using a completely distinct style with distinct metrics or may
*

*> > not be displayed at all (missing glyph). The result is then horrible
*

*> > if you can't predict which font will be used that support the 10
*

*> > digits in a contiguous range of values (even if they are scattered in
*

*> > the code space).
*

*> >
*

This does seem relevant to me.
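
The earlier point about 16²⁰ can be made concrete. Below is a minimal Python sketch, not anyone's actual implementation, that reads a run of decimal digits as the base and a trailing run of superscript digits as the exponent, rather than gluing them together into 1620. The function name `split_base_and_exponent` is hypothetical, and the sketch assumes superscript digits can be recognised by their `<super>` compatibility decomposition; it does not check that the base digits all come from one script.

```python
import unicodedata

def split_base_and_exponent(text):
    """Split text such as '16²⁰' into (base, exponent) integers.

    Assumption: the base is a run of decimal digits (Nt=De) and the
    exponent, if present, is a trailing run of superscript digits.
    """
    base_digits = []
    exp_digits = []
    for ch in text:
        dec = unicodedata.decimal(ch, None)
        if dec is not None:
            # An ordinary decimal digit; it may not follow a superscript.
            if exp_digits:
                raise ValueError("decimal digit after superscript digit")
            base_digits.append(dec)
            continue
        # Not Nt=De: accept only characters that carry a digit value
        # and decompose to <super> plus an ordinary digit.
        sup = unicodedata.digit(ch, None)
        if sup is None or not unicodedata.decomposition(ch).startswith("<super>"):
            raise ValueError(f"unexpected character {ch!r}")
        exp_digits.append(sup)
    if not base_digits:
        raise ValueError("no base digits found")

    def to_int(digits):
        return int("".join(str(d) for d in digits))

    exponent = to_int(exp_digits) if exp_digits else None
    return to_int(base_digits), exponent
```

With this, `split_base_and_exponent("16²⁰")` gives the pair 16 and 20, while plain `"1620"` gives 1620 with no exponent.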

*> > When converting numbers to text with exponential notations, the use of
*

*> > superscripts should only be used with care, knowing that this won't be
*

*> > possible in all scripts, and that only integers without grouping
*

*> > separators can be used.
*

*> >
*

*> > Some writing systems (unified as « scripts » in Unicode) will still require to:
*

*> >
*

*> > - either use rich-text styling for superscripts used in the
*

*> > conventional notation of exponents,
*

*> >
*

*> > - or use an explicit exponentiation operator, such as the ASCII symbol
*

*> > U+005E "^" (which is not the same as a modifier letter circumflex
*

*> > U+02C6 "ˆ", and that many fonts render with a glyph size and position
*

*> > different from the combining diacritic and implied by the modifier
*

*> > letter), or a mathematical operator or modifier letter (like the
*

*> > upward arrow head U+02C4 "˄" that some fonts render as the
*

*> > mathematical wedge operator on the baseline U+2227 "∧", or the less
*

*> > ambiguous upward arrow U+2191 "↑").
*

*> >
*

*> > Philippe.
*

*> >
*

*> >
*

*> >
*

*> That all may be true, but it is really beside the point.
*

*>
*

*> I'm considering extending an existing computer programming language
*

*> which currently only understands numbers composed solely of the ASCII
*

*> numbers to also understand those from other scripts. I'm not going to
*

*> do it unless it is easy within the existing implementation (not some
*

*> theoretical better implementation) and efficient and not a security threat.
*

*>
*

*> The symbols for operators like exponentiation are already set in stone,
*

*> and their being scattered isn't relevant. Likewise, non-decimal-digit
*

*> numbers, like subscripts, are also not relevant.
*

*>
*

*> I found a way to do the implementation that meets all my criteria, but
*

*> is based on the existing pattern of Gc=Nd (or Nt=De) code point
*

*> assignments. The assignments have so far been prudent, to use Asmus'
*

*> term. I was merely trying to see if this prudence could be codified so
*

*> that my implementation wouldn't get obsoleted on a whim in some future
*

*> Unicode release.
*

*>
*

*> I hadn't thought of the case where a zero is later found or its usage
*

*> develops in a script, and suddenly all the digits in that script change
*

*> from Nt=Di to Nt=De, which because of an existing stability policy would
*

*> necessarily require their general category changing to Nd.
*

*>
*

*> Prudence would dictate, then, that when assigning code points to the
*

*> numbers in a script, a contiguous block of 12-13 be reserved for
*

*> them, such that the first one in the block be set aside for ZERO; the
*

*> next for ONE, etc.
*

*>
*

*> My original question comes down to then, would it be reasonable to
*

*> codify this prudence? People have said it will never happen. But no
*

*> one has said why that is.
*

*>
*

*> Obviously, things can happen that will mess this up--the Phaistos disk
*

*> could turn out to be a base-46 numbering system, as an extremely
*

*> unlikely example. But by dictating prudence now, most such eventualities
*

*> wouldn't happen.
*

*>
*

*> I have since looked at the Nt=Di characters. The ones that aren't in
*

*> contiguous runs are the superscripts and ones that would never be
*

*> considered to be decimal digits, such as a circled ZERO.
*

Hi

Are you proposing that superscripts be in contiguous runs or not? Above you disallowed subscripts, although I think subscripts, like superscripts, have mathematical meaning in equations, and it might be worth converting them, albeit separately from other numbers. Converting them would allow complete equations to be converted from character strings. With only the digits 1-9, though, I do not see that much of an issue.

I'd personally like to find a subscript i. So far I've looked at http://unicode.org/charts/PDF/U2070.pdf, where the subscripts 0-9 are all contiguous but the superscripts 1, 2, and 3 are not. Searching through http://unicode.org/Public/UNIDATA/UnicodeData.txt, that was all I found. I then started going through the code charts one by one and have gotten as far as Old South Arabian without finding a superscript i or any more superscript decimal numbers, though maybe I've missed something. (The Arabic sukun is not going to be part of a series of superscripts in any case.)
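
For anyone who wants to repeat the check without paging through the charts, here is a rough Python sketch using the standard `unicodedata` module. The helper names `is_contiguous` and `describe` are made up for illustration, and the results depend on the Unicode version bundled with the Python in use. It shows that the subscript digits sit in one run, the superscript digits do not, and that all of them are category "No" with a digit value but no decimal value.

```python
import unicodedata

# Superscript and subscript digits, listed in order of digit value 0-9.
SUPERSCRIPTS = [0x2070, 0x00B9, 0x00B2, 0x00B3, 0x2074,
                0x2075, 0x2076, 0x2077, 0x2078, 0x2079]
SUBSCRIPTS = list(range(0x2080, 0x208A))

def is_contiguous(code_points):
    """True if each code point is exactly one more than the previous one."""
    return all(b - a == 1 for a, b in zip(code_points, code_points[1:]))

def describe(code_points):
    """Return (general category, digit value, has decimal value) per character."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append((
            unicodedata.category(ch),                   # e.g. 'No' or 'Nd'
            unicodedata.digit(ch, None),                # set for Nt=Di and Nt=De
            unicodedata.decimal(ch, None) is not None,  # set for Nt=De only
        ))
    return rows

if __name__ == "__main__":
    print("subscripts contiguous:  ", is_contiguous(SUBSCRIPTS))
    print("superscripts contiguous:", is_contiguous(SUPERSCRIPTS))
    for cp, row in zip(SUPERSCRIPTS, describe(SUPERSCRIPTS)):
        print(f"U+{cp:04X}", unicodedata.name(chr(cp)), row)
```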

*> The only run
*

*> in the BMP which doesn't have a zero is Ethiopic. It seems extremely
*

*> unlikely to me that a zero will be discovered or come into use with that
*

*> script. I'm guessing that they have adopted European numbers in order
*

*> to have commerce with the rest of the world.
*

*>
*

*> There are several runs in the SMP, but the code point where a zero would
*

*> go isn't assigned.
*

*>
*

*> I don't know for sure, but it appears to me that we are running out of
*

*> non-dead scripts to encode. I see that draft 6.0 has only 544 BMP code
*

*> points not in any block and not much in the pipeline. I would think
*

*> that most any script yet to be encoded would have borrowed numbering
*

*> systems from their neighbors.
*

*>
*

*> And there is still plenty of space in the SMP, so this proposal to
*

*> require prudence should not use up too many precious unassigned code points.
*

*>
*

If it does not take up too much space, I support this proposal, although characters in general are not contiguous in any case, so for doing sorts and such this will not normally help much.
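
To show what the proposal would protect, here is a small Python sketch of the kind of shortcut that depends on contiguous runs. It is my own guess at the general idea, not Karl's actual implementation, and the names `zero_of`, `parse_decimal` and `all_runs_contiguous` are hypothetical. It assumes that every Nt=De character sits at "zero of its run plus its value", which is exactly the pattern in question.

```python
import unicodedata

def zero_of(ch):
    """Code point of the zero in ch's run, assuming the run is contiguous."""
    value = unicodedata.decimal(ch)  # raises ValueError if ch is not Nt=De
    return ord(ch) - value

def parse_decimal(text):
    """Convert a string of decimal digits from a single run to an int."""
    if not text:
        raise ValueError("empty string")
    zero = zero_of(text[0])
    result = 0
    for ch in text:
        offset = ord(ch) - zero
        # Rejects non-digits and digits from a different run, which also
        # guards against mixed-script numbers.
        if unicodedata.decimal(ch, None) != offset:
            raise ValueError(f"{ch!r} is not in the same run as {text[0]!r}")
        result = result * 10 + offset
    return result

def all_runs_contiguous():
    """Check the contiguity assumption against this Python's Unicode data."""
    for cp in range(0x110000):
        value = unicodedata.decimal(chr(cp), None)
        if value is not None and unicodedata.decimal(chr(cp - value), None) != 0:
            return False
    return True
```

For example, `parse_decimal("٤٢")` with Arabic-Indic digits gives 42, and a string mixing ASCII and Arabic-Indic digits is rejected. If a future release put a digit outside its run, `all_runs_contiguous()` would return False and the offset arithmetic would have to fall back to a table lookup.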

Best,

C. E. Whitehead

cewcathar@hotmail.com


*This archive was generated by hypermail 2.1.5: Sun Jul 25 2010 - 15:27:40 CDT*