Re: Reasonable to propose stability policy on numeric type = decimal

From: karl williamson (public@khwilliamson.com)
Date: Sun Jul 25 2010 - 11:43:11 CDT

    Philippe Verdy wrote:
    > "Kent Karlsson" <kent.karlsson14@telia.com> wrote:
    >> Den 2010-07-25 03.09, skrev "Michael Everson" <everson@evertype.com>:
    >>> On 25 Jul 2010, at 02:02, Bill Poser wrote:
    >>>> As I said, it isn't a huge issue, but scattering the digits makes the
    >>>> programming a bit more complex and error-prone and the programs a little less
    >>>> efficient.
    >>> But it would still *work*. So my hyperbole was not outrageous. And nobody has
    >>> actually scattered them. Though there are various types of "runs" in existing
    >>> encoded digits and numbers.
    >> While not formally of general category Nd (they are "No"), the superscript
    >> digits are a bit scattered:
    >>
    >> 00B2;SUPERSCRIPT TWO
    >> 00B3;SUPERSCRIPT THREE
    >> 00B9;SUPERSCRIPT ONE
    >> 2070;SUPERSCRIPT ZERO
    >> 2074;SUPERSCRIPT FOUR
    >> ...
    >> 2079;SUPERSCRIPT NINE
    >>
    >> And there are situations where one wants to interpret them as in a
    >> decimal-position system.
    >
    > Scattering does not only affect decimal digits, but also mathematical
    > operators needed to represent:
    >
    > - the numeric sign (« - » or « + »), with at least two variants for
    > the same system to represent the minus sign (either the ambiguous
    > hyphen-minus, the only one supported in many text-to-number
    > conversions, or the true mathematical minus sign U+2212 « − » that has
    > the same width as the plus sign), including some « alternating signs »
    > that exist in two opposite versions (« ± », « ∓ »);
    >
    > - the characters that represent the decimal separator (« . » or « , »)
    > which is almost always needed but locale-specific (this is not just a
    > property of the script);
    >
    > - the optional character used to note exponential notations and used
    > in text-to-number conversion (usually « e » or « E »);
    >
    > - the optional characters used in the conventional formatting for
    > grouping digits (NNBSP alias « fine », with possible automatic
    > fallback to THINSP in font renderers and in rich-text documents
    > controlling the breaking property with separate style, or fallback to
    > NBSP in plain-text documents, or fallback to standard SPACE in
    > preformatted plain-text documents, « , », or « ' », and possibly other
    > punctuations in their « wide » form, for ideographic scripts).
    >
    > Some of them exist in exponent/superscript or index/subscript
    > versions (notably digits and decimal separators), but not all of them
    > (not all separators for grouping digits, using NNBSP may not be
    > appropriate as its width is not adjusted and it does not have the
    > semantic of a superscript or subscript).
    >
    > For generality, it seems better to assume that digits and other
    > characters needed to note numbers in the positional decimal system may
    > be scattered (libraries may still avoid the small overhead of
    > performing table lookups by just inspecting a property of the
    > character '0' or of the convention in use, which will say either that it
    > starts a contiguous range or that the complete sequence is stored in
    > a lookup array for the 10 digits).
    >
    > The general category "Nd" may not always be accurate to find all
    > digits usable in decimal notations of integers, because the sequence
    > may have been incomplete when it was first encoded, and completed
    > later in scattered positions.
    >
    > In this case, the digits will often have a general property of "No"
    > (or even "Nl") that will remain stable. What should also be stable is
    > their numeric value property (but I'm not sure that this is the case
    > of "Nl" digits, notably for scripts systems using letters in a way
    > similar to Greek or Hebrew letters as digits, even if Greek and Hebrew
    > digits are not encoded separately from the letters that these number
    > notations are borrowing).
    >
    > Also I'm not sure that scripts that define "half-digits", or digits
    > with higher numeric values than 9, are permitting the use of their
    > digits with a numeric value between 0 and 9, in a positional decimal
    > system. The Roman numeric system is such a numeric system (borrowing
    > some scattered Latin letters and adding a few other specific digits)
    > where this will be completely wrong.
    >
    > Or another base than 10 could be assumed by their positional system,
    > even if their digits are encoded in a contiguous range of characters
    > for the subset of values 0 to 9. This is probably no longer the case
    > with scripts that have modern use, but in historical scripts or in
    > historical texts using a modern script, the implied base may be
    > different and would have used more or less distinct digits. So instead
    > of guessing automatically from the encoded text, it may be preferable
    > to annotate the text (easy to insert if the conversion of the
    > historical text uses some rich-text format) to specify how to
    > interpret the numeric value of the original number.
    >
    > And sometimes, the conversion to superscripts/subscripts compatibility
    > characters will not be possible even if some of them may be converted
    > safely to their numeric value, after detecting that they are in
    > superscript/subscript and that they don't behave the same as normal
    > digits (16²⁰ must NOT be interpreted as the numeric value 1620, but
    > must be parsed as two successive numbers 16 and 20, where the second
    > one has the semantic of an exponent, as if there was an exponentiation
    > operator between the two numbers).
    >
    > It is also very frequent that only a few superscript digits will be
    > supported in one font, and other digits may be borrowed from another
    > font using a completely distinct style with distinct metrics or may
    > not be displayed at all (missing glyph). The result is then horrible
    > if you can't predict which font will be used that supports the 10
    > digits in a contiguous range of values (even if they are scattered in
    > the code space).
    >
    > When converting numbers to text with exponential notations,
    > superscripts should be used only with care, knowing that this won't be
    > possible in all scripts, and that only integers without grouping
    > separators can be used.
    >
    > Some writing systems (unified as « scripts » in Unicode) will still require to:
    >
    > - either use rich-text styling for superscripts used in the
    > conventional notation of exponents,
    >
    > - or use an explicit exponentiation operator, such as the ASCII symbol
    > U+005E "^" (which is not the same as the modifier letter circumflex
    > U+02C6 "ˆ", and which many fonts render with a glyph size and position
    > different from those of the combining diacritic and those implied by the
    > modifier letter), or a mathematical operator or modifier letter (like the
    > upward arrow head U+02C4 "˄" that some fonts render as the
    > mathematical wedge operator on the baseline U+2227 "∧", or the less
    > ambiguous upward arrow U+2191 "↑").
    >
    > Philippe.
    >
    >
    >
    That all may be true, but it is really beside the point.

    I'm considering extending an existing computer programming language
    which currently only understands numbers composed solely of the ASCII
    digits to also understand those from other scripts. I'm not going to
    do it unless it is easy within the existing implementation (not some
    theoretical better implementation) and efficient and not a security threat.

    The symbols for operators like exponentiation are already set in stone,
    and their being scattered isn't relevant. Likewise, non-decimal-digit
    numbers, like subscripts, are also not relevant.

    I found a way to do the implementation that meets all my criteria, but
    is based on the existing pattern of Gc=Nd (or Nt=De) code point
    assignments. The assignments have so far been prudent, to use Asmus'
    term. I was merely trying to see if this prudence could be codified so
    that my implementation wouldn't get obsoleted on a whim in some future
    Unicode release.
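
    A minimal sketch of what relying on that pattern could look like. The
    message names neither the language nor the implementation, so Python,
    its unicodedata module, and the helper names build_zero_table,
    runs_are_contiguous and digit_value are all assumptions made for
    illustration. The idea shown is only that, if every Gc=Nd digit sits in
    a run of ten consecutive code points starting at ZERO, a digit's value
    is its offset from the zero of its run.

```python
import sys
import unicodedata

def build_zero_table():
    """Collect the code point of every Gc=Nd character whose value is 0."""
    zeros = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch, None) == 0:
            zeros.append(cp)
    return zeros

def runs_are_contiguous(zeros):
    """True if each zero is followed by ONE..NINE at the next nine code points."""
    return all(
        unicodedata.decimal(chr(z + i), None) == i
        for z in zeros
        for i in range(10)
    )

def digit_value(ch, zeros):
    """Value of a decimal digit as its offset from its run's zero, else None."""
    cp = ord(ch)
    for z in zeros:
        if z <= cp <= z + 9:
            return cp - z
    return None
```

    The check in runs_are_contiguous is the assumption the whole approach
    rests on: a future release that placed a decimal digit outside such a
    run would make digit_value silently wrong for that script.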

    I hadn't thought of the case where a zero is later found or its usage
    develops in a script, and suddenly all the digits in that script change
    from Nt=Di to Nt=De, which because of an existing stability policy would
    necessarily require their general category changing to Nd.
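
    The relationship between Nt and Gc can be probed with a rough sketch
    like the following. It assumes Python's unicodedata module, which does
    not expose Numeric_Type directly; the hypothetical helper numeric_type
    approximates it from which of the decimal, digit and numeric lookups
    succeed, and consistent_with_policy checks only the direction stated
    above (Nt=De requires Gc=Nd).

```python
import unicodedata

def numeric_type(ch):
    """Approximate Numeric_Type from the three unicodedata value lookups."""
    if unicodedata.decimal(ch, None) is not None:
        return "De"  # has a decimal value
    if unicodedata.digit(ch, None) is not None:
        return "Di"  # has a digit value but no decimal value
    if unicodedata.numeric(ch, None) is not None:
        return "Nu"  # has only a numeric value
    return None

def consistent_with_policy(ch):
    """A character classified as Nt=De must have general category Nd."""
    return numeric_type(ch) != "De" or unicodedata.category(ch) == "Nd"

if __name__ == "__main__":
    for ch in ["5", "\u0663", "\u00b2", "\u1369", "\u00bd"]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), numeric_type(ch))
```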

    Prudence would dictate, then, that when assigning code points to the
    numbers in a script, a contiguous block of 12-13 be reserved for
    them, such that the first one in the block is set aside for ZERO, the
    next for ONE, etc.

    My original question then comes down to: would it be reasonable to
    codify this prudence? People have said it will never happen. But no
    one has said why that is.

    Obviously, things can happen that will mess this up--the Phaistos disk
    could turn out to be a base-46 numbering system, as an extremely
    unlikely example. But by dictating prudence now, most such eventualities
    wouldn't happen.

    I have since looked at the Nt=Di characters. The ones that aren't in
    contiguous runs are the superscripts and ones that would never be
    considered to be decimal digits, such as a circled ZERO. The only run
    in the BMP which doesn't have a zero is Ethiopic. It seems extremely
    unlikely to me that a zero will be discovered or come into use with that
    script. I'm guessing that they have adopted European numbers in order
    to have commerce with the rest of the world.
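
    A survey of this kind can be reproduced with a short sketch such as the
    one below. It is not the method used for the observations above; it
    assumes Python's unicodedata module and approximates Nt=Di as "has a
    digit value but no decimal value". The names is_digit_only,
    digit_only_runs and describe are hypothetical, and the results depend
    on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

def is_digit_only(ch):
    """Approximate Nt=Di: a digit value is defined but a decimal value is not."""
    return (unicodedata.digit(ch, None) is not None
            and unicodedata.decimal(ch, None) is None)

def digit_only_runs():
    """Group Nt=Di-like code points into maximal runs of consecutive code points."""
    runs = []
    for cp in range(sys.maxunicode + 1):
        if not is_digit_only(chr(cp)):
            continue
        if runs and runs[-1][-1] == cp - 1:
            runs[-1].append(cp)   # extends the current run
        else:
            runs.append([cp])     # starts a new run
    return runs

def describe(run):
    """One-line summary of a run: its code point range and its digit values."""
    values = [unicodedata.digit(chr(cp)) for cp in run]
    return f"U+{run[0]:04X}..U+{run[-1]:04X} values {values}"

if __name__ == "__main__":
    for run in digit_only_runs():
        print(describe(run))
```

    In the output, the Ethiopic digits appear as a single run whose values
    start at 1, while the superscript digits show up split across several
    short runs.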

    There are several runs in the SMP, but the code point where a zero would
    go isn't assigned.

    I don't know for sure, but it appears to me that we are running out of
    non-dead scripts to encode. I see that draft 6.0 has only 544 BMP code
    points not in any block and not much in the pipeline. I would think
    that most any script yet to be encoded would have borrowed numbering
    systems from their neighbors.

    And there is still plenty of space in the SMP, so this proposal to
    require prudence should not use up too many precious unassigned code points.


