Re: Reasonable to propose stability policy on numeric type = decimal

From: Philippe Verdy (
Date: Sun Jul 25 2010 - 08:15:45 CDT

    "Kent Karlsson" <> wrote:
    > Den 2010-07-25 03.09, skrev "Michael Everson" <>:
    > > On 25 Jul 2010, at 02:02, Bill Poser wrote:
    > >> As I said, it isn't a huge issue, but scattering the digits makes the
    > >> programming a bit more complex and error-prone and the programs a little less
    > >> efficient.
    > >
    > > But it would still *work*. So my hyperbole was not outrageous. And nobody has
    > > actually scattered them. Though there are various types of "runs" in existing
    > > encoded digits and numbers.
    > While not formally of general category Nd (they are "No"), the superscript
    > digits are a bit scattered:
    > ...
    > And there are situations where one wants to interpret them as in a
    > decimal-position system.

    Scattering does not only affect decimal digits, but also mathematical
    operators needed to represent:

    - the numeric sign (« - » or « + »), with at least two variants for
    the same system to represent the minus sign (either the ambiguous
    hyphen-minus U+002D, the only one supported in many text-to-number
    conversions, or the true mathematical minus sign U+2212 « − » that has
    the same width as the plus sign), including some « alternating signs »
    that exist in two opposite versions (« ± », « ∓ »);

    - the characters that represent the decimal separator (« . » or « , »)
    which is almost always needed but locale-specific (this is not just a
    property of the script);

    - the optional character used to note exponential notations and used
    in text-to-number conversion (usually « e » or « E »);

    - the optional characters used in the conventional formatting for
    grouping digits: NNBSP (U+202F, alias the « fine » space), with
    possible automatic fallback to THIN SPACE (U+2009) in font renderers
    and in rich-text documents that control the breaking property with a
    separate style, to NBSP in plain-text documents, or to standard SPACE
    in preformatted plain-text documents; « , » or « ' »; and possibly
    other punctuation in its « wide » form, for ideographic scripts.
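
    A minimal sketch of how a text-to-number conversion could account for
    the four items above, in Python. The function name parse_decimal and
    the two tables are assumptions made for illustration, not a complete
    inventory and not the behaviour of any existing library; the decimal
    separator is passed by the caller because it depends on the locale.

```python
import unicodedata

# Illustrative tables only (assumptions, deliberately incomplete).
MINUS_SIGNS = frozenset({"-", "\u2212"})  # HYPHEN-MINUS, MINUS SIGN
# NNBSP, THIN SPACE, NBSP, SPACE, apostrophe
DEFAULT_GROUPING = frozenset({"\u202f", "\u2009", "\u00a0", " ", "'"})


def parse_decimal(text, decimal_sep=".", grouping=DEFAULT_GROUPING):
    """Parse a decimal number written with locale-specific conventions."""
    out = []
    for ch in text:
        if ch in MINUS_SIGNS:
            out.append("-")
        elif ch == "+":
            out.append("+")
        elif ch == decimal_sep:
            out.append(".")
        elif ch in grouping:
            continue  # grouping separators carry no numeric value
        elif ch in "eE":
            out.append("e")
        elif ch.isdecimal():
            # Works for any digit with Numeric_Type=Decimal, in any script.
            out.append(str(unicodedata.decimal(ch)))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return float("".join(out))


# French-style input: true minus sign, NNBSP grouping, comma separator.
french = parse_decimal("\u22121\u202f234,5", decimal_sep=",")
# English-style input: comma grouping, full stop separator, exponent.
english = parse_decimal("1,234.5e2", decimal_sep=".", grouping={","})
```

    The sketch does not validate the order of the characters beyond what
    float() rejects, and it ignores « ± » and « ∓ », which do not denote a
    single value.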

    Some of them exist in exponent/superscript or index/subscript
    versions (notably digits and decimal separators), but not all of them:
    not all separators for grouping digits do, and using NNBSP may not be
    appropriate as its width is not adjusted and it does not have the
    semantics of a superscript or subscript.

    For generality, it seems better to assume that digits and other
    characters needed to note numbers in the positional decimal system may
    be scattered. Libraries may still avoid the small overhead of
    performing table lookups by just inspecting a property of the
    character '0', or of the convention in use, that says either that it
    starts a contiguous range or that the complete sequence is stored in
    a lookup array for the 10 digits.
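
    The two strategies can be sketched as follows, in Python. The helper
    name build_digit_lookup is hypothetical; it only illustrates choosing
    between the contiguous-range shortcut and a lookup table, given the
    ten digits of one convention in order.

```python
def build_digit_lookup(digits):
    """Return a function mapping a character to its value 0..9, or None.

    `digits` holds the ten characters for the values 0 to 9, in order.
    """
    if len(digits) != 10:
        raise ValueError("expected exactly ten digits")
    zero = ord(digits[0])
    contiguous = all(ord(d) == zero + i for i, d in enumerate(digits))

    if contiguous:
        # Fast path: one subtraction and one range check, no table.
        def value(ch):
            v = ord(ch) - zero
            return v if 0 <= v <= 9 else None
    else:
        # Scattered digits: fall back to a lookup table.
        table = {d: i for i, d in enumerate(digits)}

        def value(ch):
            return table.get(ch)

    return value


ascii_value = build_digit_lookup("0123456789")
# Superscript digits are scattered: U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079.
super_value = build_digit_lookup(
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
)
```

    Callers see the same interface in both cases, so the choice stays an
    internal optimisation.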

    The general category "Nd" may not always be accurate to find all
    digits usable in decimal notations of integers, because the sequence
    may have been incomplete when it was first encoded, and completed
    later in scattered positions.

    In this case, the digits will often have a general category of "No"
    (or even "Nl") that will remain stable. What should also be stable is
    their numeric value property (but I'm not sure that this is the case
    for "Nl" digits, notably for writing systems using letters in a way
    similar to Greek or Hebrew letters as digits, even if Greek and Hebrew
    digits are not encoded separately from the letters that these number
    notations are borrowing).
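
    For illustration, these properties can be inspected with Python's
    standard unicodedata module. The helper name describe is an assumption
    of this sketch; the results depend on the Unicode version bundled with
    the interpreter, but the characters below have long been encoded.

```python
import unicodedata


def describe(ch):
    """Return (general category, decimal value, digit value, numeric value).

    A value is None when the character does not have that property.
    """
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


ascii_seven = describe("7")           # ('Nd', 7, 7, 7.0)
superscript_two = describe("\u00b2")  # ('No', None, 2, 2.0)
roman_eight = describe("\u2167")      # ('Nl', None, None, 8.0)
hebrew_alef = describe("\u05d0")      # ('Lo', None, None, None)
```

    Only the first character can be found by testing for "Nd"; the
    superscript digit needs the digit value, the Roman numeral only has a
    numeric value, and the Hebrew letter carries no numeric property at
    all even though it can be used to write numbers.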

    Also, I'm not sure that scripts that define "half-digits", or digits
    with numeric values higher than 9, permit the use of their digits
    with a numeric value between 0 and 9 in a positional decimal
    system. The Roman numeral system is one such system (borrowing
    some scattered Latin letters and adding a few other specific digits)
    where this would be completely wrong.

    Or a base other than 10 could be assumed by their positional system,
    even if their digits are encoded in a contiguous range of characters
    for the subset of values 0 to 9. This is probably no longer the case
    for scripts in modern use, but in historical scripts or in
    historical texts using a modern script, the implied base may be
    different and may have used more or fewer distinct digits. So instead
    of guessing automatically from the encoded text, it may be preferable
    to annotate the text (easy to insert if the conversion of the
    historical text uses some rich-text format) to specify how to
    interpret the numeric value of the original number.

    And sometimes, converting superscript/subscript compatibility
    characters to plain digits will not be possible, even if some of them
    may be converted safely to their numeric value after detecting that
    they are in superscript/subscript and that they don't behave the same
    as normal digits (16²⁰ must NOT be interpreted as the numeric value
    1620, but must be parsed as two successive numbers 16 and 20, where
    the second one has the semantics of an exponent, as if there were an
    exponentiation operator between the two numbers).
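
    A small Python sketch of that parsing rule, restricted for brevity to
    ASCII digits followed by superscript digits. The function name
    split_base_exponent is hypothetical, and signs, separators and
    subscripts are deliberately left out.

```python
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUPERSCRIPT_VALUE = {ch: i for i, ch in enumerate(SUPERSCRIPT_DIGITS)}


def split_base_exponent(text):
    """Split text such as '16²⁰' into (16, 20).

    Returns (n, None) when no superscript run follows the base.
    """
    i = 0
    while i < len(text) and text[i] in "0123456789":
        i += 1
    base, rest = text[:i], text[i:]
    if not base:
        raise ValueError("missing base")
    if any(ch not in SUPERSCRIPT_VALUE for ch in rest):
        raise ValueError("unexpected character after the base")
    exponent = None
    if rest:
        exponent = 0
        for ch in rest:
            # Superscript digits are positional among themselves only.
            exponent = exponent * 10 + SUPERSCRIPT_VALUE[ch]
    return int(base), exponent


with_exponent = split_base_exponent("16\u00b2\u2070")  # (16, 20)
plain_number = split_base_exponent("1620")             # (1620, None)
```

    The two runs are never merged into one digit string, which is what
    keeps 16²⁰ from being read as 1620.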

    It is also very frequent that only a few superscript digits are
    supported in one font, and that other digits are borrowed from another
    font using a completely distinct style with distinct metrics, or are
    not displayed at all (missing glyph). The result is then horrible
    if you can't be sure that the font used supports all 10 digits, which
    form a contiguous range of values even if they are scattered in
    the code space.

    When converting numbers to text with exponential notations,
    superscripts should be used only with care, knowing that this won't be
    possible in all scripts, and that only integers without grouping
    separators can be used.

    Some writing systems (unified as « scripts » in Unicode) will still require one to:

    - either use rich-text styling for superscripts used in the
    conventional notation of exponents,

    - or use an explicit exponentiation operator, such as the ASCII symbol
    U+005E "^" (which is not the same as the modifier letter circumflex
    U+02C6 "ˆ", and which many fonts render with a glyph size and position
    different from those of the combining diacritic implied by the
    modifier letter), or a mathematical operator or modifier letter (like
    the upward arrowhead U+02C4 "˄" that some fonts render like the
    mathematical wedge operator on the baseline U+2227 "∧", or the less
    ambiguous upward arrow U+2191 "↑").
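
    As a sketch of the plain-text side of this choice, in Python: an
    integer exponent is written either with superscript compatibility
    characters or with the explicit U+005E operator. The function name
    format_power and its use_superscripts flag are assumptions for
    illustration; rich-text styling is outside the scope of the example.

```python
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUPERSCRIPT_MINUS = "\u207b"  # SUPERSCRIPT MINUS


def format_power(base, exponent, use_superscripts=True):
    """Format base raised to an integer exponent as plain text."""
    if not isinstance(exponent, int):
        # Only integers without grouping separators can be superscripted.
        raise TypeError("exponent must be an integer")
    if use_superscripts:
        sign = SUPERSCRIPT_MINUS if exponent < 0 else ""
        digits = "".join(SUPERSCRIPT_DIGITS[int(d)] for d in str(abs(exponent)))
        return f"{base}{sign}{digits}"
    # Fallback: explicit exponentiation operator U+005E.
    return f"{base}^{exponent}"


superscript_form = format_power(10, 20)                      # '10²⁰'
negative_form = format_power(10, -3)                         # '10⁻³'
caret_form = format_power(10, -3, use_superscripts=False)    # '10^-3'
```

    Whether the superscript form is acceptable still depends on the fonts
    available to the reader, so the caret form is the safer default for
    text whose rendering cannot be predicted.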


    This archive was generated by hypermail 2.1.5 : Sun Jul 25 2010 - 08:21:32 CDT