RE: numeric properties of Nl characters in the UCD

From: Jim Allan (jallan@smrtytrek.com)
Date: Thu Nov 27 2003 - 12:01:39 EST


    Arcane Jill wrote:

    > But there doesn't seem to be any way of specifying operator precedence
    > in Unicode text (by which I mean the precedence of ZWJ compared with the
    > precedence of any modifier). I can see a case for "invisible brackets"
    > here to control such precedence.

    Unicode is intended to encode normal text as written or inscribed by
    human beings and as read by human beings.

    If the plain text is ambiguous (whether in operator precedence or in
    some other way), it is normally not for Unicode to resolve the
    semantics of the text.

    Invisible characters resolve nothing when the text is viewed in a
    normal mode by human beings. The normal purpose for which text is
    created is to be read by human beings.

    Source code is not an exception.

    Source code is intended primarily to allow instructions to a computer
    to be created in a way that is more easily comprehended by human
    beings than binary instructions are. Invisible characters that affect
    the semantics of source code would only create unresolvable ambiguity
    for the human beings who read that source code in normal display or
    who print it out.

    If you want a notation in text to be unambiguous, make it unambiguous
    using characters that can be seen by the humans who interpret it.

    Unicode does provide the invisible operators U+2061 FUNCTION
    APPLICATION, U+2062 INVISIBLE TIMES and U+2063 INVISIBLE SEPARATOR,
    specifically for text that is also intended to be used in
    mathematical calculation. I would be surprised if the use of these
    characters did not turn out to be very dangerous in practice.
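
    A minimal sketch in Python (my own illustration, not anything from
    the standard) of why these characters are risky: two strings that
    render identically may still compare unequal.

        # "2x" with and without U+2062 INVISIBLE TIMES between the characters
        a = "2x"
        b = "2\u2062x"
        print(a, b)            # both display as "2x" in most renderers
        print(a == b)          # False -- the strings differ invisibly
        print([hex(ord(c)) for c in b])   # ['0x32', '0x2062', '0x78']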

    > The review on Ethiopic and Tamil non-decimal digits is interesting, but
    > I can't help but feel it was a culturally biased decision (read:
    > mistake) to EVER have had a "radix ten" property without any similar
    > property for any other radix, thereby forcing non-decimal digits to end
    > up being classified as No (Other_Number) instead of Nd (Decimal_Number).
    > It's a mistake because, even in /my/ culture, digit one followed by
    > digit two is not always interpreted as the number twelve. Phone numbers
    > and PINs are one exception. Version numbers such as "version 12.12.12"
    > are another exception. Octal is another.

    That a character has the property of being a decimal digit makes no
    assertion that the character may not be used in other ways: as an
    octal digit, as a base-25 digit, as a letter with phonetic value in
    some transliteration systems, or as part of a character description
    in Rongorongo transliteration (see
    http://www.rongorongo.org/corpus/codes.html). Unicode places *no*
    limits on how users may use any character. That is not Unicode's
    business.

    The ASCII decimal digits also have the property Hex_Digit. But such a
    digit may in fact be used in ways that are neither decimal nor
    hexadecimal. The properties reflect only what users of the scripts
    that contain these characters see as their normal interpretation.
    They are only useful hints.
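
    To make the "hints" point concrete, here is a small Python sketch
    using the standard unicodedata module (Python does not expose the
    Hex_Digit property directly, so only the decimal properties are
    shown):

        import unicodedata

        ch = "3"
        print(unicodedata.category(ch))   # 'Nd' -- Number, decimal digit
        print(unicodedata.decimal(ch))    # 3
        # Nothing here prevents an application from reading the same
        # character as an octal digit, a base-25 digit, or a yogh stand-in.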

    I see no cultural bias in noting that certain characters in certain
    scripts are primarily used as part of radix ten notation when that is
    indeed the primary meaning of these characters.

    > One implication is that hexadecimal numbers cannot be expressed in
    > Unicode without violating this property. For instance, is the string
    > "U+0012" valid Unicode, given that "the sequence of the ONE character
    > followed by the TWO character is [NOT] interpreted as having the value
    > of twelve"?

    The string "U+0012" is valid Unicode.

    Similarly, the strings "U+0A53", "U+X&@2" and "+U0012" are valid Unicode.
    The interpretation of strings produced by users is not Unicode's
    business. That the meaning of a particular string is nonsense or
    ambiguous is not Unicode's business. The probable meaning of the string
    "&#1234" versus the string "&#x1234" is not Unicode's business. The
    Unicode standard provides no instructions about necessary interpretation
    of strings.

    That "12" might be hexadecimal or octal or something else other than
    decimal twelve in some contexts is outside of any Unicode specification.
    Unicode's task is only to provide a coding that allows representation of
    the string "12".

    As an additional convenience, the Unicode specification provides
    properties that make it easier for processes to find and interpret
    numeric quantities in text. But these properties are really only
    hints, indicating the most common uses of such characters, certainly
    not limiting them to those uses.
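
    As an illustration of how such hints help (my own sketch, not an
    algorithm the standard prescribes), a process can use the
    decimal-digit property to pull base-ten values out of text in any
    script:

        import unicodedata

        def decimal_runs(text):
            # Collect runs of decimal digits (general category Nd) and
            # interpret each run in base ten -- a heuristic the properties
            # make easy, not something the standard mandates.
            out, value, in_run = [], 0, False
            for ch in text:
                if unicodedata.category(ch) == "Nd":
                    value = value * 10 + unicodedata.decimal(ch)
                    in_run = True
                elif in_run:
                    out.append(value)
                    value, in_run = 0, False
            if in_run:
                out.append(value)
            return out

        print(decimal_runs("room 42, Devanagari \u0967\u0968, "
                           "Arabic-Indic \u0663\u0664"))   # [42, 12, 34]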

    If someone wants to represent the medieval spelling of _knight_ as
    _kni3t_ (using "3" instead of the proper small yogh, U+021D) because
    the yogh is likely not to appear properly in many applications, they
    may certainly do so even though "3" is not a letter.

    > Perhaps it would have made sense to simply have different properties all
    > round, such as: "number positional" for digits in any radix; "number
    > integer" for integer types such as circled 2 which can't be used
    > positionally; "number fraction" for fractions, and "number other" for
    > everything else. Or maybe some other similar scheme. Is it too late to
    > change things now?

    Judging from the past, additional properties will be added to the
    Unicode specification. The reason for new properties being added should
    be that they are *generally useful for character handling* rather than
    that they are useful to specialized applications. Specialized
    applications can and should define their own properties for their own
    needs or use.

    As to "'number positional'" for digits in any radix, it might be useful
    to add a property "possible positional digit for any radix up to 36" for
    the normal ASCII digits and the uppercase and lowercase characters of
    the normal twenty-six letters in the ASCII character set.
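
    For what it's worth, many programming languages already treat the
    ASCII digits and letters this way as an application convention;
    Python's int() is a convenient sketch of the idea:

        print(int("ff", 16))   # 255
        print(int("z", 36))    # 35 -- "z" as the highest base-36 digit
        print(int("10", 36))   # 36
        # This is language behavior, not a Unicode character property.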

    But is this generally useful enough to warrant it being part of the
    Unicode specification?

    And is that not also culturally biased? But then all scripts are to
    some degree bound to a particular culture or to particular cultures.

    Jim Allan


