Re: Unihan number types and values

From: Kenneth Whistler (
Date: Mon Nov 29 2010 - 16:24:30 CST

    Marc-Andre Lemburg asked:

    > Question: Why don't these code points have the "Nd" category ?

    Because the General_Category=Nd value (and Numeric_type=Decimal)
    is explicitly limited to ordinary decimal digits that are used in
    decimal radix expressions *and* which are encoded in a contiguous
    sequence 0..9. See the character encoding stability policies
    for the recent expression of this constraint:

    The Han numeric ideographs fail the latter test. And it
    would be inadvisable to process them as gc=Nd anyway, because
    they are quite often used in traditional numbering in
    East Asia, which does not use decimal radix forms. Handling
    Han numeric ideographs requires special processing to
    parse numeric values correctly.
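
    As an illustration of the kind of special processing meant here, the
    following is a minimal sketch, not a complete or authoritative parser.
    The name parse_traditional and its lookup tables are hypothetical and
    come from no Unicode data file or library. It assumes the common
    multiplicative-additive notation, in which 三百二十五 means
    3 x 100 + 2 x 10 + 5, and it handles only a few multipliers.

```python
# Hypothetical helper: a sketch of multiplicative-additive parsing,
# not an API from Unicode or from any standard library.
DIGITS = {"〇": 0, "零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8}  # assumed subset of the large multipliers


def parse_traditional(text: str) -> int:
    """Parse a traditional (non-positional) Han numeral such as 三百二十五."""
    total = 0    # groups already closed by a large multiplier (万, 億)
    section = 0  # value accumulated inside the current group (below 10**4)
    digit = 0    # digit still waiting for its multiplier
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare multiplier, as in 十二, implies a leading "one".
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            section += digit
            total += (section or 1) * LARGE_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"not a supported Han numeral character: {ch!r}")
    return total + section + digit


if __name__ == "__main__":
    for sample in ("三百二十五", "十二", "二万三千", "一千零五"):
        print(sample, parse_traditional(sample))
```

    Treating each ideograph as an independent decimal digit would misread
    all of these samples, which is why a per-character gc=Nd
    interpretation is not sufficient.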

    > Related to this, it is also unclear what to use as official zero
    > for these number systems (U+3007 is often recommended).

    In addition to John Jenkins' clarification, I would point out
    that when Han ideographs *are* used in decimal radix
    expressions, the usual choice for a zero *digit* is U+3007.
    U+96F6 expresses the *concept* of zero. In other words,
    it is more akin to "zero" than to "0", and would seldom
    be seen used in numerical expressions.
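
    A minimal sketch of the decimal radix usage described above, with
    U+3007 as the zero digit. The names POSITIONAL_DIGITS,
    parse_positional and format_positional are hypothetical and only
    illustrate the idea. U+96F6 is deliberately left out of the digit
    table, on the assumption that it is not used as a positional digit.

```python
# Hypothetical helpers for Han ideographs used as positional decimal digits.
# The index of each character in this string is its digit value; U+3007 is zero.
POSITIONAL_DIGITS = "〇一二三四五六七八九"


def parse_positional(text: str) -> int:
    """Read a string such as 二〇一〇 as an ordinary base-10 number."""
    value = 0
    for ch in text:
        digit = POSITIONAL_DIGITS.find(ch)
        if digit < 0:
            raise ValueError(f"not a positional Han digit: {ch!r}")
        value = value * 10 + digit
    return value


def format_positional(number: int) -> str:
    """Write a non-negative integer using Han digits, with U+3007 for zero."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    return "".join(POSITIONAL_DIGITS[int(d)] for d in str(number))


if __name__ == "__main__":
    print(parse_positional("二〇一〇"))
    print(format_positional(2010))
```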

    A postscript about the Numeric_Value and Numeric_Type properties:
    Both are derived using values from UnicodeData.txt together with
    the numeric tags from the Unihan Database. They are not "simple
    properties" in the sense of definition D45 in Section 3.5, Properties,
    of the Unicode Standard. See the end of Section 5.4, Derived
    Extracted Properties, in UAX #44 for the best current statement
    of how they are actually derived.
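
    One way to observe the combined derivation, sketched with Python's
    standard unicodedata module. The describe helper is hypothetical.
    The results depend on the Unicode version bundled with the
    interpreter, and the sketch assumes that version includes the
    Unihan-derived numeric values.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect the category and numeric lookups for one character.

    A default of None is passed to each numeric lookup so that a missing
    value is reported as None instead of raising ValueError.
    """
    return {
        "category": unicodedata.category(ch),
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }


if __name__ == "__main__":
    # U+0035 DIGIT FIVE, U+4E94 (Han "five"), U+3007 IDEOGRAPHIC NUMBER ZERO
    for ch in ("5", "\u4e94", "\u3007"):
        print(f"U+{ord(ch):04X}", describe(ch))
```

    For U+0035 all three numeric lookups return a value. For the Han
    characters only the numeric lookup does, consistent with their not
    being gc=Nd.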


    This archive was generated by hypermail 2.1.5 : Mon Nov 29 2010 - 16:25:58 CST