L2/03-094

Re: Han Properties
From: Mark Davis, Ken Whistler
Date: 2003-03-03

When it came time to implement the following:

[92-A65] Action Item for Mark Davis, John Jenkins: For version 4, add the Numerical, Radical/Stroke, & Source Reference to PropertyAliases.txt and PropertyValueAliases.txt as normative properties, and create a "provisional" property and state that all other Unihan tags are provisional. [L2/02-267R3]

we ran up against a few issues. Our recommendations are:

1. Simply incorporate the numeric values from the Unihan file without adding a new property, giving these characters the numeric type nt=Numeric. Most people using APIs that provide this information simply need the numeric values, and care only that they are non-decimal. This makes the data easier to incorporate, and the values are then just present in the extracted files: DerivedNumericType and DerivedNumericValue. (Many people will not bother to parse the Unihan.txt file if this is all the information they need from it; they will just pick up the extracted files.)

2. Ken is concerned that numeric values for two characters (5793 gai1 and 4EAC jing1) in Unihan.txt are problematic, and should be removed (see below).

3. There are issues with the Source Standards. This is really not the kind of property that we should be surfacing. Although this information is crucial to the standards development work, it is not a property anyone would use. It is even misleading, since the source standards are 'idealized', and don't represent the actual mapping values that people would necessarily use in conversion tables. So we suggest dropping them.


Addendum from Ken

A. Han Numeric property issues.

The simplest way (Option A) to comply with the UTC consensus is to create a
property for each of the labelled fields in Unihan.txt that we
are attempting to promote to normative status:

kPrimaryNumeric    ==>  Han_Primary_Numeric  (value: numeric)
kAccountingNumeric ==>  Han_Accounting_Numeric  (value: numeric)
kOtherNumeric      ==>  Han_Other_Numeric  (value: numeric)

Option B is to notice that these designations are in complementary
distribution, and then to create an enumerated property for the type,
and a numeric property for the value:

Han_Numeric_Type  (value: enumerated [Han_Primary, Han_Accounting, Han_Other])
Han_Numeric_Value (value: numeric)
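
Options A and B carry the same information only if no character has
more than one of the three tags. A minimal Python sketch of folding
the Option A fields into the Option B pair, with a check for that
assumption, might look like the following. The sample rows, the
TAG_TO_TYPE table and the function name option_b are illustrative
assumptions, not part of any proposal.

```python
# Illustrative rows in the tab-separated Unihan.txt layout
# (code point, tag, value); not a full extract of the file.
SAMPLE = (
    "U+4E00\tkPrimaryNumeric\t1\n"
    "U+58F9\tkAccountingNumeric\t1\n"
    "U+5344\tkOtherNumeric\t20\n"
)

# Assumed mapping from Unihan tag to the Option B enumeration value.
TAG_TO_TYPE = {
    "kPrimaryNumeric": "Han_Primary",
    "kAccountingNumeric": "Han_Accounting",
    "kOtherNumeric": "Han_Other",
}


def option_b(lines):
    """Fold the three Option A fields into {code point: (type, value)}."""
    table = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        code, tag, value = line.split("\t", 2)
        if tag not in TAG_TO_TYPE:
            continue  # some other Unihan tag
        cp = int(code[2:], 16)  # strip the "U+" prefix
        if cp in table:
            # Complementary distribution would be violated.
            raise ValueError(f"{code} has more than one numeric tag")
        table[cp] = (TAG_TO_TYPE[tag], int(value))
    return table


han_numeric = option_b(SAMPLE.splitlines())
```

If a character ever did carry two of the tags, the sketch raises
rather than silently picking one, since Option B has room for only
one type per character.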

Option C is to notice that these designations are in complementary
distribution with existing numeric types for non-Han characters, and then
to merge them into the existing property by extending the enumeration:

Numeric_Type (value: enumerated [Decimal, Digit, Numeric, None,
Han_Primary, Han_Accounting, Han_Other])
Numeric_Value     (value: numeric)

In some regards, I think Option A is
the simplest and least riddled with complications and side effects.
Option C is the most problematical, since it changes the enumeration
of an existing type. It also mixes apples and oranges, since it
enumerates distinctions relevant to Han characters on the same
virtual axis as distinctions which were originally put in place
to try to distinguish digits from non-digits.
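
The side effect of Option C can be illustrated with a hypothetical
consumer written against the current four-value enumeration. The
function is_digit_like and its strict handling of unknown values are
assumptions made for illustration; real APIs will differ.

```python
# The Numeric_Type values an existing implementation knows about.
KNOWN_TYPES = {"Decimal", "Digit", "Numeric", "None"}


def is_digit_like(numeric_type):
    """Hypothetical consumer: separates digits from non-digits."""
    if numeric_type not in KNOWN_TYPES:
        raise ValueError(f"unexpected Numeric_Type: {numeric_type}")
    return numeric_type in {"Decimal", "Digit"}


# Under Option C the same property would start returning new values,
# which code written against the old enumeration cannot classify.
for nt in ("Decimal", "Numeric", "Han_Primary"):
    try:
        print(nt, is_digit_like(nt))
    except ValueError as err:
        print(nt, "->", err)
```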

This leads to yet another possible solution, Option D. Designate
all Han numerics as having nt=Numeric, and then create a property
type for the complementary distribution of subtypes relevant only
to Han:

Numeric_Type (value: enumerated [Decimal, Digit, Numeric, None])
    (unchanged, but now specify the list of Han characters with
     nt=Numeric)
Numeric_Value    (value: numeric)   (unchanged)
Han_Numeric_Subtype (value: enumerated [Han_Primary, Han_Accounting, Han_Other])

In my opinion, *that* approach would be the easiest for people with
existing APIs to accommodate.

Even simpler would be to omit the Han_Numeric_Subtype definition
as well, since it can be derived from the Unihan.txt tags,
and is not central to the problem of providing numeric values
for Han characters.
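
Option D can be sketched as a merge step that leaves the existing
enumeration alone. This is only a sketch under stated assumptions:
the table layouts, the sample rows and the name apply_option_d are
hypothetical, and the with_subtype flag stands in for the choice of
whether to define Han_Numeric_Subtype at all.

```python
# A few existing Numeric_Type / Numeric_Value assignments.
EXISTING = {
    0x0031: ("Decimal", 1),   # DIGIT ONE
    0x00B2: ("Digit", 2),     # SUPERSCRIPT TWO
    0x3039: ("Numeric", 20),  # HANGZHOU NUMERAL TWENTY
}

# Illustrative Unihan data: (code point, tag, value).
HAN_ROWS = [
    (0x4E00, "kPrimaryNumeric", 1),
    (0x58F9, "kAccountingNumeric", 1),
    (0x5344, "kOtherNumeric", 20),
]

# Assumed mapping from Unihan tag to Han_Numeric_Subtype value.
SUBTYPE = {
    "kPrimaryNumeric": "Han_Primary",
    "kAccountingNumeric": "Han_Accounting",
    "kOtherNumeric": "Han_Other",
}


def apply_option_d(existing, han_rows, with_subtype=True):
    """Return (numeric, subtype) tables after adding the Han numerics."""
    numeric = dict(existing)  # the Numeric_Type enumeration is untouched
    subtype = {}
    for cp, tag, value in han_rows:
        numeric[cp] = ("Numeric", value)  # every Han numeric is nt=Numeric
        if with_subtype:
            subtype[cp] = SUBTYPE[tag]
    return numeric, subtype


numeric, subtype = apply_option_d(EXISTING, HAN_ROWS)
```

Calling apply_option_d with with_subtype=False corresponds to the
simpler variant, where the subtype is left to be derived from the
Unihan.txt tags by whoever needs it.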

Another fly in the ointment is that some Han characters that we
encoded as (Nl) symbols, rather than as unified or compatibility CJK
characters, themselves have Numeric values: 3007, 3021..3029, 3038..303A.
E.g.:

3039;HANGZHOU NUMERAL TWENTY;Nl;0;L;<compat> 5344;;;20;N;;;;;

This shows gc=Nl and nt=Numeric and nv=20 for this character.
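
For readers who want to check that reading, here is a small Python
sketch that pulls the three properties out of the line quoted above.
It assumes the usual semicolon-separated UnicodeData.txt layout, with
the general category in field 2 and the decimal, digit and numeric
values in fields 6, 7 and 8; the function name numeric_info is
hypothetical.

```python
from fractions import Fraction

LINE = "3039;HANGZHOU NUMERAL TWENTY;Nl;0;L;<compat> 5344;;;20;N;;;;;"


def numeric_info(line):
    """Extract gc, nt and nv from one UnicodeData.txt record."""
    fields = line.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal:
        nt = "Decimal"
    elif digit:
        nt = "Digit"
    elif numeric:
        nt = "Numeric"
    else:
        nt = "None"
    return {
        "cp": int(fields[0], 16),
        "gc": fields[2],
        "nt": nt,
        # Fraction also copes with values written like "1/2".
        "nv": Fraction(numeric) if numeric else None,
    }


info = numeric_info(LINE)
print(info["gc"], info["nt"], info["nv"])
```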

The Hangzhou-style numerals also overlap with those few odds and ends
which get the kOtherNumeric tag. Any solution which deals with the
Han numeric values should account for these characters.

There is an additional problem posed by two particular
kPrimaryNumeric characters in Unihan.txt: 4EAC jing1 and 5793 gai1.

This is a problem because the two characters in question are bizarre
in the first place, and should *not* be given normative numeric
values as they currently are. Quoting myself:

> kPrimaryNumeric adds two values which
> are *not* in Table 4-3 in the book -- and probably not
> there for good reason. Those are 5793 gai1 and 4EAC jing1.
> Unihan claims 4EAC is 10 quadrillion, i.e. (10000)^4 and
> that 5793 is 100 quintillion, i.e. (10000)^5. I've checked
> two dictionaries. Both claim that 5793 gai1 means 100 million.
> One doesn't list a numeric value at all for 4EAC jing1, which
> is a common character meaning 'capital', but the other lists
> it also as an "ancient numeral" meaning 10 million.
> A third, classical dictionary (Cihai) says of 4EAC jing1:
> "Name of a number. 10 zhao4 [5146] constitute a jing1,
> there are also those who aver that 10,000 zhao4 constitute
> a jing1." So by that reckoning, it could either be
> 10 trillion or 10 quadrillion. That same classic dictionary
> cites a source for gai1 which claims that "10 man4 is called
> yi4, 10 yi4 is called zhao4, 10 zhao4 is called jing1,
> 10 jing1 is called gai1" (Incidentally that jing1 is
> 7D93, *not* 4EAC, although the commentator says it is
> meant for the same thing.) And then the commentator says
> "there are also those who aver that 10,000 jing1 constitute
> a gai1". Clearly nobody *really* knows what the heck these
> numbers referred to. They probably started out as fantasy
> concepts, equivalent to bezillion and gazillion, respectively.
> One of the alternate meanings of jing1 'capital' is just 'big'.
> The claims that jing1 means 10 quadrillion and
> gai1 means 100 quintillion are just that -- rationalizations
> by later commentators using the rank by 10,000's concept
> of man4 (10,000), yi4 (100,000,000) and zhao4 (1,000,000,000,000).
>
> *NOBODY* uses these two characters as numbers in China.
> It would be a disservice to our implementers and to other
> users of the standard to take these fantastic commentaries
> on imagined big numbers and reify them in APIs that have
> to spit out 10^16 and 10^20 as numeric values.
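
The competing readings quoted above are easy to work through. The
sketch below simply does the arithmetic for the two ranking schemes
mentioned in the quote; all variable names are illustrative. Ranking
by 10 gives 10 million for jing1 and 100 million for gai1, which are
the dictionary values, while ranking by 10,000 gives the Unihan
values. The final line is an added observation: 10^20 does not fit in
an unsigned 64-bit integer.

```python
MAN = 10**4  # man4

# Reading 1: each unit is 10 times the previous one
# ("10 man4 is called yi4, 10 yi4 is called zhao4, ...").
yi_by_10 = 10 * MAN
zhao_by_10 = 10 * yi_by_10
jing_by_10 = 10 * zhao_by_10
gai_by_10 = 10 * jing_by_10

# Reading 2: each unit is 10,000 times the previous one.
yi_by_myriad = MAN**2
zhao_by_myriad = MAN**3
jing_by_myriad = MAN**4
gai_by_myriad = MAN**5

# Cihai's first gloss for jing1: 10 zhao4, taking zhao4 as 10**12.
jing_cihai = 10 * zhao_by_myriad

print(f"rank by 10:     jing1 = {jing_by_10:,}, gai1 = {gai_by_10:,}")
print(f"rank by 10,000: jing1 = {jing_by_myriad:,}, gai1 = {gai_by_myriad:,}")
print(f"10 zhao4:       jing1 = {jing_cihai:,}")

# Whether each Unihan value fits in an unsigned 64-bit integer.
U64_MAX = 2**64 - 1
print(jing_by_myriad <= U64_MAX, gai_by_myriad <= U64_MAX)
```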

One possible solution is a new Unihan tag. ;-)

U+4EAC kFantasyNumeric bezillion (uncertain ancient large numeric
quantity, variously annotated as equal to 10 million or
to 10 quadrillion)
    
U+5793 kFantasyNumeric gazillion (etc....)

B. IRG_Source tag issues

> >       kIRG_GSource
> >       kIRG_HSource
> >       kIRG_JSource
> >       kIRG_KSource
> >       kIRG_KPSource
> >       kIRG_TSource
> >       kIRG_VSource

We would need the U Source as well, wouldn't we? Or is this not normative
except for compatibility CJK characters?

And there is deceptive complexity hiding here as well. Once again,
the simplest approach would be to just turn each of these tags
into a distinct property, and then use their current string values
as their values.

But then collectively, they define, by implication, an enumerated
IRG_Source type property, presumably enumerated as
[G, H, J, K, KP, T, V, U? ].

But wait, if I have a data entry:

U+2F9F6 kIRG_TSource 5-5F5E

That corresponds to T Source: "T5 CNS 11643-1992, plane 5"
And there is another implied enumerated type property for the
T Source: [T1, T2, T3, T4, T5, T6, T7, TF]. And so on for the
other sources. That, in fact, is the actual structure of the
listing in 10646 Clause 27.1 (normative) of the CJK Unified
Ideograph sources. (Note also that in my example above, we are
talking about a *compatibility* CJK character -- the compatibility
CJK characters are also now all listed in Unihan.txt, and that
tosses a further curve at us regarding what the status of the
IRG_Source tag field meanings is here for the unified characters
as opposed to the compatibility characters.)
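
The two implied levels of enumeration can be made concrete with a
short Python sketch that splits a tag and its value into source,
sub-source and code. It assumes the value layout seen in the
kIRG_TSource example above (a sub-source prefix, a hyphen, then the
code); other sources may not follow that layout, and the function
name parse_irg_source is hypothetical.

```python
def parse_irg_source(tag, value):
    """Split e.g. ("kIRG_TSource", "5-5F5E") into its implied parts."""
    prefix, suffix = "kIRG_", "Source"
    if not (tag.startswith(prefix) and tag.endswith(suffix)):
        raise ValueError(f"not an IRG source tag: {tag}")
    source = tag[len(prefix):-len(suffix)]  # "T", "KP", ...
    sub, sep, code = value.partition("-")
    if not sep:
        # No hyphen: treat the whole value as the code (assumption).
        return {"source": source, "subsource": None, "code": value}
    return {"source": source, "subsource": source + sub, "code": code}


entry = parse_irg_source("kIRG_TSource", "5-5F5E")
print(entry)
```

Keeping the raw string as the property value, as in the simplest
approach, avoids committing to this decomposition; the sketch only
shows what an enumerated treatment would have to pin down.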

Once again, as with the Han numeric properties, I don't think this
property (or set of properties) is ready for prime time until we work
through all the implications with a detailed proposal.