L2/10-412
PROBLEM REPORT: MathClass-11/12.txt

Source: Asmus Freytag
Date: 2010-October-15, revised Oct. 17

I've been trying to read the MathClass-11.txt file, which is associated with UTR#25, with my standard parsers for UCD data files. In the process, I've come across a few issues. (I've since reviewed the draft file MathClass-12.txt; the following list contains only issues that are still outstanding.)

1) A comment near line 34 states:

"The code point field is not unique - where ISO entity sets overlap;
# the same code point may occur several times with different mappings"

This file has no ISO mappings (unlike the MathClassEX-11/12.txt file). Therefore there is no rationale for allowing non-unique code points. Some parsers check that enumerations have unique assignments; this allows error detection (see issue 6, where 20D4 is duplicated).

Suggestion: remove the caveat; also ensure code points are, in fact, unique.
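
A minimal sketch of such a uniqueness check, assuming the file uses semicolon-separated fields and "#" comments like other UCD data files. The function names and the sample lines are hypothetical, and the class values in the sample are placeholders rather than data copied from the file.

```python
from collections import Counter


def parse_code_points(field):
    """Expand 'XXXX' or 'XXXX..YYYY' into a list of integer code points."""
    first, _, last = field.partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return list(range(start, end + 1))


def find_duplicates(lines):
    """Return the sorted code points that are assigned on more than one line."""
    counts = Counter()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        field = data.split(";", 1)[0].strip()  # first field: code point or range
        counts.update(parse_code_points(field))
    return sorted(cp for cp, n in counts.items() if n > 1)


# Hypothetical excerpt; the class values are placeholders.
sample = [
    "# comment line",
    "20D2;D",
    "20D4;D",
    "20D4;D",
    "2B00..2B11;R",
]
duplicates = find_duplicates(sample)
print([f"{cp:04X}" for cp in duplicates])  # prints ['20D4']
```

Once the caveat is removed from the file, a check of this kind can treat any duplicate as an error instead of having to tolerate it.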

2) The code point 21EA is present, but the math class value is missing. This character is an arrow from keyboard layouts.

Suggestion: this character should probably be removed from the file.

3) The range 2B00..2B11 uses R? as the property value. This is undefined.

Suggestion: remove the ?
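
Issues 2 and 3 are both cases that a strict parser rejects. The following is a sketch only, assuming semicolon-separated fields and assuming that the set of class letters is the one defined in UTR#25; the set is passed as a parameter so it can be adjusted. The function name and sample lines are hypothetical.

```python
# Assumption: the class letters defined in UTR#25; change this set if the
# report defines a different list.
ASSUMED_CLASSES = frozenset("NABCDFGLOPRSUVX")


def check_values(lines, valid=ASSUMED_CLASSES):
    """Return (line_number, code_point_field, value) for each bad entry."""
    problems = []
    for number, line in enumerate(lines, start=1):
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        value = fields[1] if len(fields) > 1 else ""
        if value not in valid:  # catches both missing and undefined values
            problems.append((number, fields[0], value))
    return problems


# Hypothetical rendering of the two problem entries plus one placeholder line.
sample = [
    "21EA;",          # value missing (issue 2)
    "2B00..2B11;R?",  # undefined value (issue 3)
    "2B12;N",         # placeholder for a well-formed line
]
for number, field, value in check_values(sample):
    print(f"line {number}: {field} has invalid class {value!r}")
```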

4) The naming convention for the files is problematic. Currently the version # (or revision #) is part of the file name. This is a scheme that was abandoned for the UCD because it's unwieldy.

Suggestion: switch to the same practice as the UCD, which is to only number the folder, not the file.

5) UTR#25 states: "The data file specifies the version of [Unicode] to which it has been updated". However, such a version specification is missing in the file.

Suggestion: prominently add this line:
                    # Updated to reflect character repertoire of Unicode X.Y

6) UTR#25 states: "All characters that have the Math property are covered by this classification." However, a quick comparison of the math property with the set of characters classified uncovers several discrepancies.

The following list correlates the characters that have the "math" property in the UCD for 6.0 with the characters that have an entry in the MathClass file. It consists of those that have the math property but no math class.

0606..0608     ARABIC-INDIC CUBE ROOT..ARABIC RAY
20D5     COMBINING CLOCKWISE ARROW ABOVE
237C     RIGHT ANGLE WITH DOWNWARDS ZIGZAG ARROW
25B0     BLACK PARALLELOGRAM
25D0..25D3     CIRCLE WITH LEFT HALF BLACK..UPPER HALF BLACK
25E7..25EA     SQUARE WITH LEFT HALF BLACK..LOWER RIGHT DIAGONAL HALF BLACK
27CA     VERTICAL BAR WITH HORIZONTAL STROKE
27CC     LONG DIVISION
2B30..2B44     LEFT ARROW WITH SMALL CIRCLE..RIGHTWARDS ARROW THROUGH SUPERSET
2B47..2B4C     REVERSE TILDE OPERATOR ABOVE RIGHTWARDS ARROW..RIGHTWARDS ARROW ABOVE REVERSE TILDE OPERATOR
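
A comparison of this kind can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used to produce the list above: it assumes the math property is read from lines in the "range ; property" layout used by UCD property files, and that MathClass entries are semicolon-separated. All function names and the sample lines are hypothetical.

```python
def expand(field):
    """Expand 'XXXX' or 'XXXX..YYYY' into a range of integer code points."""
    first, _, last = field.partition("..")
    start = int(first, 16)
    return range(start, (int(last, 16) if last else start) + 1)


def code_points(lines, wanted=None):
    """Collect code points from 'range ; value' lines.

    If `wanted` is given, keep only lines whose second field equals it.
    """
    result = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if wanted is not None and (len(fields) < 2 or fields[1] != wanted):
            continue
        result.update(expand(fields[0]))
    return result


def to_ranges(cps):
    """Collapse a set of code points into sorted (start, end) runs."""
    runs = []
    for cp in sorted(cps):
        if runs and cp == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], cp)  # extend the current run
        else:
            runs.append((cp, cp))
    return runs


# Illustrative excerpts only, not the full data files.
ucd_sample = [
    "0606..0608    ; Math",
    "20D4..20D5    ; Math",
    "0041..005A    ; Alphabetic",
]
mathclass_sample = ["20D4;D"]

missing = code_points(ucd_sample, wanted="Math") - code_points(mathclass_sample)
for a, b in to_ranges(missing):
    print(f"{a:04X}" if a == b else f"{a:04X}..{b:04X}")
```

Running the same set difference in the other direction would show characters that have a math class but lack the math property.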

A few notes on the difference:

a) the Arabic operators are not given a math class

b) why is 20D5 different from 20D4?

c) why are these particular characters in the 2500 block treated differently?

d) the 27C0 block is ostensibly about math, but MathClass doesn't treat 2 characters in it.

e) what's the rationale for when left pointing arrows are excluded from MathClass?
    Some of these have specific mirror-image shapes.

Suggestion:

A) if there are characters that are truly mathematical, but the use of which is limited to RTL math display, then why not introduce a new class for them in Math Class? The letter "M" for "Mirrored - used only with RTL math layout where operators are mirrored" would seem to still be available.

B) if you adopt suggestion A, then the remaining characters are few, essentially 237C and those in the 2500 block.

For the 2500 block, most of these characters would appear to be "N". If they are emphatically not mathematical, they could instead become members of a new class, "math property but not properly classified as mathematical character". This might also apply to 237C. Such a move would rationalize the difference without changing the implicit meaning of any existing math class.

If the 2500 block characters were to fit the N class, then UTC could be petitioned to drop 237C from the math property, if the latter character really isn't mathematical.

C) Resolve the apparent oversight for 20D5 (in the file, 20D4 is duplicated, so this looks like a typo).

7) It is unclear which of these issues also apply to the "Ex" version of the file. I have not correlated the two file versions.

Suggestion: make sure these two files agree after modifications.