*Date: October 31, 2010
Source: Asmus Freytag*

This document contains feedback on the UTR#49 draft and the character classification project that it describes.

The UTR draft describes a four-level, hierarchical (tree-structured) classification of characters. As stated in the UTR, it is intended to address the weaknesses of the primary classification mechanism for characters, the General Category.

The proposed hierarchical approach would be expected to work reasonably well with Letters, but will run into its first problem with punctuation and symbols.

The problem inherent in any single classification scheme is that it can have only a single focus. Thus it is unable to accommodate different view-points that arise from ambiguous use of characters. What has been sketched in UTR#49 and the associated draft data file for the letters is essentially a functional classification (leaving aside whether the focus is on function in terms of the content of written speech, or on the function within the writing system - including typographic aspects). For letters it may even be possible to classify based on different functional aspects on different levels of the proposed hierarchy, for example, VOWEL relates to the function of the element in the language, and DEPENDENT relates to its function in the typography of the writing system.

For punctuation, this kind of scheme breaks down badly, because the re-use of the same entity for entirely unrelated purposes is rampant. For example, U+002E is at once sentence-ending punctuation, word-terminal punctuation (abbreviation), a number separator with locale-dependent meaning, and part of several multiple-dot sequences of an occasional ad-hoc nature (ellipsis spelled out, "range" notation, etc.). In the draft classification, this issue is circumvented by not providing any finer classification than "Punctuation" for U+002E:

002E Po [Punctuation] [X] [X] [X] FULL STOP

This listing follows the format of the draft data file, which is explained in the draft. The second field is the existing classification by General Category, followed by the four levels of the proposed classification.
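As a rough illustration of that record layout, the following sketch parses one listing line into its fields. The layout is inferred only from the listings quoted in this document (hex code point, General Category, four bracketed labels, character name); the actual draft data file may use different separators, and the names `Record` and `parse_line` are made up for this example.

```python
import re
from typing import NamedTuple


class Record(NamedTuple):
    code_point: int
    general_category: str
    levels: tuple  # four proposed levels; "X" means no finer classification
    name: str


# Assumed layout, inferred from the listings quoted in this document.
LINE = re.compile(
    r"^([0-9A-F]{4,6})\s+(\w{2})\s+"
    r"\[([^\]]*)\]\s+\[([^\]]*)\]\s+\[([^\]]*)\]\s+\[([^\]]*)\]\s+(.+)$"
)


def parse_line(line: str) -> Record:
    """Split one listing line into code point, category, levels and name."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    cp, gc, l1, l2, l3, l4, name = match.groups()
    return Record(int(cp, 16), gc, (l1, l2, l3, l4), name)


record = parse_line("002E Po [Punctuation] [X] [X] [X] FULL STOP")
print(record)
```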

For Symbols the problem gets worse. Here there exists an inherent tug-of-war between a desire to classify symbols visually (by characteristics of their appearance) on the one hand and functionally (by their typical use) on the other hand. The following illustrates the problem:

003C Sm [Symbol] [Math] [Relation] [X] LESS-THAN SIGN
003D Sm [Symbol] [Math] [Relation] [X] EQUALS SIGN
003E Sm [Symbol] [Math] [Relation] [X] GREATER-THAN SIGN

Here, the new classification at first appears to add valuable input over the General Category by classifying the Sm characters by their function in mathematical notation (Relation).

However, a large class of mathematical operators for relations are expressed
symbolically by the use of **arrows**, and these are all classified visually,
as are the following examples:

21D0 So [Symbol] [Arrow] [Double] [X] LEFTWARDS DOUBLE ARROW
21D2 Sm [Symbol] [Arrow] [Double] [X] RIGHTWARDS DOUBLE ARROW

The latter set of examples happens to erase a distinction between Sm and So (which in this case was probably a spurious distinction anyway). But as the two sets of examples show, nothing really useful has been gained by the new classification. Like the Sm classification before it, [Symbol][Math] doesn't capture *all* the math symbols, only those that are "not much used for something else". [Relation] doesn't help me much either: because it leaves out all the arrows, it cannot be used to identify all relations.
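The gap can be made concrete with a small sketch. It stores the five draft classifications quoted above as tree paths and selects by path prefix; the table and the helper `select` are illustrative names, not part of the draft.

```python
# Draft classification paths quoted above: (level 1, level 2, level 3, level 4).
DRAFT = {
    0x003C: ("Symbol", "Math", "Relation", "X"),
    0x003D: ("Symbol", "Math", "Relation", "X"),
    0x003E: ("Symbol", "Math", "Relation", "X"),
    0x21D0: ("Symbol", "Arrow", "Double", "X"),
    0x21D2: ("Symbol", "Arrow", "Double", "X"),
}


def select(table, *prefix):
    """Return code points whose classification path starts with `prefix`."""
    return sorted(cp for cp, path in table.items() if path[:len(prefix)] == prefix)


relations = select(DRAFT, "Symbol", "Math", "Relation")
print([f"{cp:04X}" for cp in relations])  # ['003C', '003D', '003E']
```

Because each character occupies exactly one path in the tree, the double arrows cannot show up under [Relation] once they have been filed under [Arrow].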

This kind of issue is not limited to mathematical symbols. Many technical symbols have been (or could be) unified with geometrical symbols. There is a group of well-known user interface symbols for the controls on audio and video devices. The medium black circle has been unified with the UI symbol for Record, and Stop can be represented by the medium black square, although that is not called out in the names list.

23E9 BLACK RIGHT-POINTING DOUBLE TRIANGLE = fast forward
23EA BLACK LEFT-POINTING DOUBLE TRIANGLE = fast rewind

25FC BLACK MEDIUM SQUARE x (black square - 25A0)

26AB MEDIUM BLACK CIRCLE * UI symbol for record function

Comparing the proposed classification for these gives some additional detail
over the general category, and correctly identifies 26AB as "geometrical" even
though it had been coded in the Miscellaneous Symbols block for expediency.
However, only the symbols for *fast forward* and *fast rewind* are
identified as interface symbols, and only the circle is accorded a
subclassification by shape family in the geometrical shapes - the square is not.

23E9 So [Symbol] [Technical] [Interface] [X] BLACK RIGHT-POINTING DOUBLE TRIANGLE
23EA So [Symbol] [Technical] [Interface] [X] BLACK LEFT-POINTING DOUBLE TRIANGLE

25FC Sm [Symbol] [Geometric] [X] [X] BLACK MEDIUM SQUARE
26AB So [Symbol] [Geometric] [Circle] [X] MEDIUM BLACK CIRCLE

As can be seen from this set of examples, [Technical], like [Math] in the earlier examples, is conceived of in the draft classification data as a "leftover" category: only symbols that fit none of the other categories (such as [Geometric]) are classified as [Technical] or [Math], even though many, many more symbols have strong mathematical or technical use.

For symbols, it would be more useful to provide a set of overlapping classifications that can exist in **parallel** for the same symbol. One classification should be based on appearance. Such a
classification has a chance of allowing a true (hierarchical) partition of the
symbol space (and a similar scheme could handle punctuation as well). It would have
another advantage, in that it would be immediately useful as a means for grouping characters for "pick
lists". At the innermost level, as a "tie breaker" among confusable symbols,
some usage classification might still be useful in an otherwise visual scheme.

Separate classification systems, such as that developed in mathclass.txt (see http://www.unicode.org/Public/math/), should then be used to classify symbols once more by usage. However, this time, more than one primary usage should be allowed. Arrows used in math would get one classification ([Relation]), while arrows used as part of technical symbols would get another ([Chemical reaction], etc.). Unlike the visual classification, such classifications cannot be carried out across the entire symbol space. Therefore, they might as well continue to reside in parallel data files (such as mathclass), although we could discuss whether a consistent presentation of such categorization efforts is useful, for example making some implicit categories explicit (e.g. "Symbol" "math" for most of mathclass, except digits etc.).
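A minimal sketch of such a parallel usage table follows. The assignments are hypothetical and chosen only to illustrate the idea; they are not taken from mathclass.txt or from the draft, and the table deliberately covers only a few symbols, since usage classes need not span the whole symbol space.

```python
from collections import defaultdict

# Hypothetical usage classes, kept apart from any visual classification.
# A symbol may carry more than one primary usage.
USAGE = {
    0x003D: {"Relation"},                       # EQUALS SIGN
    0x2192: {"Relation", "Chemical reaction"},  # RIGHTWARDS ARROW
    0x21D2: {"Relation"},                       # RIGHTWARDS DOUBLE ARROW
}


def by_usage(usage_table):
    """Invert the usage table: usage label -> sorted list of code points."""
    index = defaultdict(list)
    for cp in sorted(usage_table):
        for label in usage_table[cp]:
            index[label].append(cp)
    return dict(index)


index = by_usage(USAGE)
print([f"{cp:04X}" for cp in index["Relation"]])  # ['003D', '2192', '21D2']
```

With usage stored this way, a query for relations picks up the arrows regardless of how they are filed by appearance.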

Concluding from the problems presented so far, I disagree with the statement in the draft UTR that the newly proposed scheme fundamentally improves on the General Category. I think that where the General Category has the category "other", some of the further classifications suggested are going in the right direction. In the following example one might add "Straight" or some other inner-level appearance classifier.

21D0 So [Symbol] [Arrow] [Double] [Straight] LEFTWARDS DOUBLE ARROW

However, to make that scheme useful, the second-level classification has to be a true partition. All symbols based on arrows need to be of class [Arrow], which allows parallel categories like [Snowflake] or [Astrological Symbol] but not [Math] or [Technical]. Instead of [Technical] one might need to dig deeper and distinguish between symbols that are "pictorial", "schematic" or, yes, "symbolic". Those that are currently [Math] could be classified by related shape, e.g. all the "tacks" make a series, as do all the "integrals", etc.

The following two sets of examples show instances of symbols that are represented by a variation of the [Corner] shape, which is a sub-class of [Line]. Here is the way these symbols are classified in the proposed scheme today:

221F Sm [Symbol] [Math] [X] [X] RIGHT ANGLE

2308 Sm [Symbol] [Math] [X] [X] LEFT CEILING
2309 Sm [Symbol] [Math] [X] [X] RIGHT CEILING
230A Sm [Symbol] [Math] [X] [X] LEFT FLOOR
230B Sm [Symbol] [Math] [X] [X] RIGHT FLOOR

231C So [Symbol] [Technical] [Quine corner] [X] TOP LEFT CORNER
231D So [Symbol] [Technical] [Quine corner] [X] TOP RIGHT CORNER
231E So [Symbol] [Technical] [Quine corner] [X] BOTTOM LEFT CORNER
231F So [Symbol] [Technical] [Quine corner] [X] BOTTOM RIGHT CORNER
250C So [Symbol] [Graphic] [Form] [X] BOX DRAWINGS LIGHT DOWN AND RIGHT
2510 So [Symbol] [Graphic] [Form] [X] BOX DRAWINGS LIGHT DOWN AND LEFT
2514 So [Symbol] [Graphic] [Form] [X] BOX DRAWINGS LIGHT UP AND RIGHT
2518 So [Symbol] [Graphic] [Form] [X] BOX DRAWINGS LIGHT UP AND LEFT

As can be seen, [Math], [Technical] and [Graphic] are used as the second
level classification, where [Graphic] is a leftover classification in this
context, because all symbols are also graphic. The following presents a possible
alternate classification that is consistently based on shape.
In this case [Line] would collect all symbols that are essentially drawn by
straight or curved linear elements not connected into closed shapes. [Corner] is
just what the name says. At the innermost level, the final classification
attempts to disambiguate visually similar symbols by usage categories. Because
this is the innermost level, the usage indicators don't need to be global (they
no longer partition **all** symbols, only the corners). Nevertheless,
where possible, a specific subset, such as [Quine corner], is more useful than a
generic one like [Math]:

221F Sm [Symbol] [Line] [Corner] [Math] RIGHT ANGLE

2308 Sm [Symbol] [Line] [Corner] [Math] LEFT CEILING
2309 Sm [Symbol] [Line] [Corner] [Math] RIGHT CEILING
230A Sm [Symbol] [Line] [Corner] [Math] LEFT FLOOR
230B Sm [Symbol] [Line] [Corner] [Math] RIGHT FLOOR

231C So [Symbol] [Line] [Corner] [Quine corner] TOP LEFT CORNER
231D So [Symbol] [Line] [Corner] [Quine corner] TOP RIGHT CORNER
231E So [Symbol] [Line] [Corner] [Quine corner] BOTTOM LEFT CORNER
231F So [Symbol] [Line] [Corner] [Quine corner] BOTTOM RIGHT CORNER
250C So [Symbol] [Line] [Corner] [Form] BOX DRAWINGS LIGHT DOWN AND RIGHT
2510 So [Symbol] [Line] [Corner] [Form] BOX DRAWINGS LIGHT DOWN AND LEFT
2514 So [Symbol] [Line] [Corner] [Form] BOX DRAWINGS LIGHT UP AND RIGHT
2518 So [Symbol] [Line] [Corner] [Form] BOX DRAWINGS LIGHT UP AND LEFT

An alternative to [Math] for the 2308-230B range could have been [Delimiter].
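To compare the two schemes on the corner-shaped symbols, the sketch below takes one representative from each group in the listings above and counts the distinct second-level classes under each scheme. The table layout and the helper name are assumptions made for this example.

```python
# (draft path, alternate path) for a sample of the corner-shaped symbols above.
CORNERS = {
    0x221F: (("Symbol", "Math", "X", "X"),
             ("Symbol", "Line", "Corner", "Math")),
    0x2308: (("Symbol", "Math", "X", "X"),
             ("Symbol", "Line", "Corner", "Math")),
    0x231C: (("Symbol", "Technical", "Quine corner", "X"),
             ("Symbol", "Line", "Corner", "Quine corner")),
    0x250C: (("Symbol", "Graphic", "Form", "X"),
             ("Symbol", "Line", "Corner", "Form")),
}


def second_level_classes(scheme_index):
    """Collect the level-2 labels used by one scheme (0 = draft, 1 = alternate)."""
    return {paths[scheme_index][1] for paths in CORNERS.values()}


draft_classes = second_level_classes(0)
alternate_classes = second_level_classes(1)
print(sorted(draft_classes))      # ['Graphic', 'Math', 'Technical']
print(sorted(alternate_classes))  # ['Line']
```

Under the draft, visually similar symbols are scattered over three second-level classes; under the shape-based alternative they share one, and the usage distinction moves to the innermost level.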

There will undoubtedly be edge cases where symbols would appear to belong equally to more than one category. As long as these are rare, they would not tend to invalidate the whole scheme.

An alternative would be to do away with a more or less fixed set of categories and also do away with the idea of a partition at each level. Perhaps even do away with the notion of the hierarchical aspect of the classification. That would be a very flexible scheme, and it would allow the user to "sort" the characters based on whatever category fits the problem at hand, for example to locate all [Math] characters, all [Circle] characters, or all [Bold] characters.
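Such a non-hierarchical scheme amounts to attaching a set of labels to each character and filtering on them. The following is a sketch only; the tag assignments are hypothetical and `with_tags` is an invented helper.

```python
# Hypothetical tag sets: no hierarchy and no fixed partition at any level.
TAGS = {
    0x0041: {"Letter", "Uppercase"},
    0x21D2: {"Symbol", "Arrow", "Double", "Math"},
    0x26AB: {"Symbol", "Geometric", "Circle"},
    0x1D400: {"Symbol", "Letter", "Uppercase", "Bold", "Math"},
}


def with_tags(table, *required):
    """Return code points that carry every tag in `required`."""
    wanted = set(required)
    return sorted(cp for cp, tags in table.items() if wanted <= tags)


print([f"{cp:04X}" for cp in with_tags(TAGS, "Math")])                 # ['21D2', '1D400']
print([f"{cp:04X}" for cp in with_tags(TAGS, "Letter", "Uppercase")])  # ['0041', '1D400']
```

Any combination of labels can serve as the "sort" key, and adding a label to one character never forces another label to be removed.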

No longer would there be the problem that a character needs to be shoehorned into only one category (or one set of categories), nor would there be the problem of having categories applied to only some characters, when they clearly would also apply to other characters. This is the problem I tried to capture with the term "leftover category" and which is a fundamental weakness of the General Category.

But what would then make such a new classification different from "just a bunch of Boolean properties", such as already exist today in DerivedCoreProperties.txt?

I see the main difference in not throwing **all** possible labels at the
characters, because that is the role of the Unicode Character Database as a
whole. Rather, one would pick a set of primary divisions (symbol/letter) and a
set of sufficiently descriptive classifications that relate to the identity of
the character (the purpose it was encoded for), rather than its specialized
function in a given algorithm or processing context. Ideally, given the glyph,
name and classification, it should be possible to unambiguously relate the
encoded character to its real-world counterpart, or (for the majority of
cases) to answer the question "which of these is it, that was intended for this
purpose?"

For symbols, because of the lack of shared linguistic constraints, such classification would probably be largely focused on appearance, while for letters, syllables and ideographs, the study of writing systems will provide the necessary categories.

For cases where letters are used as symbols, one would no longer be forced to choose

1D400 Lu [Symbol] [Math] [Alphanumeric] [Bold] MATHEMATICAL BOLD CAPITAL A
1D401 Lu [Symbol] [Math] [Alphanumeric] [Bold] MATHEMATICAL BOLD CAPITAL B
1D402 Lu [Symbol] [Math] [Alphanumeric] [Bold] MATHEMATICAL BOLD CAPITAL C

in contrast to:

0041 Lu [Letter] [X] [X] [X] LATIN CAPITAL LETTER A
0042 Lu [Letter] [X] [X] [X] LATIN CAPITAL LETTER B
0043 Lu [Letter] [X] [X] [X] LATIN CAPITAL LETTER C

Instead something like this would be possible:

0041 Lu [Letter] [Uppercase] [X] [X] LATIN CAPITAL LETTER A
0042 Lu [Letter] [Uppercase] [X] [X] LATIN CAPITAL LETTER B
0043 Lu [Letter] [Uppercase] [X] [X] LATIN CAPITAL LETTER C

1D400 Lu [Symbol / Letter] [Uppercase] [Bold] [Math] MATHEMATICAL BOLD CAPITAL A
1D401 Lu [Symbol / Letter] [Uppercase] [Bold] [Math] MATHEMATICAL BOLD CAPITAL B
1D402 Lu [Symbol / Letter] [Uppercase] [Bold] [Math] MATHEMATICAL BOLD CAPITAL C

If possible, classifications such as [Bold] or [Math] in the last set of
examples would be chosen so that they can be applied universally, by which I
mean that **all** characters that are either bold or of mathematical use
would carry the respective tags, or else, the use of that tag should be avoided
altogether. Different terminology would be unified, for example some dingbats
use "heavy", which could be unified with bold for all practical purposes. But
the principle should be, if a distinction based on font weight is made anywhere,
that feature would be marked for all characters where it applies.
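One way to picture this universality requirement is as a consistency check over the tagging data. The sketch below uses character names as a crude stand-in for "is bold", which is an assumption made purely for illustration; a real check would need curated data rather than name matching. The tag table is hypothetical and `looks_bold` and `missing_tag` are invented helpers.

```python
import unicodedata

# Assumption for illustration only: a character counts as bold when its
# name contains the word BOLD or HEAVY.
WEIGHT_WORDS = {"BOLD", "HEAVY"}


def looks_bold(cp: int) -> bool:
    name = unicodedata.name(chr(cp), "")
    return bool(WEIGHT_WORDS & set(name.replace("-", " ").split()))


def missing_tag(tags, code_points, tag="Bold"):
    """Code points that look bold by name but do not carry `tag`."""
    return sorted(cp for cp in code_points
                  if looks_bold(cp) and tag not in tags.get(cp, set()))


# Hypothetical tagging that marks the math letter but forgets the dingbat
# U+2714 HEAVY CHECK MARK.
TAGGED = {0x1D400: {"Bold", "Math"}, 0x2714: {"Symbol"}}
gaps = missing_tag(TAGGED, [0x0041, 0x1D400, 0x2714])
print([f"{cp:04X}" for cp in gaps])  # ['2714']
```

A non-empty result means the tag is not yet universal: either the listed characters should receive it, or the tag should be dropped from the scheme.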

That is different from the "corner" example above, which still assumed a need for hierarchical classification and so was forced to allow some categories to not be universal. Because the non-overlapping, non-universal nature of the distinctions in the General Category has been one of its principal sources of weakness, I would like to encourage the UTC not to stop at half measures in an attempt to augment or even replace it.

Rather than going merely from two to four levels in an otherwise nearly identical scheme, it is time to recognize the problems inherent in a hierarchical classification of elements that, unlike plants and animals, have no clear inheritance relation to common ancestors. In particular, every effort should be made to reduce or eliminate the use of "leftover" categories, which here does not mean "other", but "all characters of type X that are not already of type A, B or C", as exemplified in the use of [Math] or [Technical] in the proposed draft classification.