L2/10-452

Feedback on the Proposed Draft Unicode Technical Report #49 Unicode Character Categories

Date: October 31, 2010
Source: Asmus Freytag

This document contains feedback on the UTR#49 draft and on the character classification project that it describes.

Background and Problem Statement

The UTR draft describes a four-level, hierarchical (tree-structured) classification of characters. As stated in the UTR, it is intended to address the weaknesses of the primary classification mechanism for characters, the General Category.

The proposed hierarchical approach would be expected to work reasonably well with Letters, but will run into its first problem with punctuation and symbols.

The problem inherent in any single classification scheme is that it can have only a single focus. Thus it is unable to accommodate the different viewpoints that arise from ambiguous use of characters. What has been sketched in UTR#49 and the associated draft data file for the letters is essentially a functional classification (leaving aside whether the focus is on function in terms of the content of written speech, or on the function within the writing system - including typographic aspects). For letters it may even be possible to classify based on different functional aspects on different levels of the proposed hierarchy: for example, VOWEL relates to the function of the element in the language, and DEPENDENT relates to its function in the typography of the writing system.

For punctuation, this kind of scheme breaks down badly, because the re-use of the same entity for entirely unrelated purposes is rampant. For example, U+002E is at once sentence-ending punctuation, a word-terminal punctuation (abbreviation), a number separator with locale-dependent meaning, and part of several multiple-dot sequences of an occasional ad-hoc nature (ellipsis spelled out, "range" notation, etc.). In the draft classification, this issue is circumvented by not providing any finer classification than "Punctuation" for U+002E:

002E Po [Punctuation] [X] [X] [X] FULL STOP 

This listing follows the format of the draft data file, which is explained in the draft. The second field is the existing classification by General Category, followed by the four levels of the proposed classification.
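The record layout just described can be read mechanically. The following is a minimal sketch, assuming only the layout visible in the listings of this document (code point, General Category, four bracketed labels with [X] as the empty placeholder, then the character name). The names DraftRecord and parse_record are hypothetical and are not part of the draft.

```python
import re
from typing import NamedTuple, Optional, Tuple


class DraftRecord(NamedTuple):
    code_point: int
    general_category: str
    levels: Tuple[Optional[str], ...]  # four entries; None stands for [X]
    name: str


# Fields: code point, General Category, four bracketed labels, character name.
# Fields may be separated by spaces or tabs, as in the listings above.
_RECORD = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})\s+([A-Z][a-z])\s+((?:\[[^\]]*\]\s*){4})(.+?)\s*$"
)


def parse_record(line: str) -> DraftRecord:
    """Parse one line of the (assumed) draft data file layout."""
    match = _RECORD.match(line)
    if match is None:
        raise ValueError(f"not a draft data record: {line!r}")
    labels = re.findall(r"\[([^\]]*)\]", match.group(3))
    levels = tuple(
        None if label.strip() == "X" else label.strip() for label in labels
    )
    return DraftRecord(
        int(match.group(1), 16), match.group(2), levels, match.group(4)
    )


record = parse_record("002E Po [Punctuation] [X] [X] [X] FULL STOP ")
```

For U+002E the parsed levels contain a single label followed by three empty slots, which is exactly the lack of finer classification noted above.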

The Problem of Classifying Symbols

For Symbols the problem gets worse. Here there exists an inherent tug-of-war between a desire to classify symbols visually (by characteristics of their appearance) on the one hand and functionally (by their typical use) on the other hand. The following illustrates the problem:

003C Sm [Symbol] [Math] [Relation] [X] LESS-THAN SIGN 
003D Sm [Symbol] [Math] [Relation] [X] EQUALS SIGN 
003E Sm [Symbol] [Math] [Relation] [X] GREATER-THAN SIGN 

Here, the new classification at first appears to add valuable information over the General Category by classifying the Sm characters by their function in mathematical notation (Relation).

However, a large class of mathematical operators for relations is expressed symbolically by the use of arrows, and these are all classified visually, as in the following examples:

21D0 So [Symbol] [Arrow] [Double] [X] LEFTWARDS DOUBLE ARROW 
21D2 Sm [Symbol] [Arrow] [Double] [X] RIGHTWARDS DOUBLE ARROW

The latter set of examples happens to erase a distinction between Sm and So (which in this case was probably a spurious distinction anyway). But as the two sets of examples show, nothing really useful has been gained by the new classification. Like the Sm classification before it, [Symbol][Math] doesn't capture *all* the math symbols, only those that are "not much used for something else". [Relation] doesn't help much either: because it leaves out all the arrows, it cannot be used to identify all relations.
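The gap can be made concrete with a small query over the five records quoted above. This is only a sketch: the dictionary DRAFT and the helper with_label are hypothetical names, and the data is copied from the listings in this document rather than from the draft data file itself.

```python
# Code point -> (General Category, four draft levels); None stands for [X].
DRAFT = {
    0x003C: ("Sm", ("Symbol", "Math", "Relation", None)),   # LESS-THAN SIGN
    0x003D: ("Sm", ("Symbol", "Math", "Relation", None)),   # EQUALS SIGN
    0x003E: ("Sm", ("Symbol", "Math", "Relation", None)),   # GREATER-THAN SIGN
    0x21D0: ("So", ("Symbol", "Arrow", "Double", None)),    # LEFTWARDS DOUBLE ARROW
    0x21D2: ("Sm", ("Symbol", "Arrow", "Double", None)),    # RIGHTWARDS DOUBLE ARROW
}


def with_label(data, label):
    """Return the sorted code points that carry `label` at any level."""
    return sorted(cp for cp, (_, levels) in data.items() if label in levels)


relations = with_label(DRAFT, "Relation")
math_symbols = with_label(DRAFT, "Math")
general_category_sm = sorted(cp for cp, (gc, _) in DRAFT.items() if gc == "Sm")
```

In this sample, neither the [Relation] query nor the [Math] query returns U+21D2, although its General Category is Sm. The arrows are reachable only through [Arrow].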

This kind of issue is not limited to mathematical symbols. Many technical symbols have been (or could be) unified with geometrical symbols. There is a group of well-known user interface symbols for the controls on audio and video devices. The medium black circle has been unified with the UI symbol for Record, and Stop can be represented by the medium black square, although it is not so called out in the names list.

23E9	BLACK RIGHT-POINTING DOUBLE TRIANGLE
	= fast forward
23EA	BLACK LEFT-POINTING DOUBLE TRIANGLE
	= fast rewind
25FC	BLACK MEDIUM SQUARE
	x (black square - 25A0)
26AB 	MEDIUM BLACK CIRCLE
	* UI symbol for record function

The proposed classification for these gives some additional detail over the General Category, and correctly identifies 26AB as "geometrical" even though it had been coded in the Miscellaneous Symbols block for expediency. However, only the symbols for fast forward and fast rewind are identified as interface symbols, and only the circle is accorded a subclassification by shape family in the geometrical shapes - the square is not.

23E9  So [Symbol] [Technical] [Interface]	[X]	BLACK RIGHT-POINTING DOUBLE TRIANGLE
23EA  So [Symbol] [Technical] [Interface]	[X]	BLACK LEFT-POINTING DOUBLE TRIANGLE
25FC  Sm [Symbol] [Geometric] [X]		[X]	BLACK MEDIUM SQUARE
26AB  So [Symbol] [Geometric] [Circle]	[X]	MEDIUM BLACK CIRCLE

As can be seen from this set of examples, [Technical], like [Math] in the earlier examples, is conceived of in the draft classification data as a "leftover" category: only symbols that fit none of the other categories (such as [Geometric]) are classified as [Technical] or [Math], even though many, many more symbols have strong mathematical or technical use.

More Useful Approaches

For symbols, it would be more useful to provide a set of overlapping classifications that can exist in parallel for the same symbol. One classification should be one based on appearance. Such a classification has a chance of allowing a true (hierarchical) partition of the symbol space (and a similar scheme could handle punctuation as well). It would have another advantage, in that it would be immediately useful as a means for grouping characters for "pick lists". At the innermost level, as a "tie breaker" among confusable symbols, some usage classification might still be useful in an otherwise visual scheme.

Separate classification systems, such as that developed in mathclass.txt (see http://www.unicode.org/Public/math/), should then be used to classify symbols once more by usage. However, this time, more than one primary usage should be allowed. Arrows used in math would get one classification ([Relation]), while arrows used as part of Technical Symbols would get another ([Chemical reaction], etc.). Unlike the visual classification, such classifications cannot be carried out across the entire symbol space. Therefore, they might as well continue to reside in parallel data files (such as mathclass), although we could discuss whether a consistent presentation of such categorization efforts would be useful - for example making some implicit categories explicit (e.g. "Symbol" "math" for most of mathclass - except digits etc.).
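A minimal sketch of the two parallel layers follows. It assumes one visual path per symbol (a partition) and any number of usage labels per symbol. The specific assignments are illustrative assumptions only: the [Single] label and the usage labels given to U+2192 are not taken from any existing data file.

```python
from collections import defaultdict

# Visual layer: exactly one path per symbol, so it partitions the symbols.
VISUAL = {
    0x2192: ("Symbol", "Arrow", "Single"),   # RIGHTWARDS ARROW (label assumed)
    0x21D0: ("Symbol", "Arrow", "Double"),   # LEFTWARDS DOUBLE ARROW
    0x21D2: ("Symbol", "Arrow", "Double"),   # RIGHTWARDS DOUBLE ARROW
}

# Usage layer: zero or more labels per symbol, kept in a separate table
# (analogous to a parallel data file). Assignments are illustrative.
USAGE = {
    0x2192: {"Relation", "Chemical reaction"},
    0x21D2: {"Relation"},
}


def by_usage(usage):
    """Invert the usage table: label -> set of code points."""
    index = defaultdict(set)
    for cp, labels in usage.items():
        for label in labels:
            index[label].add(cp)
    return dict(index)


usage_index = by_usage(USAGE)
unclassified_by_usage = sorted(cp for cp in VISUAL if cp not in USAGE)
```

The usage table is allowed to be incomplete (U+21D0 has no usage entry here) without disturbing the visual partition, which mirrors the point that usage classifications cannot be carried out across the entire symbol space.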

Given the problems presented so far, I disagree with the statement in the draft UTR that the newly proposed scheme fundamentally improves on the General Category. I think that where the General Category has the category "other", some of the further classifications suggested are going in the right direction. In the following example one might add "Straight" or some other inner-level appearance classifier.

21D0  So [Symbol] [Arrow] [Double] [Straight] LEFTWARDS DOUBLE ARROW 

However, to make that scheme useful, the second-level classification has to be a true partition. All symbols based on arrows need to be of class [Arrow], which allows parallel categories like [Snowflake] or [Astrological Symbol] but not [Math] or [Technical]. Instead of [Technical] one might need to dig deeper and distinguish between symbols that are "pictorial", "schematic" or, yes, "symbolic". Those that are currently [Math] could be classified by related shape, e.g. all the "tacks" make a series, as do all the "integrals", etc.

The following two sets of examples show instances of symbols that are represented by a variation of the [Corner] shape, which is a sub-class of [Line]. Here is the way these symbols are classified in the proposed scheme today:

221F	Sm	[Symbol]	[Math]	[X]	[X]	RIGHT ANGLE
2308	Sm	[Symbol]	[Math] 	[X] 	[X] 	LEFT CEILING 
2309	Sm	[Symbol]	[Math] 	[X] 	[X] 	RIGHT CEILING 
230A	Sm	[Symbol]	[Math] 	[X] 	[X] 	LEFT FLOOR 
230B	Sm	[Symbol]	[Math] 	[X] 	[X] 	RIGHT FLOOR 
231C	So	[Symbol]	[Technical] [Quine corner]	[X]	TOP LEFT CORNER
231D	So	[Symbol]	[Technical] [Quine corner]	[X]	TOP RIGHT CORNER
231E	So	[Symbol]	[Technical] [Quine corner]	[X]	BOTTOM LEFT CORNER
231F	So	[Symbol]	[Technical] [Quine corner]	[X]	BOTTOM RIGHT CORNER

250C	So	[Symbol]	[Graphic]	[Form]	[X]	BOX DRAWINGS LIGHT DOWN AND RIGHT
2510	So	[Symbol]	[Graphic]	[Form]	[X]	BOX DRAWINGS LIGHT DOWN AND LEFT
2514	So	[Symbol]	[Graphic]	[Form]	[X]	BOX DRAWINGS LIGHT UP AND RIGHT
2518	So	[Symbol]	[Graphic]	[Form]	[X]	BOX DRAWINGS LIGHT UP AND LEFT

As can be seen, [Math], [Technical] and [Graphic] are used as the second-level classification, where [Graphic] is a leftover classification in this context, because all symbols are also graphic. The following presents a possible alternate classification that is consistently based on shape. In this case [Line] would collect all symbols that are essentially drawn by straight or curved linear elements not connected into closed shapes. [Corner] is just what the name says. At the innermost level, the final classification attempts to disambiguate visually similar symbols by usage categories. Because this is the innermost level, the usage indicators don't need to be global (they no longer partition all symbols, only the corners). Nevertheless, where possible, a specific subset, such as [Quine corner], is more useful than a generic one like [Math]:

221F	Sm	[Symbol]	[Line]	[Corner]	[Math]	RIGHT ANGLE
2308	Sm	[Symbol]	[Line] 	[Corner] 	[Math] 	LEFT CEILING 
2309	Sm	[Symbol]	[Line] 	[Corner] 	[Math] 	RIGHT CEILING 
230A	Sm	[Symbol]	[Line] 	[Corner] 	[Math] 	LEFT FLOOR 
230B	Sm	[Symbol]	[Line] 	[Corner] 	[Math] 	RIGHT FLOOR 
231C	So	[Symbol]	[Line]	[Corner]	[Quine corner]	TOP LEFT CORNER
231D	So	[Symbol]	[Line]	[Corner]	[Quine corner]	TOP RIGHT CORNER
231E	So	[Symbol]	[Line]	[Corner]	[Quine corner]	BOTTOM LEFT CORNER
231F	So	[Symbol]	[Line]	[Corner]	[Quine corner]	BOTTOM RIGHT CORNER

250C	So	[Symbol]	[Line]	[Corner]	[Form] 	BOX DRAWINGS LIGHT DOWN AND RIGHT
2510	So	[Symbol]	[Line]	[Corner]	[Form]	BOX DRAWINGS LIGHT DOWN AND LEFT
2514	So	[Symbol]	[Line]	[Corner]	[Form]	BOX DRAWINGS LIGHT UP AND RIGHT
2518	So	[Symbol]	[Line]	[Corner]	[Form]	BOX DRAWINGS LIGHT UP AND LEFT

An alternative to the [Math] for the 2308-230B range could have been [Delimiter].
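The difference between the two listings for the corner symbols can be checked with a short comparison. This is a sketch only: the tables below transcribe the second-level and innermost labels from the listings above, and the variable names are hypothetical.

```python
from collections import Counter

CEILING_FLOOR = [0x221F, 0x2308, 0x2309, 0x230A, 0x230B]
QUINE = [0x231C, 0x231D, 0x231E, 0x231F]
BOX = [0x250C, 0x2510, 0x2514, 0x2518]

# Second-level label per code point in the draft as it stands today.
draft_level2 = {}
draft_level2.update({cp: "Math" for cp in CEILING_FLOOR})
draft_level2.update({cp: "Technical" for cp in QUINE})
draft_level2.update({cp: "Graphic" for cp in BOX})

# Alternate scheme: (level 2, level 3, innermost usage tie-breaker).
alternate = {}
alternate.update({cp: ("Line", "Corner", "Math") for cp in CEILING_FLOOR})
alternate.update({cp: ("Line", "Corner", "Quine corner") for cp in QUINE})
alternate.update({cp: ("Line", "Corner", "Form") for cp in BOX})

draft_second_levels = set(draft_level2.values())
alternate_second_levels = {path[0] for path in alternate.values()}

# All corner-shaped symbols are found by one shape query in the alternate scheme.
corners = sorted(cp for cp, path in alternate.items() if path[1] == "Corner")

# The innermost label then only has to separate the corners from each other.
tie_breakers = Counter(path[2] for path in alternate.values())
```

In the draft, the thirteen corner-shaped symbols are spread over three second-level classes. In the alternate scheme they share one shape path, and usage appears only as the local tie-breaker.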

There will undoubtedly be edge cases where symbols appear to belong equally to more than one category. As long as these are rare, they would not tend to invalidate the whole scheme.

A Possible Alternative

An alternative would be to do away with a more or less fixed set of categories and also with the idea of a partition at each level, and perhaps even with the hierarchical aspect of the classification altogether. That would be a very flexible scheme, and it would allow the user to "sort" the characters based on whatever category fits the problem at hand: for example, to locate all [Math] characters, all [Circle] characters, or all [Bold] characters.
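Such a non-hierarchical scheme amounts to attaching an unordered set of labels to each character and selecting by any combination of them. A minimal sketch follows, assuming hypothetical label assignments for four characters. The labels are borrowed from listings elsewhere in this document and are not proposed data.

```python
# Code point -> unordered set of labels (no levels, no partition).
TAGS = {
    0x0041: {"Letter", "Uppercase"},                             # LATIN CAPITAL LETTER A
    0x003D: {"Symbol", "Math", "Relation"},                      # EQUALS SIGN
    0x26AB: {"Symbol", "Geometric", "Circle"},                   # MEDIUM BLACK CIRCLE
    0x1D400: {"Symbol", "Letter", "Uppercase", "Bold", "Math"},  # MATHEMATICAL BOLD CAPITAL A
}


def select(tags, *required):
    """Return the sorted code points that carry every label in `required`."""
    wanted = set(required)
    return sorted(cp for cp, labels in tags.items() if wanted <= labels)


all_math = select(TAGS, "Math")
all_circles = select(TAGS, "Circle")
uppercase_letters = select(TAGS, "Letter", "Uppercase")
```

Because no label excludes another, U+1D400 is found both by the query for [Math] and by the query for uppercase letters.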

No longer would there be the problem that a character needs to be shoehorned into only one category (or one set of categories), nor would there be the problem of having categories applied to only some characters, when they clearly would also apply to other characters. This is the problem I tried to capture with the term "leftover category" and which is a fundamental weakness of the General Category.

But what would then make such a new classification different from "just a bunch of Boolean properties", such as already exist today in DerivedCoreProperties.txt?

I see the main difference in not throwing all possible labels at the characters - because that is the role of the Unicode Character Database as a whole. Rather, one would pick a set of primary divisions (symbol/letter) and a set of sufficiently descriptive classifications that relate to the identity of the character (the purpose it was encoded for), rather than its specialized function in a given algorithm or processing context. Ideally, given the glyph, name and classification, it should be possible to unambiguously relate the encoded character to its real-world counterpart, or (for the majority of cases) to answer the question "which of these is it, that was intended for this purpose?"

For symbols, because of the lack of shared linguistic constraints, such classification would probably be largely focused on appearance, while for letters, syllables and ideographs, the study of writing systems will provide the necessary categories.

For cases where letters are used as symbols, one would no longer be forced to choose:

1D400  Lu  [Symbol]	[Math]	[Alphanumeric]	[Bold]	MATHEMATICAL BOLD CAPITAL A
1D401  Lu  [Symbol]	[Math]	[Alphanumeric]	[Bold]	MATHEMATICAL BOLD CAPITAL B
1D402  Lu  [Symbol]	[Math]	[Alphanumeric]	[Bold]	MATHEMATICAL BOLD CAPITAL C

in contrast to:

0041  Lu  [Letter]	[X]	[X]	[X]	LATIN CAPITAL LETTER A
0042  Lu  [Letter]	[X]	[X]	[X]	LATIN CAPITAL LETTER B
0043  Lu  [Letter]	[X]	[X]	[X]	LATIN CAPITAL LETTER C

Instead something like this would be possible:

0041  Lu [Letter]	[Uppercase]	[X]	[X]	LATIN CAPITAL LETTER A
0042  Lu [Letter]	[Uppercase]	[X]	[X]	LATIN CAPITAL LETTER B
0043  Lu [Letter]	[Uppercase]	[X]	[X]	LATIN CAPITAL LETTER C
1D400  Lu [Symbol  /  Letter]	[Uppercase]	[Bold]	[Math]	MATHEMATICAL BOLD CAPITAL A
1D401  Lu [Symbol  /  Letter]	[Uppercase]	[Bold]	[Math]	MATHEMATICAL BOLD CAPITAL B
1D402  Lu [Symbol  /  Letter]	[Uppercase]	[Bold]	[Math]	MATHEMATICAL BOLD CAPITAL C

If possible, classifications such as [Bold] or [Math] in the last set of examples would be chosen so that they can be applied universally, by which I mean that all characters that are either bold or of mathematical use would carry the respective tags, or else, the use of that tag should be avoided altogether. Different terminology would be unified, for example some dingbats use "heavy", which could be unified with bold for all practical purposes. But the principle should be, if a distinction based on font weight is made anywhere, that feature would be marked for all characters where it applies.
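The universality principle can be phrased as a check: wherever a label applies, it must be present. The sketch below uses the character name as a rough proxy for "font weight is distinguished here", which is an assumption made purely for illustration, as are the helper names and the [Dingbat] label. A real check would need a better criterion than the name.

```python
import unicodedata

# Words in character names treated as indicating heavier weight (assumption).
WEIGHT_WORDS = {"BOLD", "HEAVY"}


def looks_heavy(cp):
    """Rough proxy: the character name contains a weight word."""
    name = unicodedata.name(chr(cp), "")
    return any(word in WEIGHT_WORDS for word in name.replace("-", " ").split())


def missing_tag(tags, tag, applies):
    """Return code points where `applies` holds but `tag` is absent."""
    return sorted(
        cp for cp, labels in tags.items() if applies(cp) and tag not in labels
    )


TAGS = {
    0x0041: {"Letter", "Uppercase"},                             # LATIN CAPITAL LETTER A
    0x1D400: {"Symbol", "Letter", "Uppercase", "Bold", "Math"},  # MATHEMATICAL BOLD CAPITAL A
    0x2714: {"Symbol", "Dingbat"},                               # HEAVY CHECK MARK, [Bold] left off on purpose
}

gaps = missing_tag(TAGS, "Bold", looks_heavy)
```

Here the dingbat named "heavy" is reported as a gap, which is the situation that unifying "heavy" with [Bold] is meant to remove.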

That is different from the "corner" example above, which still assumed a need for hierarchical classification and so was forced to allow some categories to not be universal. Because the non-overlapping, non-universal nature of the distinctions in the General Category has been one of its principal sources of weakness, I would like to encourage the UTC not to stop at half measures in an attempt to augment or even replace it.

Rather than going merely from two to four levels, in an otherwise nearly identical scheme, it is time to recognize the problems inherent in a hierarchical classification of elements that - unlike plants and animals - have no clear inheritance relation to common ancestors. In particular, every effort should be made to reduce or eliminate the use of "leftover" categories, which here does not mean "other", but "all characters of type X that are not already of type A, B or C", as exemplified in the use of [Math] or [Technical] in the proposed draft classification.