L2/09-280

Title: Maintaining a Typology of Unicode Characters
Source: Ken Whistler
Date: August 6, 2009
Action: For consideration by the UTC

Background

At the last UTC meeting (and prior to that briefly on the UTC discussion list) there was some consideration of the problems in extracting good "categories" for Unicode characters out of the Unicode names list. This was occasioned by the need to develop new character picker applications, which need to organize characters into groups that will make sense for people to find characters in graphic panes or other UI elements.

The problem is two-fold. First, the machine-readable data files don't provide a fine enough categorization to meet the requirements. For example, the General_Category property will distinguish letters from combining marks and punctuation and symbols, but it doesn't drill down to the next level: independent vowel letters versus consonants versus matras; or game symbols versus map symbols versus zodiacal symbols versus dingbats; and so on.

Second, people who need that kind of finer detail of categorization have been attempting to extract it by making use of the editorial subheaders used in the printing of the Unicode names list, figuring that that information is better than nothing -- and attempting to do the finer-level classification from scratch seems prohibitively complex.

The fact is, however, that the subheaders in the Unicode names list were always editorial content aimed more at structuring the code charts for display, and are not particularly well-suited to a systematic categorization of Unicode characters in any context more extensive than considering characters one chart at a time.

Efforts to revise the subheaders to make them "work better" for machine-extracted categorization of Unicode characters from the Unicode names list are, IMO, counterproductive. They wouldn't work very well that way, and the net result would be a significant deterioration of the editorial content of the code charts.

Proposal

I'm suggesting another way. The same program that is used to maintain the Unicode names list can be repurposed to use another annotation data file as input to an automated merger of annotations and the UnicodeData.txt file, producing as output a structured data file containing typological information about all Unicode characters, already in suitable format for direct import into a spreadsheet. Once in a spreadsheet, it can easily be further manipulated to whatever end an implementer needs.

The scheme I have in mind would use a hierarchical typology, which would be extensible based on what level of detail folks find it useful to maintain for various characters. For example:

    Letter
    Letter > Vowel
    Letter > Vowel > Dependent (i.e. Indic matras)
    Letter > Consonant > Dependent > Subjoined

and so on, or for symbols:

    Symbol
    Symbol > Graphic
    Symbol > Technical
    Symbol > Technical > Keyboard
    Symbol > Arrow
    Symbol > Arrow > Harpoon
    Symbol > Arrow > Harpoon > Double

or

    Punctuation
    Punctuation > Bracket
    Punctuation > Bracket > CJK

and so on. The merged data would be formatted into tab- or comma-delimited fields, somewhat like this:

    Code  GC  Level1     Level2        Level3        Level4  Name
    23CE  So  Symbol     Technical     Keyboard              RETURN SYMBOL
    ...
    2460  No  Symbol     Number        Circled               CIRCLED DIGIT ONE
    ...
    25CB  So  Symbol     Geometric                           WHITE CIRCLE
    ...
    2602  So  Symbol     Weather                             UMBRELLA
    ...
    260A  So  Symbol     Astrological                        ASCENDING NODE
    ...
    2660  So  Symbol     Game          Playing card  Suit    BLACK SPADE SUIT
    ...
    2FBD  So  Ideograph  Radical       CJK           Kangxi  KANGXI RADICAL HAIR
    ...
    A869  Lo  Letter     Consonant                           PHAGS-PA LETTER TTA
    ...

For all Unicode characters.

Note that the existing subheaders often clump characters. For example, the header for the range U+2600..U+260D is "Weather and astrological symbols".
But as the example above shows, we can do much better, distinguishing more precisely those which are weather symbols, such as U+2602 UMBRELLA, those which are astrological symbols, such as U+260A ASCENDING NODE, and those which really aren't either, such as U+2606 WHITE STAR.

Currently I'm working with four levels of typology, but this could easily be extended to five (or more), if finer levels of distinction for some groups of characters proved to be desirable. For example, arrows could be subcategorized based on their shapes and orientations.

The first key here is staying flexible, so that the typology can be extended and modified easily in the future, as may prove suitable. Merging an annotation file with UnicodeData.txt makes it very easy to assign new subtypes, or to change or subdivide ranges already assigned to types and subtypes, without having to do extensive modification of explicit listing files.

The second key is a corollary to the first: this MUST NOT turn into another normative data file and/or normative set of property values. That is the trap that has always afflicted the General_Category property and which makes it useless for this kind of finer-level categorization.

I already have an implementation in hand that can produce this data, and have done a first pass typological classification of all the Unicode characters along these lines.

If the UTC is interested in pursuing this, I would suggest developing a draft for a new Unicode Technical Report that would explain the general approach towards maintaining a typology of Unicode characters, explain the data file format, and have an associated informative data file that people could use to get this kind of typological information about the characters. The closest analogy among our existing documents is UTR #25, "Unicode Support for Mathematics" and its associated informational data file, MathClass.txt, which classifies mathematical characters by their typographical behavior.
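
To make the annotation-plus-merger idea concrete, here is a minimal sketch in Python. It is not the implementation mentioned above, and nearly everything in it is an assumption made for illustration: the annotation syntax (`first..last;Level1>Level2>...`), the rule that a later annotation line overrides an earlier one for the same code point, the particular ranges assigned to each type, and all function names. The sample character lines follow the semicolon-delimited layout of UnicodeData.txt, but only the first three fields (code point, name, General_Category) are read.

```python
import csv
import io

MAX_LEVELS = 4  # the proposal currently works with four levels

# Hypothetical annotation file: a broad range first, then refinements.
# The ranges chosen here are illustrative, not a proposed classification.
ANNOTATIONS = """\
# first..last;Level1>Level2>Level3>Level4
2600..260D;Symbol
2600..2604;Symbol>Weather
2609..260D;Symbol>Astrological
2660..2667;Symbol>Game>Playing card>Suit
"""

# Sample lines in the UnicodeData.txt layout (only fields 0-2 are used).
UNICODE_DATA = """\
2602;UMBRELLA;So;0;ON;;;;;N;;;;;
2606;WHITE STAR;So;0;ON;;;;;N;;;;;
260A;ASCENDING NODE;So;0;ON;;;;;N;;;;;
2660;BLACK SPADE SUIT;So;0;ON;;;;;N;;;;;
"""


def parse_annotations(text):
    """Return a list of (first, last, [levels]) tuples in file order."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        span, path = line.split(";", 1)
        first, _, last = span.partition("..")
        last = last or first  # allow a single code point
        levels = [part.strip() for part in path.split(">")]
        if len(levels) > MAX_LEVELS:
            raise ValueError(f"too many levels in: {line}")
        ranges.append((int(first, 16), int(last, 16), levels))
    return ranges


def typology_for(code_point, ranges):
    """Return the levels of the last annotation covering code_point."""
    levels = []
    for first, last, path in ranges:
        if first <= code_point <= last:
            levels = path  # later lines override earlier ones
    return levels


def merge(unicode_data, ranges):
    """Merge character lines with annotations into table rows."""
    header = ["Code", "GC"]
    header += [f"Level{i}" for i in range(1, MAX_LEVELS + 1)]
    header.append("Name")
    rows = [header]
    for line in unicode_data.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code, name, gc = fields[0], fields[1], fields[2]
        levels = typology_for(int(code, 16), ranges)
        padded = levels + [""] * (MAX_LEVELS - len(levels))
        rows.append([code, gc, *padded, name])
    return rows


def to_tsv(rows):
    """Serialize rows as tab-delimited text for spreadsheet import."""
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_tsv(merge(UNICODE_DATA, parse_annotations(ANNOTATIONS))), end="")
```

Under these assumptions, subdividing a range means adding one annotation line rather than editing per-character listings, and raising `MAX_LEVELS` adds a fifth column. In this sample, U+2606 WHITE STAR ends up with only the broad `Symbol` type because no refinement covers it. The sketch does not handle the `First`/`Last` range entries that UnicodeData.txt uses for large blocks such as CJK ideographs.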