L2/09-280


Title:  Maintaining a Typology of Unicode Characters
Source: Ken Whistler
Date:   August 6, 2009
Action: For consideration by the UTC


Background


At the last UTC meeting (and prior to that briefly on the
UTC discussion list) there was some consideration of the
problems in extracting good "categories" for Unicode
characters out of the Unicode names list. This was occasioned
by the need to develop new character picker applications,
which need to organize characters into groups that make
sense to people looking for characters in graphic panes
or other UI elements.

The problem is two-fold. First, the machine-readable
data files don't provide a fine enough categorization to
meet the requirements. For example, the General_Category
property will distinguish letters from combining marks
and punctuation and symbols, but it doesn't drill down
to the next level: independent vowel letters versus
consonants versus matras; or game symbols versus map
symbols versus zodiacal symbols versus dingbats; and
so on. Second, people who need that finer level of
categorization have been attempting to extract it from
the editorial subheaders used in the printing of the
Unicode names list, on the grounds that such information
is better than nothing and that doing the finer-level
classification from scratch seems prohibitively complex.

The fact is, however, that the subheaders in the Unicode
names list were always editorial content aimed more at
structuring the code charts for display, and are not
particularly well-suited to a systematic categorization
of Unicode characters in any context more extensive than
considering characters one chart at a time. Efforts to revise the
subheaders to make them "work better" for machine-extracted
categorization of Unicode characters from the Unicode
names list are, IMO, counterproductive. They wouldn't
work very well that way, and the net result would be a
significant deterioration of the editorial content of
the code charts.


Proposal

I'm suggesting another way.

The same program that is used to maintain the Unicode
names list can be repurposed to use another annotation
data file as input to an automated merger of annotations
and the UnicodeData.txt file, producing as output a
structured data file containing typological information
about all Unicode characters, already in suitable format
for direct import into a spreadsheet. Once in a spreadsheet,
it can easily be further manipulated to whatever end
an implementer needs.
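
To make the merger step concrete, here is a minimal sketch in Python. It
is an illustration only, not the actual program: the in-memory shape of
the annotations (a list of code point ranges with typology levels) and
the names ANNOTATIONS, lookup, merge, and to_tsv are all assumptions. The
only thing taken from UnicodeData.txt is its semicolon-delimited layout,
in which the first three fields are the code point, the character name,
and the General_Category.

```python
import csv
import io

# Sample records in UnicodeData.txt layout; only the first three fields
# (code point, name, General_Category) are used by this sketch.
UNICODE_DATA = """\
2602;UMBRELLA;So;0;ON;;;;;N;;;;;
2606;WHITE STAR;So;0;ON;;;;;N;;;;;
260A;ASCENDING NODE;So;0;ON;;;;;N;;;;;
"""

# Hypothetical annotations: (first, last, typology levels). The real
# annotation file format is not specified here; this shape is assumed.
ANNOTATIONS = [
    (0x2602, 0x2602, ("Symbol", "Weather")),
    (0x260A, 0x260A, ("Symbol", "Astrological")),
]

MAX_LEVELS = 4  # number of typology columns in the output


def lookup(code):
    """Return the typology levels of the first range containing code."""
    for first, last, levels in ANNOTATIONS:
        if first <= code <= last:
            return levels
    return ()  # unannotated characters get empty level columns


def merge(unicode_data_text):
    """Merge annotations with UnicodeData.txt records into output rows."""
    rows = []
    for line in unicode_data_text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code, name, gc = fields[0], fields[1], fields[2]
        levels = lookup(int(code, 16))
        padded = list(levels) + [""] * (MAX_LEVELS - len(levels))
        rows.append([code, gc] + padded + [name])
    return rows


def to_tsv(rows):
    """Format rows as tab-delimited text for import into a spreadsheet."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["Code", "GC", "Level1", "Level2", "Level3", "Level4", "Name"]
    )
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_tsv(merge(UNICODE_DATA)), end="")
```

In this sketch a character with no annotation simply gets empty
typology columns, so gaps in coverage remain visible in the spreadsheet.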

The scheme I have in mind would use a hierarchical typology,
which would be extensible based on what level of detail
folks find it useful to maintain for various characters.
For example:

Letter

Letter > Vowel

Letter > Vowel > Dependent  (i.e. Indic matras)

Letter > Consonant > Dependent > Subjoined

and so on, or for symbols:

Symbol

Symbol > Graphic

Symbol > Technical

Symbol > Technical > Keyboard

Symbol > Arrow

Symbol > Arrow > Harpoon

Symbol > Arrow > Harpoon > Double

or

Punctuation

Punctuation > Bracket

Punctuation > Bracket > CJK

and so on.
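
The paths above lend themselves to a simple representation as tuples of
level names. The following Python sketch is one possible way to handle
them, under the assumption that " > " is used as the separator in
whatever file carries the typology; the function names parse_path,
is_within, and pad_levels are hypothetical.

```python
def parse_path(path):
    """Split 'A > B > C' into the tuple ('A', 'B', 'C')."""
    return tuple(part.strip() for part in path.split(">") if part.strip())


def is_within(path, ancestor):
    """True if path equals ancestor or lies below it in the hierarchy."""
    levels = parse_path(path)
    prefix = parse_path(ancestor)
    return levels[:len(prefix)] == prefix


def pad_levels(path, depth=4):
    """Pad a path with empty strings to a fixed number of level columns."""
    levels = parse_path(path)
    if len(levels) > depth:
        raise ValueError(f"{path!r} has more than {depth} levels")
    return levels + ("",) * (depth - len(levels))


if __name__ == "__main__":
    print(parse_path("Symbol > Arrow > Harpoon > Double"))
    print(is_within("Symbol > Arrow > Harpoon", "Symbol > Arrow"))
    print(pad_levels("Punctuation > Bracket > CJK"))
```

Because membership is just a prefix test, adding a deeper subtype does
not disturb anything that queries only the upper levels.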

The merged data would be formatted into tab- or comma-delimited
fields, somewhat like this:

Code GC Level1    Level2       Level3       Level4 Name

23CE So Symbol    Technical    Keyboard            RETURN SYMBOL
...
2460 No Symbol    Number       Circled             CIRCLED DIGIT ONE
...
25CB So Symbol    Geometric                        WHITE CIRCLE
...
2602 So Symbol    Weather                          UMBRELLA
...
260A So Symbol    Astrological                     ASCENDING NODE
...
2660 So Symbol    Game         Playing card Suit   BLACK SPADE SUIT
...
2FBD So Ideograph Radical      CJK          Kangxi KANGXI RADICAL HAIR
...
A869 Lo Letter    Consonant                        PHAGS-PA LETTER TTA
...

and so on, for all Unicode characters. Note that the existing
subheaders often clump characters. For example, the
header for the range U+2600..U+260D is "Weather and astrological 
symbols". But as the example above shows, we can do much
better, distinguishing more precisely those which are
weather symbols, such as U+2602 UMBRELLA, those which
are astrological symbols, such as U+260A ASCENDING NODE,
and those which really aren't either, such as U+2606 WHITE STAR.
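
For a character picker, the benefit of this finer distinction is that
characters can be grouped at whatever depth suits the UI. The Python
sketch below illustrates that with a few rows taken from the example
table above; the row layout and the function name group_by_depth are
assumptions made for the illustration.

```python
from collections import defaultdict

# (code, General_Category, typology levels, name), from the example table.
ROWS = [
    ("23CE", "So", ("Symbol", "Technical", "Keyboard"), "RETURN SYMBOL"),
    ("2602", "So", ("Symbol", "Weather"), "UMBRELLA"),
    ("260A", "So", ("Symbol", "Astrological"), "ASCENDING NODE"),
    ("2660", "So", ("Symbol", "Game", "Playing card", "Suit"),
     "BLACK SPADE SUIT"),
]


def group_by_depth(rows, depth):
    """Group characters by the first `depth` levels of their typology."""
    groups = defaultdict(list)
    for code, _gc, levels, _name in rows:
        groups[levels[:depth]].append(chr(int(code, 16)))
    return dict(groups)


if __name__ == "__main__":
    for key, chars in group_by_depth(ROWS, 2).items():
        print(" > ".join(key), chars)
```

Grouping at depth 1 puts all four characters in one "Symbol" pane, while
depth 2 separates the weather symbol from the astrological one.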

Currently I'm working with four levels of typology, but
this could easily be extended to five (or more), if finer
levels of distinction for some groups of characters proved
to be desirable. For example, arrows could be subcategorized
based on their shapes and orientations.

The first key here is staying flexible, so that the typology can
be extended and modified easily in the future, as may
prove suitable. The approach of merging annotations with
UnicodeData.txt makes it very easy to assign new
subtypes or change or subdivide ranges already assigned to
types and subtypes, without having to do extensive modification
of explicit listing files.
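
One way such subdivision could work is to let a narrower annotated range
take precedence over a broader one that contains it. The Python sketch
below assumes that rule; it is not necessarily how the actual
implementation resolves overlaps. The broad "Symbol" assignment for
U+2600..U+260D and the name classify are illustrative assumptions.

```python
# Hypothetical annotations: a broad range, later refined by narrower ones.
ANNOTATIONS = [
    (0x2600, 0x260D, ("Symbol",)),
    (0x2602, 0x2602, ("Symbol", "Weather")),
    (0x260A, 0x260A, ("Symbol", "Astrological")),
]


def classify(code, annotations=ANNOTATIONS):
    """Return the levels of the narrowest annotated range containing code."""
    matches = [
        (last - first, levels)
        for first, last, levels in annotations
        if first <= code <= last
    ]
    if not matches:
        return ()  # no annotation covers this code point
    return min(matches, key=lambda match: match[0])[1]


if __name__ == "__main__":
    for code in (0x2602, 0x2606, 0x260A):
        print(f"{code:04X}", " > ".join(classify(code)))
```

Under this rule, refining a subtype means adding one line for the
narrower range; the broad entry and everything else stay untouched.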

The second key is a corollary to the first: this MUST NOT turn
into another normative data file and/or normative set of
property values. That is the trap that has always afflicted
the General_Category property and which makes it useless for
this kind of finer-level categorization.

I already have an implementation in hand that can produce this
data, and have done a first pass typological classification
of all the Unicode characters along these lines.

If the UTC is interested in pursuing this, I would suggest
developing a draft for a new Unicode Technical Report that
would explain the general approach towards maintaining
a typology of Unicode characters, explain the data file
format, and have an associated informative data file that
people could use to get this kind of typological
information about the characters.

The closest analogy among our existing documents is
UTR #25, "Unicode Support for Mathematics" and its
associated informational data file, MathClass.txt, 
which classifies mathematical characters by their typographical 
behavior.