Re: PRI #200: Draft UTR #49, Unicode Character Categories

From: Philippe Verdy <>
Date: Thu, 14 Jul 2011 03:08:03 +0200

2011/7/14 <>:
> The Unicode Technical Committee has posted a new issue for public review and
> comment. Details are on the following web page:
> Review period for the new item closes on July 27, 2011.
> Please see the page for links to discussion and relevant documents. Briefly,
> the new issue is:
> PRI #200 Draft UTR #49: Unicode Character Categories

Here is a copy of the comment I posted to the Online Report (it can
still be commented on):


It looks like the subcategories for [Letter] are not very well
formulated in the current CharacterCategories.txt data file, and are
in fact inconsistent.

The most obvious level-2 subcategories should include [Consonant],
[Vowel], and [Half-consonant]. Other distinctions like [Digraph]
should be moved to a lower level.

Note that [Consonant] has been applied to the full basic Arabic
abjad, but not to the similar Hebrew abjad.

In fact, the file should also distinguish true [Consonant]s from
[Half-consonant]s, the latter including letters that can act,
depending on their context, either as consonants (acting like a mute
or stop consonant with a default inherent or implied vowel, possibly
modified when the letter acts as a holder for an optional vocalic
diacritic/mark) or as vowels (e.g. Alef and Yod in Arabic or Hebrew;
Y in Latin; RA and LA in Indic scripts).

Yes, this may be fuzzy across languages that use the same script
(e.g. W in German is undoubtedly a consonant, but in many languages
it is most often a gliding consonant; or V in Roman Latin, where it
was not distinguished from U), but at least categorizing such a
letter as [Half-consonant] will flag the ambiguity of its use.

Then the third level should be for case distinctions: [Lowercase],
[Uppercase], [Titlecase] and [Uncased] (in scripts that have case
distinctions).

The last level can then be used for [Ligatured] (such as and , even
if they are still considered plain letters, this still allows
specific languages to treat them as letter pairs for collation
purposes), [Digraph] (such as IJ), [Final] (e.g. Greek final sigma).
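
To make the proposed ordering of levels concrete, here is a minimal
Python sketch. It is only an illustration: the CategoryPath type, its
field names and the sample assignments are hypothetical and are not
taken from CharacterCategories.txt. The case level is derived here
from the General_Category property, which is an assumption of this
sketch and not something the draft prescribes.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CategoryPath:
    """Hypothetical category path below [Letter]."""
    role: str                   # level 2: Consonant, Vowel, Half-consonant
    case: str                   # level 3: Lowercase, Uppercase, Titlecase, Uncased
    form: Optional[str] = None  # level 4: Ligatured, Digraph, Final (optional)

    def label(self) -> str:
        parts = ["Letter", self.role, self.case]
        if self.form:
            parts.append(self.form)
        return "/".join(parts)


def case_of(ch: str) -> str:
    """Level 3, derived from General_Category (assumption of this sketch)."""
    gc = unicodedata.category(ch)
    return {"Lu": "Uppercase", "Ll": "Lowercase", "Lt": "Titlecase"}.get(gc, "Uncased")


# Illustrative level-2 assignments only; not real data.
SAMPLE_ROLES = {
    "b": "Consonant",
    "a": "Vowel",
    "Y": "Half-consonant",
    "\u05D9": "Half-consonant",  # HEBREW LETTER YOD
    "\u03C2": "Consonant",       # GREEK SMALL LETTER FINAL SIGMA
    "\u0132": "Vowel",           # LATIN CAPITAL LIGATURE IJ (assumed vowel, as in Dutch)
}

# Illustrative level-4 assignments only; not real data.
SAMPLE_FORMS = {
    "\u03C2": "Final",
    "\u0132": "Digraph",
}


def categorize(ch: str) -> CategoryPath:
    """Build the path for a character listed in the sample tables."""
    return CategoryPath(SAMPLE_ROLES[ch], case_of(ch), SAMPLE_FORMS.get(ch))


if __name__ == "__main__":
    for ch in ["b", "Y", "\u05D9", "\u03C2", "\u0132"]:
        print("U+%04X" % ord(ch), categorize(ch).label())
```

With this ordering, a process that only cares about the
consonant/vowel distinction can stop at level 2, and one that only
needs case can ignore level 4.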

The content of this (informative) file should also be consistent with
the content of the DUCET (which obviously contains case distinctions
at the third level). However, secondary differences exposed in the
DUCET (e.g. diacritic differences) should probably not be categorized.

And like the DUCET, it should be tailorable in applications or for
specific languages (for example in the CLDR database), so that these
categories are just the defaults used when there is no tailoring. I
do think that the possibility of such tailoring should be stated
explicitly in the draft UTR #49!
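
As a rough illustration of what such tailoring could look like, here
is a small Python sketch. Everything in it is an assumption made for
the example: the table names, the use of language tags as keys and
the fallback rule are hypothetical, and this is not the format used
by CLDR or by the draft data file.

```python
from typing import Optional

# Default (untailored) categories; illustrative values only.
DEFAULT_CATEGORIES = {
    "w": "Half-consonant",
    "y": "Half-consonant",
}

# Hypothetical per-language overrides, keyed by language tag.
TAILORINGS = {
    "de": {"w": "Consonant"},  # W in German is a plain consonant
}


def category_for(ch: str, locale: Optional[str] = None) -> Optional[str]:
    """Return the tailored category if one applies, else the default.

    The locale tag is shortened at each hyphen ("de-AT" -> "de") until
    an override is found; this fallback rule is an assumption.
    """
    tag = locale or ""
    while tag:
        override = TAILORINGS.get(tag, {}).get(ch)
        if override is not None:
            return override
        tag = tag.rpartition("-")[0]
    return DEFAULT_CATEGORIES.get(ch)


if __name__ == "__main__":
    print(category_for("w"))           # default category
    print(category_for("w", "de-AT"))  # tailored through "de"
```

The point is only that the published file supplies the bottom layer,
and that an application or a language can override individual entries
without replacing the whole table.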


-- Philippe.
Received on Wed Jul 13 2011 - 20:10:05 CDT
