Re: unicodedata-2.0.14.txt

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Mar 12 1998 - 15:02:50 EST


>
> Following is listed in unicodedata-2.0.14.txt:
>
> 00AA FEMININE ORDINAL INDICATOR : general category "Ll"
> 00B5 MICRO SIGN : general category "Ll"
> 00BA MASCULINE ORDINAL INDICATOR : general category "Ll"
>
> This seems to be incorrect. Any comments?

This is not incorrect. The category assignment for these
three characters has been stable as "Ll" since the origination of
the UnicodeData character database in 1993 (UnicodeData-1.2.3.txt),
and has been carried forward into Unicode 2.0 (and the
prospective Unicode 2.1).

Note the *entire* entries:

00AA;FEMININE ORDINAL INDICATOR;Ll;0;ON;<super> 0061;;;;N;;;;;
00BA;MASCULINE ORDINAL INDICATOR;Ll;0;ON;<super> 006F;;;;N;;;;;
00B5;MICRO SIGN;Ll;0;ON;<compat> 03BC;;;;N;;;;;

The decomposition field is important here. Each of these three
characters is a compatibility equivalent to a regular Latin or
Greek lower-case letter. The Unicode Character Database consistently
treats all such compatibility equivalents of single Latin or
Greek lower-case letters as category "Ll".
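
To make the field layout concrete, here is a minimal Python sketch that
splits the three entries quoted above on ";" and pulls out the general
category and the decomposition field. The function name parse_record and
the dictionary keys are made up for illustration, and the sketch assumes
the decomposition field holds exactly one tag followed by one code point,
which is true of these three entries but not of the database in general.

```python
# The three records exactly as quoted in this message.
RECORDS = """\
00AA;FEMININE ORDINAL INDICATOR;Ll;0;ON;<super> 0061;;;;N;;;;;
00BA;MASCULINE ORDINAL INDICATOR;Ll;0;ON;<super> 006F;;;;N;;;;;
00B5;MICRO SIGN;Ll;0;ON;<compat> 03BC;;;;N;;;;;
"""


def parse_record(line):
    """Split one semicolon-separated record into the fields discussed here.

    Assumption: field 5 is "<tag> XXXX" with a single target code point.
    """
    fields = line.split(";")
    tag, _, target = fields[5].partition(" ")
    return {
        "code": int(fields[0], 16),               # field 0: code value (hex)
        "name": fields[1],                        # field 1: character name
        "category": fields[2],                    # field 2: general category
        "decomp_tag": tag,                        # field 5: compatibility tag
        "decomp_target": chr(int(target, 16)),    # field 5: target character
    }


parsed = [parse_record(line) for line in RECORDS.splitlines()]

for rec in parsed:
    print(
        f"U+{rec['code']:04X} {rec['name']}: category {rec['category']}, "
        f"{rec['decomp_tag']} U+{ord(rec['decomp_target']):04X}"
    )
```

Each record comes back with category "Ll" and a decomposition target that
is itself an ordinary lower-case letter (a, o, and Greek mu).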

When taken in the larger context of all the compatibility
equivalents for various letters, it is quite clear that
treating characters such as FEMININE ORDINAL INDICATOR (which
after all is just a small 'a' written above the baseline) as
letters makes more sense than treating them as arbitrary
symbols. And the MICRO SIGN is really just a Greek letter mu
which was grandfathered into ISO 8859-1 to make it possible to
write such metric expressions as "µsec" and "µg" in a Latin-1 text
stream.
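
The same relationship can be checked against a library that ships the
Unicode Character Database. The sketch below uses Python's standard
unicodedata module and compatibility normalization (NFKC) to fold each of
the three characters to its letter equivalent. The helper name compat_fold
is invented for illustration. The sketch deliberately does not print the
general category, because the value reported depends on the version of the
database bundled with the interpreter and may differ from the 2.0.14 file
discussed here.

```python
import unicodedata

# FEMININE ORDINAL INDICATOR, MASCULINE ORDINAL INDICATOR, MICRO SIGN
SAMPLES = ["\u00aa", "\u00ba", "\u00b5"]


def compat_fold(ch):
    """Return the NFKC (compatibility-composed) form of a character."""
    return unicodedata.normalize("NFKC", ch)


folded = {ch: compat_fold(ch) for ch in SAMPLES}

for ch in SAMPLES:
    target = folded[ch]
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
        f"U+{ord(target):04X} {unicodedata.name(target)}"
    )
```

The three characters fold to LATIN SMALL LETTER A, LATIN SMALL LETTER O and
GREEK SMALL LETTER MU respectively.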

I am aware of such contrary specifications as the Example Locale 2
for ISO 8859-1 in the X/Open Internationalisation Guide, which
treats <inverted-exclamation-mark>;...;<inverted-question-mark>
(i.e. 0xA0..0xBF) as LC_CTYPE "punct". I consider that to be
a dreadful underspecification of the category differences
apparent in the characters U+00A0..U+00BF. FCD 14652, on the other
hand, at least recognizes U+00AA and U+00BA as "alpha", which
is a step forward.

--Ken Whistler

>
> Kind regards,
> Bob Verbrugge
>


