Re: Property-Problems

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 05 2000 - 20:46:14 EST


Tobias Hunger asked:

> 1.) What are the EastAsian Width properties of the characters in the new
> Private Use areas (Plane 15/16)?

"A", the same as for the private use area in the BMP:

E000;A # <Private Use, First>
F8FF;A # <Private Use, Last>

>
> 2.) What are the Linebreaking Properties for those characters?

"AI", the same as for the private use area in the BMP:

E000;AI # <Private Use, First>
F8FF;AI # <Private Use, Last>

Property assignments for the private use areas are defaults only, of
course, and could always be overridden by an application which
actually makes use of particular interpretations of private use
character codes.
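
As a minimal sketch of "defaults that an application may override", the
lookup below returns the two default values quoted above for any private-use
code point, including Planes 15 and 16. The function names and the
`overrides` table are hypothetical, introduced only for illustration; the
default values are the ones given in this reply.

```python
# Private-use ranges: the BMP area plus Planes 15 and 16.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]

# Defaults as stated in the reply above.
DEFAULTS = {"east_asian_width": "A", "line_break": "AI"}


def is_private_use(cp):
    """True if the code point lies in one of the private-use ranges."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def private_use_properties(cp, overrides=None):
    """Return default properties for a private-use code point, else None.

    `overrides` is a hypothetical application-supplied table of the form
    {code_point: {property_name: value}}.
    """
    if not is_private_use(cp):
        return None
    props = dict(DEFAULTS)
    if overrides and cp in overrides:
        props.update(overrides[cp])
    return props
```

An application that treats, say, U+F0000 as a wide character would pass
`{0xF0000: {"east_asian_width": "W"}}` as `overrides`; every other
private-use code point keeps the defaults.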

>
> 3.) How do you generate the PropList File? Some of the properties are quite
> obvious (for example the Bidi-Properties), but others are a mystery to me.

The PropList.txt file is an informative data file contributed by
Sybase. It is undergoing review by the Unicode Technical Committee
now for release as a more definitive list of properties maintained
by the UTC itself.

Some of the properties currently in PropList.txt are derived entirely
from information in UnicodeData.txt. They were included in PropList.txt
despite the redundancy because PropList.txt gives a different "view"
on properties: a property-by-property list of all the characters that
have a particular property.
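
To illustrate what such a "view" amounts to for the derivable properties,
here is a rough sketch that inverts a few lines in the semicolon-separated
UnicodeData.txt layout into a value-by-value list of code points. The
function name `by_property` is made up for this example, the sample is only
four lines, and this is not how PropList.txt itself is produced.

```python
from collections import defaultdict

# A few sample lines in the UnicodeData.txt layout (illustration only).
SAMPLE = """\
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

BIDI_FIELD = 4  # field 0 is the code point, field 4 the bidi category


def by_property(lines, field_index):
    """Group code points by the value found in one field of each line."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        groups[fields[field_index]].append(int(fields[0], 16))
    return dict(groups)


bidi_view = by_property(SAMPLE.splitlines(), BIDI_FIELD)
for value, code_points in sorted(bidi_view.items()):
    print(value, " ".join(f"{cp:04X}" for cp in code_points))
```

Changing `field_index` to 2 gives the same kind of list keyed on General
Category instead.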

Other properties in PropList.txt are *not* derivable from
UnicodeData.txt. These are the ones currently under review, and a
revised version of PropList.txt will be issued with Unicode 3.1,
with further explanation of these properties.

> Some examples:
>
> (upper|lower|title)case Properties:
> I thought it had something to do with the General Categories Lu, Ll and Lt,
> but that assumption was obviously wrong.
> For example U+02B6 is obviously an uppercase character (looking at the
> drawing in the book), has the Category Lm and the lowercase-Property.

It has *something* to do with Lu, Ll, and Lt, but there are differences,
because of the deficiencies of General Category in UnicodeData.txt for
carrying all information about case.

In this particular case, you are talking about "small-caps" phonetic
modifier letters. Such letters are notionally lowercase in usage, even
though their form is derived from an uppercase letter. Their General
Category assignment of "Lm" fudges the issue.
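
The gap between General Category and case can be inspected with Python's
standard `unicodedata` module, as in the sketch below. Note the assumption:
Python bundles a much later version of the Unicode data than 3.0, so this
shows the same kind of mismatch rather than the PropList.txt assignments
discussed here. The `info` dictionary is just a convenience for this
example.

```python
import unicodedata

ch = "\u02b6"  # the character from the question

info = {
    "name": unicodedata.name(ch),
    # General Category comes straight from the character database.
    "general_category": unicodedata.category(ch),
    # str.islower()/isupper() are Python's own case tests; their result
    # need not follow from the General Category alone.
    "islower": ch.islower(),
    "isupper": ch.isupper(),
}

for key, value in info.items():
    print(f"{key}: {value}")
```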

>
> Decimal Digit Value-Property:
> I expected that all characters that have a decimal digit value set in the
> Character Database had this property set. But that is not always the case.
> The same goes for the Numeric and Digit Property.

The differences do have explanations.

Again, what it comes down to is that the character properties provided
in UnicodeData.txt, while a good, general start on properties, don't
suffice to make all the distinctions people may need to make.
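
For reference, the three numeric fields of the character database
(decimal digit value, digit value, numeric value) can be read side by side
with Python's `unicodedata` module. The sketch below only shows that the
three fields are not filled in together for every character; it does not
reproduce the PropList.txt properties, and `numeric_fields` is a name
invented for this example.

```python
import unicodedata


def numeric_fields(ch):
    """Return (decimal, digit, numeric) for ch, with None where unset."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


for ch in ("7", "\u00b2", "\u00bd"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), numeric_fields(ch))
```

DIGIT SEVEN has all three values, SUPERSCRIPT TWO has a digit and a numeric
value but no decimal digit value, and VULGAR FRACTION ONE HALF has only a
numeric value.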

>
> 4.) Which characters are those in the Virama, Joining Character Classes
> mentioned in Table 5-3? It would be great if there was a Virama and a Joining
> Property in the Property List.

The viramas (halants) are shown on page 81 of the Unicode Standard, Version 3.0.
The clue is that they all have combining class 9.
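
The combining class clue can be turned into a list mechanically. A minimal
sketch, assuming Python's `unicodedata` module as the data source: since it
carries a later Unicode version than 3.0, the result contains more
characters than the ones shown in the book. The function name is
hypothetical.

```python
import sys
import unicodedata


def combining_class_9():
    """Return all code points whose canonical combining class is 9."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.combining(chr(cp)) == 9
    ]


viramas = combining_class_9()
for cp in viramas:
    # Fall back to a placeholder if the character has no name.
    print(f"{cp:04X}", unicodedata.name(chr(cp), "<unnamed>"))
```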

--Ken

> Looking for hints I found several VIRAMA-characters in the datafile. Do I
> need to use those with VIRAMA in the 'character name'-field and/or in the
> 'unicode 1.0 name'-field of UnicodeData*.txt?
>


