UTC/2000-007

Posted on unicore 2000-01-11
From: Mark Davis
Re: Clarification of "cased" in UTR21.

Based on the following email, I propose a change to the definition of
"cased" in UTR21.


> To:     Mark Davis/Cupertino/IBM@IBMUS
> cc:     <john_thomson@sil.org>
> Subject:     questions regarding Special Casing
>        Hello,
>        Thanks for your contributions to spelling out Unicode for us
>        developers and users.
>        I'm working with a group that's developing linguistic tools.
>        One of our goals is to comply with the Unicode 3.0 standard,
>        including its specifications for character properties and case
>        mappings.  In reading your UTR #21 (Case Mappings -- revision
>        3.0 11/03/1999) there were a couple of points that were unclear
>        to me.
>
>        Under section 2, "Guidelines", the bullets say,
>             In all of the guidelines given below ... Treat 0345
>             "combining iota subscript" as a lowercase letter.
>        Currently in the Unicode data file UnicodeData.txt (v 3.0),
>        character 0345's general category is "Mn" (mark, non-spacing).
>        Is your guideline here a correction, i.e. should 0345's general
>        category be changed to "Ll"?

No, what that means is that while for general purposes 0345 is correctly
characterized as Mn, for the purposes of case mappings *in the following
discussion* it should be handled differently.
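
A minimal sketch of that distinction in Python, using the standard
unicodedata module. The helper name is_lowercase_for_casing is hypothetical
and is not taken from UTR21; it only illustrates keeping the general
category as Mn while special-casing 0345 inside case-mapping logic.

```python
import unicodedata

# U+0345, the combining iota subscript (named COMBINING GREEK YPOGEGRAMMENI
# in the character database).
COMBINING_IOTA_SUBSCRIPT = "\u0345"


def is_lowercase_for_casing(ch: str) -> bool:
    """Hypothetical helper: is ch lowercase *for case-mapping purposes*?

    The general category of U+0345 stays Mn; only this predicate treats
    it as if it were a lowercase letter.
    """
    if ch == COMBINING_IOTA_SUBSCRIPT:
        return True
    return unicodedata.category(ch) == "Ll"


# The general-purpose property is untouched by the guideline.
print(unicodedata.category(COMBINING_IOTA_SUBSCRIPT))
print(is_lowercase_for_casing(COMBINING_IOTA_SUBSCRIPT))
```

The point of the sketch is that the override lives in the casing code, not
in the character database.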

> Another bullet in that list says
>             A character is _cased_ if it is marked as uppercase,
>             lowercase, or titlecase (Lu, Ll, Lt).
>        If this definition is complete, then is character 0345
>        considered cased?  In a similar vein, are characters that have
>        explicit case mappings considered cased, even if they are not
>        "letters"?
>        E.g.
>        24B6;CIRCLED LATIN CAPITAL LETTER A;So;0;L;<circle> 0041;;;;N;;;;24D0;
>        2160;ROMAN NUMERAL ONE;Nl;0;L;<compat> 0049;;;1;N;;;;2170;

This is a good point. For non-letters, it is a matter of trying to match
user expectations. Suppose that a user selected a paragraph of text and
lowercased it using a menu command. Would s/he expect to see roman
numerals and circled letters lowercased? I suspect so.

>
>        In practical terms, if a string contains U+24B6 and no
>        lowercase characters, should it be considered an uppercase
>        string?  If this string is converted to lowercase, should the
>        24B6 be converted to 24D0?
>
>        It would perhaps be helpful to mention in your document the
>        existence of non-letters that have case mappings, and clarify
>        what the correct treatment of them would be according to the
>        standard.

Agreed. The document should probably specify _cased_ to include non-letters
that have case mappings. I will bring this up at the next Unicode Technical
Committee meeting.
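
To make the proposal concrete, here is a rough sketch in Python comparing
the current category-only definition with a broadened one. The function
names are hypothetical, the broadened test is only one possible reading of
"non-letters that have case mappings", and it relies on whatever version of
the Unicode character database ships with the Python interpreter rather
than on the 3.0 data files.

```python
import unicodedata


def is_cased_by_category(ch: str) -> bool:
    """Current UTR21 wording: cased means general category Lu, Ll, or Lt."""
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def has_case_mapping(ch: str) -> bool:
    """Assumption: a character 'has a case mapping' if lowercasing,
    uppercasing, or titlecasing it changes it."""
    return ch.lower() != ch or ch.upper() != ch or ch.title() != ch


def is_cased_proposed(ch: str) -> bool:
    """Sketch of the broadened definition: the category test, or any
    character that has a case mapping."""
    return is_cased_by_category(ch) or has_case_mapping(ch)


# 24B6 (So), 2160 (Nl), and 0345 (Mn) fail the category test but have
# case mappings; the digit "1" is a control that has neither.
for ch in ("\u24b6", "\u2160", "\u0345", "A", "1"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          is_cased_by_category(ch), is_cased_proposed(ch))
```

Under this reading, lowercasing a string that contains 24B6 would map it to
24D0, which matches the user expectation described above.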

