L2/05-304

Subj: General category and bidi property for some Indic characters
Date: 10/19/2005
Source: Asmus Freytag

Action: for UTC review

I have analyzed the following request for GC and bidi properties for some
Indic characters. These were sent originally via the reporting form, but
I understand that they will be submitted as a separate UTC document.

For the sake of definiteness I have retained the request in the form it reached
me.

A./

>-----Original Message-----
>Date/Time: Wed Oct 19 09:22:33 CST 2005
>Contact: kentk@cs.chalmers.se
>Name: Kent Karlsson
>Opt Subject: General category and bidi property for some Indic characters
>
>
>
>
Comment:

In terms of bidi properties, the distinction between bidi class L and
bidi class NSM matters effectively only when a non-spacing mark is
applied to a character that is not itself L. For script-specific marks
from a left-to-right script, the base character is nearly always L,
*except* when the mark is exhibited in isolation, i.e. is applied to an
NBSP.

An NBSP followed by NSM would act as bidi neutral, while an NBSP
followed by L would act as a neutral followed by an L, which could
separate the two if they were in a mixed directionality context.

Here is an illustration of the two concepts:

ABC <NBSP><NSM> def     ---->   def <NSM><NBSP> ABC

ABC <NBSP><L> def       ---->   <L> def<NBSP> ABC


where ABC/CBA is some right-to-left context and def some left-to-right
context (in other words, the case matters). The spaces are not
significant (removing them does not change the examples in any other
way). <L> and <NSM> are each one character of the given bidi class.
<NBSP> stands for the no-break space.
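
The reordering in the two illustrations can be reproduced with a small
sketch. The Python below is not an implementation of UAX#9; it is a
simplified model that assumes a right-to-left paragraph (embedding level
1), no explicit embeddings and no digits, and treats each run such as ABC
or def as one opaque token. The function name reorder and the token
labels are made up for illustration.

```python
def reorder(tokens, base_level=1):
    """Return token labels in visual order (simplified model, not full UAX#9).

    tokens: list of (label, bidi_class) pairs in logical order, where
    bidi_class is one of "L", "R", "NSM", "CS" (NBSP) or "WS" (space).
    """
    embed = "R" if base_level % 2 else "L"
    classes = [c for _, c in tokens]
    n = len(classes)

    # W1: a non-spacing mark takes the class of the preceding character.
    for i in range(n):
        if classes[i] == "NSM":
            classes[i] = classes[i - 1] if i > 0 else embed

    # Assumption: no digits are present, so separators and spaces all
    # end up acting as neutrals.
    classes = ["N" if c in ("CS", "WS") else c for c in classes]

    # N1/N2: a run of neutrals takes the direction of its neighbours if
    # they agree, otherwise the embedding direction.
    i = 0
    while i < n:
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else embed
        after = classes[j] if j < n else embed
        classes[i:j] = [before if before == after else embed] * (j - i)
        i = j

    # I1/I2: characters running against the embedding go one level up.
    levels = [base_level + (c != embed) for c in classes]

    # L2: from the highest level down to the lowest odd level, reverse
    # every contiguous run of tokens at that level or higher.
    order = list(range(n))
    lowest_odd = base_level if base_level % 2 else base_level + 1
    for level in range(max(levels, default=0), lowest_odd - 1, -1):
        i = 0
        while i < n:
            j = i
            while j < n and levels[order[j]] >= level:
                j += 1
            order[i:j] = order[i:j][::-1]
            i = max(j, i + 1)
    return "".join(tokens[k][0] for k in order)


SP = (" ", "WS")
NBSP = ("<NBSP>", "CS")
print(reorder([("ABC", "R"), SP, NBSP, ("<NSM>", "NSM"), SP, ("def", "L")]))
print(reorder([("ABC", "R"), SP, NBSP, ("<L>", "L"), SP, ("def", "L")]))
```

In this model the NSM inherits the neutral behavior of the NBSP and
travels with it, while the L joins the following left-to-right run and
leaves the NBSP behind.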

Therefore, from the bidi perspective, NSM is required for all combining
characters that can be used on neutrals (or that need NBSP to be shown
in isolation).

This class includes many characters classed as Mc currently, because
those that have a significant overhang over the base character would
seem to require an NBSP to display correctly. For example, 0B57 ORIYA AU
LENGTH MARK (gc=Mc) overhangs as far to the left as 0B56 ORIYA AI LENGTH
MARK (gc=Mn).
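
The properties of the two Oriya length marks can be compared directly. A
minimal sketch using Python's standard unicodedata module follows; the
helper name props is made up for illustration, and the values reported
are those of whatever UCD version the Python build ships, which is not
necessarily the version under discussion here.

```python
import unicodedata


def props(cp):
    """Return (name, general category, bidi class) for a code point."""
    ch = chr(cp)
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.bidirectional(ch))


for cp in (0x0B56, 0x0B57):
    name, gc, bc = props(cp)
    print(f"{cp:04X} {name}: gc={gc} bc={bc}")
```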

The same is true for two-part vowels, which visually enclose a base
character (such as 0BCA TAMIL VOWEL SIGN O). These have been given gc=Mc
(not Me) and are bidi class L not NSM. If they are shown in isolation
with an NBSP, they could become separated from their base character in
bidi reordering. As a result, they might be graphically applied to an
unrelated base character after reordering.

0BCA decomposes canonically into 0BC6 and 0BBE. The bidi class of L for
0BBE is not an issue, since this character is clearly a spacing
combining mark (simply follows the base character on the line). However,
0BC6 precedes the base character. Again, showing such a character in
isolation would seem to require using a placeholder, such as NBSP for
the base character. If such a placeholder is not of bidi class L (and
NBSP is not), then the placeholder can be separated from the combining mark.
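
The decomposition can be checked with a short sketch, again assuming
Python's standard unicodedata module and the UCD version bundled with it.
The function name decompose is made up for illustration.

```python
import unicodedata


def decompose(ch):
    """List (code point, gc, bidi class) for each character of NFD(ch)."""
    return [(f"{ord(c):04X}",
             unicodedata.category(c),
             unicodedata.bidirectional(c))
            for c in unicodedata.normalize("NFD", ch)]


# 0BCA TAMIL VOWEL SIGN O: the first part (0BC6) is displayed before the
# base character, the second part (0BBE) after it.
for code, gc, bc in decompose("\u0BCA"):
    print(code, gc, bc)
```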

Trying to assign gc=Mn and bidi=NSM to the left part of these two-part
vowels would mean that the decomposition of the two-part vowel goes from
a single character with a single property to two characters with two
different properties. This is problematic in and of itself. It would
also not work as intended when trying to display the two-part vowel
using the combining sequence and NBSP. This is borne out by the
following three examples:

Two-part vowel as two characters <NSM><L>:

* ABC <NBSP><NSM><L> def ---->   <L> def<NSM><NBSP> ABC


Two-part vowel as <L><L> (two characters, with properties as currently
defined):

ABC <NBSP><L><L> def    ---->   <L><L> def<NBSP> ABC


Two-part vowel as <L> (single character, currently defined):

ABC <NBSP><L> def       ---->   <L> def<NBSP> ABC


In all examples, the <NBSP> becomes separated from the combining
character (expressed as <L>).

The conclusion is that with the current system, NBSP *cannot* be used
(by itself) to display any gc=Mc characters with bidi=L in isolation in
a mixed directionality context. When such characters have significant
visual overhang left or right, or enclose their base character, however,
a placeholder character, whether NBSP or a dotted circle, is required.

Conclusion

The solution in this case is to require the use of LRM in front of the NBSP.
(This should be documented in UAX#9 and the Indic chapters.) This should
be done consistently for any gc=Mc character (using an RLM for any that
are in an RTL script).
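
As a sketch of what such a sequence looks like in practice, the Python
below builds the mark-in-isolation string and lists the bidi class of
each character. The helper name isolated and its rtl_script parameter are
made up for illustration; the classes come from the UCD version bundled
with Python's unicodedata module.

```python
import unicodedata

LRM, RLM, NBSP = "\u200E", "\u200F", "\u00A0"


def isolated(mark, rtl_script=False):
    """Hypothetical helper: directional mark + NBSP placeholder + mark."""
    return (RLM if rtl_script else LRM) + NBSP + mark


seq = isolated("\u0BCA")  # TAMIL VOWEL SIGN O shown in isolation
print([unicodedata.bidirectional(c) for c in seq])
```

With the LRM in front, the NBSP sits between two characters of bidi
class L, so neutral resolution gives it the same direction as the mark
and the two stay together.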

In this context, the proposed changes listed below would introduce a minor,
formal consistency in property assignments, but they don't address the real
issue of using these characters in a bidi context.

A./

>These two are non-spacing:
>0BC1;TAMIL VOWEL SIGN U;Mc;0;L;;;;;N;;;;; -->> Mn, NSM
>0BC2;TAMIL VOWEL SIGN UU;Mc;0;L;;;;;N;;;;; -->> Mn, NSM
>
>These two are non-spacing (as already noted in the general category):
>0CBF;KANNADA VOWEL SIGN I;Mn;0;L;;;;;N;;;;; -->> L->NSM (already Mn)
>0CC6;KANNADA VOWEL SIGN E;Mn;0;L;;;;;N;;;;; -->> L->NSM (already Mn)
>
>
>These three are spacing:
>0D41;MALAYALAM VOWEL SIGN U;Mn;0;NSM;;;;;N;;;;; -->> Mc, L
>0D42;MALAYALAM VOWEL SIGN UU;Mn;0;NSM;;;;;N;;;;; -->> Mc, L
>0D43;MALAYALAM VOWEL SIGN VOCALIC R;Mn;0;NSM;;;;;N;;;;; -->> Mc, L
>
>
>These three are spacing (as already noted in the general category):
>1929;LIMBU SUBJOINED LETTER YA;Mc;0;NSM;;;;;N;;;;; -->> NSM->L (already Mc)
>192A;LIMBU SUBJOINED LETTER RA;Mc;0;NSM;;;;;N;;;;; -->> NSM->L (already Mc)
>192B;LIMBU SUBJOINED LETTER WA;Mc;0;NSM;;;;;N;;;;; -->> NSM->L (already Mc)
>
>
>This one is non-spacing (as already noted in the bidi property value):
>A802;SYLOTI NAGRI SIGN DVISVARA;Mc;0;NSM;;;;;N;;;;; -->> Mc --> Mn (already NSM)
>
>
>
>
>-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- (End of Report)
>