L2/05-304

Subj: General category and bidi property for some Indic characters
Date: 10/19/2005
Source: Asmus Freytag
Action: for UTC review

I have analyzed the following request for GC and LB properties for some Indic characters. These were sent originally via the reporting form, but I understand that they will be submitted as a separate UTC document. For the sake of definiteness I have retained the request in the form it reached me.

A./

>-----Original Message-----
>Date/Time: Wed Oct 19 09:22:33 CST 2005
>Contact: kentk@cs.chalmers.se
>Name: Kent Karlsson
>Opt Subject: General category and bidi property for some Indic characters
>
>
>

Comment:

In terms of bidi properties, the distinction between bidi class L and bidi class NSM matters effectively only when a non-spacing mark is applied to a character that is not itself L. For script-specific marks from a left-to-right script, the base character is nearly always L, *except* when the mark is exhibited in isolation, i.e. is applied to an NBSP. An NBSP followed by NSM would act as bidi neutral, while an NBSP followed by L would act as a neutral followed by an L, which could separate the two if they were in a mixed directionality context.

Here is an illustration of the two concepts:

    ABC def ----> def ABC
    ABC def ----> def ABC

where ABC/CBA is some right-to-left context and def some left-to-right context (in other words, the case matters). Spaces may or may not be significant (removing them does not change the examples in any other way). <NSM> and <L> are each one character of the given bidi class. <NBSP> stands for the no-break space.

Therefore, from the bidi perspective, NSM is required for all combining characters that can be used on neutrals (or that need NBSP to be shown in isolation). This class includes many characters currently classed as Mc, because those that have a significant overhang over the base character would seem to require an NBSP to display correctly.
For example, 0B57 ORIYA AU LENGTH MARK (gc=Mc) overhangs as far to the left as 0B56 ORIYA AI LENGTH MARK (gc=Mn).

The same is true for two-part vowels, which visually enclose a base character (such as 0BCA TAMIL VOWEL SIGN O). These have been given gc=Mc (not Me) and are bidi class L, not NSM. If they are shown in isolation with an NBSP, they could become separated from their base character in bidi reordering. As a result, they might be graphically applied to an unrelated base character after reordering.

0BCA decomposes canonically into 0BC6 and 0BBE. The bidi class of L for 0BBE is not an issue, since this character is clearly a spacing combining mark (it simply follows the base character on the line). However, 0BC6 precedes the base character. Again, showing such a character in isolation would seem to require using a placeholder, such as NBSP, for the base character. If such a placeholder is not of bidi class L (and NBSP is not), then the placeholder can be separated from the combining mark.

Trying to assign gc=Mn and bidi=NSM to the left part of these two-part vowels would mean that the decomposition of the two-part vowel goes from a single character with a single property to two characters with two different properties. This is problematic in and of itself. It would also not work as intended when trying to display the two-part vowel using the combining sequence and NBSP. This is borne out by the following three examples:

Two-part vowel as <NSM><L> (two characters): *

    ABC def ----> def ABC

Two-part vowel as <L><L> (two characters, with properties as currently defined):

    ABC def ----> def ABC

Two-part vowel as <L> (single character, currently defined):

    ABC def ----> def ABC

In all examples, the <NBSP> becomes separated from the combining character (expressed as <L>).

The conclusion is that with the current system, NBSP *cannot* be used (by itself) to display any gc=Mc characters with bidi=L in isolation in a mixed directionality context.
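The property values cited above can be checked with a short script. This is only a sketch: it uses Python's standard `unicodedata` module, which reports the values of whatever Unicode version ships with the interpreter, so results for some characters may differ from the 2005 data discussed here. The helper name `describe` is hypothetical and introduced purely for illustration.

```python
import unicodedata


def describe(cp: int) -> dict:
    """Return general category, bidi class and decomposition for a code point.

    Values come from the Unicode version bundled with the running Python,
    which is an assumption of this sketch (it may postdate this document).
    """
    ch = chr(cp)
    return {
        "code": f"{cp:04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "gc": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
        # Decomposition mapping as a list of tokens, empty if there is none.
        "decomposition": unicodedata.decomposition(ch).split(),
    }


if __name__ == "__main__":
    # NBSP, the two Oriya length marks, and the Tamil two-part vowel
    # together with the two parts of its canonical decomposition.
    for cp in (0x00A0, 0x0B56, 0x0B57, 0x0BCA, 0x0BC6, 0x0BBE):
        info = describe(cp)
        print(
            f"{info['code']} {info['name']}: gc={info['gc']} "
            f"bidi={info['bidi']} decomp={' '.join(info['decomposition'])}"
        )
```

The point of interest is that NBSP does not report bidi class L, while 0BCA and both of its parts do, which is the combination that allows the placeholder to be separated from the mark.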
When such characters have significant visual overhang left or right, or enclose their base character, however, a placeholder character, whether NBSP or a dotted circle, is required.

Conclusion

The solution in this case is to require the use of LRM in front of the NBSP. (This should be documented in UAX#9 and the Indic chapters.) This should be done consistently for any gc=Mc (using an RLM for any that are in an RTL script).

In this context, the proposed changes listed below would introduce a minor, formal consistency in property assignments, but they don't address the real issue of using these characters in a bidi context.

A./

>These two are non-spacing:
>0BC1;TAMIL VOWEL SIGN U;Mc;0;L;;;;;N;;;;; -->> Mn, NSM
>0BC2;TAMIL VOWEL SIGN UU;Mc;0;L;;;;;N;;;;; -->> Mn, NSM
>
>These two are non-spacing (as already noted in the general category):
>0CBF;KANNADA VOWEL SIGN I;Mn;0;L;;;;;N;;;;; -->> L->NSM (already Mn)
>0CC6;KANNADA VOWEL SIGN E;Mn;0;L;;;;;N;;;;; -->> L->NSM (already Mn)
>
>
>These three are spacing:
>0D41;MALAYALAM VOWEL SIGN U;Mn;0;NSM;;;;;N;;;;; -->> Mc, L
>0D42;MALAYALAM VOWEL SIGN UU;Mn;0;NSM;;;;;N;;;;; -->> Mc, L
>0D43;MALAYALAM VOWEL SIGN VOCALIC R;Mn;0;NSM;;;;;N;;;;; -->> Mc, L
>
>
>These three are spacing (as already noted in the general category):
>1929;LIMBU SUBJOINED LETTER YA;Mc;0;NSM;;;;;N;;;;; -->> NSM->L (already Mc)
>192A;LIMBU SUBJOINED LETTER RA;Mc;0;NSM;;;;;N;;;;; -->> NSM->L (already Mc)
>192B;LIMBU SUBJOINED LETTER WA;Mc;0;NSM;;;;;N;;;;; -->> NSM->L (already Mc)
>
>
>This one is non-spacing (as already noted in the bidi property value):
>A802;SYLOTI NAGRI SIGN DVISVARA;Mc;0;NSM;;;;;N;;;;; -->> Mc --> Mn (already NSM)
>
>
>
>
>-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- (End of Report)
>
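As a rough illustration of the convention proposed in the Conclusion (LRM in front of the NBSP, or RLM for a mark from a right-to-left script), the following sketch builds the character sequence for showing a combining mark in isolation. The function name `isolated_form` and its `rtl_script` parameter are hypothetical; the caller is assumed to know the script direction, since the sketch does not try to derive it.

```python
import unicodedata

LRM = "\u200E"   # LEFT-TO-RIGHT MARK
RLM = "\u200F"   # RIGHT-TO-LEFT MARK
NBSP = "\u00A0"  # NO-BREAK SPACE, used as the placeholder base


def isolated_form(mark: str, rtl_script: bool = False) -> str:
    """Return directional mark + NBSP + combining mark.

    Assumption of this sketch: any character whose general category
    starts with "M" is accepted, although the memo is mainly concerned
    with gc=Mc characters.
    """
    if len(mark) != 1 or not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a single combining mark")
    return (RLM if rtl_script else LRM) + NBSP + mark


if __name__ == "__main__":
    # 0BC6 TAMIL VOWEL SIGN E, the left part of the two-part vowel 0BCA.
    sequence = isolated_form("\u0BC6")
    print(" ".join(f"{ord(c):04X}" for c in sequence))  # prints: 200E 00A0 0BC6
```

The directional mark gives the NBSP a strong character of the same direction on its leading side, which is the effect the Conclusion relies on.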