L2/01-198
From: "Mark Davis"
Sent: Saturday, March 24, 2001 13:16 UTC
Agenda Item: Comments on W3C character model, grapheme definition.

>
> The following comments were received by the W3C on their character model.
> They should more appropriately have been directed to us.
>
> My take on them is:
>
> #1. There are many cases where identical-appearing character sequences are
> not identical under NFC. That was not a design criterion for NFC. We should
> make this clear, perhaps in an FAQ.
>
> 2, 3. Class 9 only has special meaning in the Unicode Standard in
> informative material describing grapheme boundaries. They do not affect the
> W3C Character Model, since they are neither directly nor indirectly
> referenced.
>
> However, we should clean up the grapheme definition to not depend on class
> #9. I had made a proposal after the last meeting for an addition to Unicode
> 3.2 to address that. See:
>
> http://www.unicode.org/unicode/members/L2001/01086-grapheme.htm
>
> We can then tune the properties independently of the combining class (which
> cannot be changed).
>
> Mark
>
> > On 26/02/2001 15:23:04 Robert R. Chilton wrote:
> > > This is a response to the circulation of the W3C Working Draft (26
> > > January 2001) of "Character Model for the World Wide Web 1.0" and
> > > focuses particularly on considerations of string identity matching and
> > > string indexing as regards characters of the Tibetan Block
> > > (U+0F00-U+0FCF).
> > >
> > > Since each and every character in the Tibetan Block that has a canonical
> > > decomposition is also listed in the Composition Exclusion Table, Unicode
> > > Normalization Form C is equivalent to Unicode Normalization Form D for
> > > any string consisting of only Tibetan Block characters. String identity
> > > matching and string indexing should therefore be relatively simple for
> > > characters in this block.
> > >
> > > Unfortunately, there are two characters in the Tibetan Block that could
> > > pose problems.
> > >
> > > 1. U+0F7E poses serious problems in string identity matching.
> > >
> > > U+0F7E RJES SU NGA RO is erroneously assigned a canonical combining
> > > class of zero whereas it should be assigned the same combining class
> > > (cc = 230) as its related forms at U+0F82 and U+0F83. A situation
> > > could easily arise wherein two strings which are identical in appearance
> > > will not match, even after normalization. As an example, here are two
> > > different ways that processes might encode the frequently occurring
> > > syllable HUUm:
> > >
> > >        0F67 0F7E 0F71 0F74   compared to   0F67 0F71 0F74 0F7E
> > >   [cc:    0    0  129  132             cc:    0  129  132    0 ]
> > >
> > > These two strings have identical appearance and meaning and should,
> > > after normalization, be an identity match. But because U+0F7E has a
> > > canonical combining class of 0, they will not match even after
> > > normalization. This serious problem (of non-matching) can be avoided
> > > if U+0F7E is assigned a correct canonical combining class of 230.
> > >
> > > 2. U+0F84 poses possible problems in string indexing.
> > >
> > > U+0F84 HALANTA is erroneously assigned a canonical combining class of
> > > nine, putting it in the class of Indic viramas. In other Indic scripts,
> > > these "vowel-killers" have a specific control behavior which is not
> > > applicable to the Tibetan Block -- where a different encoding model with
> > > a set of explicitly combining consonants [U+0F90 to U+0FBC] was adopted.
> > >
> > > The Tibetan mark halanta/virama (U+0F84) is simply a weak diacritical
> > > mark similar to U+0F39 or U+0F82 and it has no control function like
> > > U+094D. If a process interprets the U+0F84 as a class 9 character,
> > > the process might assume that U+0F84 is a non-printing character and
> > > therefore would not count it as a character during certain types of
> > > string indexing.
> > >
> > > 3. U+0F84 poses possible problems in text selection/cursor positioning.
> > >
> > > Similarly, if a process interprets the U+0F84 as a class 9 character,
> > > the process might assume that the U+0F84 is acting (in the manner of
> > > an Indic virama) as a joiner and it might wrongly assume that the glyph
> > > for the character that precedes U+0F84 is conjoined into a single
> > > ligature with the glyph for the character that follows the U+0F84. Due
> > > to these erroneous assumptions, the process might expect (e.g., when
> > > determining cursor movement/placement and text selection) a display
> > > width that does not correspond with the actual display width of the
> > > characters in question.
> > >
> > >
> > > Robert Chilton
> > > Technical Director, Asian Classics Input Project (USA)
> > > UCA & ISO-14651 Specialist, DDC Dzongkha Computing Project (Bhutan)
>
> ----------
> http://www.macchiato.com
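
The comparison described in comment 1 can be reproduced with a short sketch. It assumes Python's standard `unicodedata` module, whose combining-class data comes from whatever Unicode version the interpreter ships with, so it illustrates the reported behavior rather than the data of any particular Unicode release. The names `hum_a`, `hum_b`, `combining_classes`, and `nfc_equal` are hypothetical and chosen only for this example.

```python
import unicodedata

# The two encodings of the syllable HUUm quoted in comment 1.
hum_a = "\u0f67\u0f7e\u0f71\u0f74"
hum_b = "\u0f67\u0f71\u0f74\u0f7e"


def combining_classes(s):
    """Return the canonical combining class of each code point in s."""
    return [unicodedata.combining(ch) for ch in s]


def nfc_equal(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    for label, s in (("A", hum_a), ("B", hum_b)):
        codes = " ".join(f"{ord(ch):04X}" for ch in s)
        print(label, codes, combining_classes(s))
    # Canonical reordering never moves a mark across a class-0 character,
    # so U+0F7E keeps its position in both strings.
    print("equal after NFC:", nfc_equal(hum_a, hum_b))
    # Combining class of U+0F84, the subject of comments 2 and 3.
    print("U+0F84 combining class:", unicodedata.combining("\u0f84"))
```

With the combining classes quoted in the message (0 for U+0F7E, 129 for U+0F71, 132 for U+0F74), neither string is reordered during normalization, so the two sequences remain distinct and `nfc_equal` reports no match.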