L2/01-198


From: "Mark Davis" <markdavis34@home.com>
Sent: Saturday, March 24, 2001 13:16

UTC Agenda Item: Comments on W3C character model, grapheme definition.

>
> The following comments were received by the W3C on their character model.
> They should more appropriately have been directed to us.
>
> My take on them is:
>
> #1. There are many cases where identical-appearing character sequences are
> not identical under NFC. That was not a design criterion for NFC. We should
> make this clear, perhaps in an FAQ.
>
> 2, 3. Class 9 only has special meaning in the Unicode Standard in
> informative material describing grapheme boundaries. They do not affect
> the W3C Character Model, since they are neither directly nor indirectly
> referenced.
>
> However, we should clean up the grapheme definition to not depend on class
> #9. I had made a proposal after the last meeting for an addition to Unicode
> 3.2 to address that. See:
>
> http://www.unicode.org/unicode/members/L2001/01086-grapheme.htm
>
> We can then tune the properties independently of the combining class
> (which cannot be changed).
>
> Mark
>
> > On 26/02/2001 15:23:04 Robert R. Chilton wrote:
> > > This is a response to the circulation of the W3C Working Draft (26
> > > January 2001) of "Character Model for the World Wide Web 1.0" and
> > > focuses particularly on considerations of string identity matching and
> > > string indexing as regards characters of the Tibetan Block
> > > (U+0F00-U+0FCF).
> > >
> > > Since each and every character in the Tibetan Block that has a
> > > canonical decomposition is also listed in the Composition Exclusion
> > > Table, Unicode Normalization Form C is equivalent to Unicode
> > > Normalization Form D for any string consisting of only Tibetan Block
> > > characters.  String identity matching and string indexing should
> > > therefore be relatively simple for characters in this block.
> > >
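The NFC/NFD equivalence claimed above can be spot-checked one code point at a
time. The following is only a minimal sketch: it assumes Python's standard
`unicodedata` module, whose answers depend on the Unicode data version bundled
with the interpreter, and it tests single code points rather than arbitrary
strings. The names `TIBETAN_RANGE` and `nfc_differs_from_nfd` are hypothetical
and chosen for illustration.

```python
import unicodedata

# U+0F00-U+0FCF, the range cited in the message (an assumption of this sketch).
TIBETAN_RANGE = range(0x0F00, 0x0FD0)


def nfc_differs_from_nfd(code_points):
    """Return the code points whose NFC and NFD forms are not the same string."""
    differing = []
    for cp in code_points:
        ch = chr(cp)
        nfc = unicodedata.normalize("NFC", ch)
        nfd = unicodedata.normalize("NFD", ch)
        if nfc != nfd:
            differing.append(cp)
    return differing


if __name__ == "__main__":
    # An empty list means no code point in the range normalizes differently
    # under NFC and NFD.
    print([f"U+{cp:04X}" for cp in nfc_differs_from_nfd(TIBETAN_RANGE)])
```

An empty result is consistent with the claim, but it does not by itself prove
the claim for every possible string.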
> > > Unfortunately, there are two characters in the Tibetan Block that
> > > could pose problems.
> > >
> > > 1.  U+0F7E poses serious problems in string identity matching.
> > >
> > > U+0F7E RJES SU NGA RO is erroneously assigned a canonical combining
> > > class of zero whereas it should be assigned the same combining class
> > > (cc = 230) as its related forms at U+0F82 and U+0F83.  A situation
> > > could easily arise wherein two strings which are identical in
> > > appearance will not match, even after normalization.  As an example,
> > > here are two different ways that processes might encode the frequently
> > > occurring syllable HUUm:
> > >
> > >      0F67 0F7E 0F71 0F74   compared to   0F67 0F71 0F74 0F7E
> > > [cc:  0    0   129  132              cc:  0   129  132   0   ]
> > >
> > > These two strings have identical appearance and meaning and should,
> > > after normalization, be an identity match.  But because U+0F7E has a
> > > canonical combining class of 0, they will not match even after
> > > normalization.  This serious problem (of non-matching) can be avoided
> > > if U+0F7E is assigned a correct canonical combining class of 230.
> > >
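The HUUm comparison above can be reproduced with a short script. This is a
sketch only, assuming Python's standard `unicodedata` module; the combining
classes it reports come from the Unicode data version bundled with the
interpreter, not necessarily the version under discussion. The helper names
`combining_classes` and `nfc_equal` are hypothetical.

```python
import unicodedata

# The two encodings of the syllable HUUm quoted in the message.
first = "\u0F67\u0F7E\u0F71\u0F74"
second = "\u0F67\u0F71\u0F74\u0F7E"


def combining_classes(text):
    """Return the canonical combining class of each character in text."""
    return [unicodedata.combining(ch) for ch in text]


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(combining_classes(first))
    print(combining_classes(second))
    print(nfc_equal(first, second))
```

Because canonical reordering never moves a character across one with combining
class 0, U+0F7E stays where it was typed in both strings, which is why the
comparison fails.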
> > > 2.  U+0F84 poses possible problems in string indexing.
> > >
> > > U+0F84 HALANTA is erroneously assigned a canonical combining class of
> > > nine, putting it in the class of Indic viramas.  In other Indic
> > > scripts, these "vowel-killers" have a specific control behavior which
> > > is not applicable to the Tibetan Block -- where a different encoding
> > > model with a set of explicitly combining consonants [U+0F90 to U+0FBC]
> > > was adopted.
> > >
> > > The Tibetan mark halanta/virama (U+0F84) is simply a weak diacritical
> > > mark similar to U+0F39 or U+0F82 and it has no control function like
> > > U+094D.  If a process interprets the U+0F84 as a class 9 character,
> > > the process might assume that U+0F84 is a non-printing character and
> > > therefore would not count it as a character during certain types of
> > > string indexing.
> > >
> > > 3.  U+0F84 poses possible problems in text selection/cursor
> > > positioning.
> > >
> > > Similarly, if a process interprets the U+0F84 as a class 9 character,
> > > the process might assume that the U+0F84 is acting (in the manner of
> > > an Indic virama) as a joiner and it might wrongly assume that the
> > > glyph for the character that precedes U+0F84 is conjoined into a
> > > single ligature with the glyph for the character that follows the
> > > U+0F84.  Due to these erroneous assumptions, the process might expect
> > > (e.g., when determining cursor movement/placement and text selection)
> > > a display width that does not correspond with the actual display width
> > > of the characters in question.
> > >
> > >
> > > Robert Chilton
> > > Technical Director, Asian Classics Input Project (USA)
> > > UCA & ISO-14651 Specialist, DDC Dzongkha Computing Project (Bhutan)
>
> ----------
> http://www.macchiato.com
>