L2/01-134

From: Mark Davis [markdavis34@home.com]
Sent: Saturday, March 24, 2001 3:16 PM

Comments on W3C character model, grapheme definition


The following comments were received by the W3C on their character model.
They should more appropriately have been directed to us.

My take on them is:

1. There are many cases where identical-appearing character sequences are
not identical under NFC. That was not a design criterion for NFC. We should
make this clear, perhaps in an FAQ.
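
As a concrete illustration, here is a minimal Python sketch that uses the
standard unicodedata module on the two HUUm sequences from the quoted message
below. The names HUUM_A, HUUM_B, combining_classes and nfc_equal are made up
for this example, and the result depends on the Unicode data shipped with the
Python build. With the combining classes listed in the quoted message, the
comparison is expected to report no match.

```python
import unicodedata

# The two encodings of the syllable HUUm given in the quoted message.
HUUM_A = "\u0F67\u0F7E\u0F71\u0F74"
HUUM_B = "\u0F67\u0F71\u0F74\u0F7E"


def combining_classes(text):
    """Return the canonical combining class of each code point in text."""
    return [unicodedata.combining(ch) for ch in text]


def nfc_equal(a, b):
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(combining_classes(HUUM_A))
    print(combining_classes(HUUM_B))
    print(nfc_equal(HUUM_A, HUUM_B))
```

Because U+0F7E has combining class 0, canonical reordering never moves the
marks across it, so each sequence keeps its original order under NFC.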

2, 3. Class 9 only has special meaning in the Unicode Standard in
informative material describing grapheme boundaries. That material does not
affect the W3C Character Model, since it is neither directly nor indirectly
referenced.

However, we should clean up the grapheme definition so that it does not
depend on class 9. After the last meeting I made a proposal for an addition
to Unicode 3.2 to address that. See:

http://www.unicode.org/unicode/members/L2001/01086-grapheme.htm

We can then tune the properties independently of the combining class (which
cannot be changed).

Mark


On 26/02/2001 15:23:04 Robert R. Chilton wrote:
> This is a response to the circulation of the W3C Working Draft (26
> January 2001) of "Character Model for the World Wide Web 1.0" and
> focuses particularly on considerations of string identity matching and
> string indexing as regards characters of the Tibetan Block
> (U+0F00-U+0FCF).
>
> Since each and every character in the Tibetan Block that has a canonical
> decomposition is also listed in the Composition Exclusion Table, Unicode
> Normalization Form C is equivalent to Unicode Normalization Form D for
> any string consisting of only Tibetan Block characters.  String identity
> matching and string indexing should therefore be relatively simple for
> characters in this block.
>
> Unfortunately, there are two characters in the Tibetan Block that could
> pose problems.
>
> 1.  U+0F7E poses serious problems in string identity matching.
>
> U+0F7E RJES SU NGA RO is erroneously assigned a canonical combining
> class of zero whereas it should be assigned the same combining class
> (cc = 230) as its related forms at U+0F82 and U+0F83.  A situation
> could easily arise wherein two strings which are identical in appearance
> will not match, even after normalization.  As an example, here are two
> different ways that processes might encode the frequently occurring
> syllable HUUm:
>
>      0F67 0F7E 0F71 0F74   compared to   0F67 0F71 0F74 0F7E
> [cc:  0    0   129  132              cc:  0   129  132   0   ]
>
> These two strings have identical appearance and meaning and should,
> after normalization, be an identity match.  But because U+0F7E has a
> canonical combining class of 0, they will not match even after
> normalization.  This serious problem (of non-matching) can be avoided
> if U+0F7E is assigned a correct canonical combining class of 230.
>
> 2.  U+0F84 poses possible problems in string indexing.
>
> U+0F84 HALANTA is erroneously assigned a canonical combining class of
> nine, putting it in the class of Indic viramas.  In other Indic scripts,
> these "vowel-killers" have a specific control behavior which is not
> applicable to the Tibetan Block -- where a different encoding model with
> a set of explicitly combining consonants [U+0F90 to U+0FBC] was adopted.
>
> The Tibetan mark halanta/virama (U+0F84) is simply a weak diacritical
> mark similar to U+0F39 or U+0F82 and it has no control function like
> U+094D.  If a process interprets the U+0F84 as a class 9 character,
> the process might assume that U+0F84 is a non-printing character and
> therefore would not count it as a character during certain types of
> string indexing.
>
> 3.  U+0F84 poses possible problems in text selection/cursor positioning.
>
> Similarly, if a process interprets the U+0F84 as a class 9 character,
> the process might assume that the U+0F84 is acting (in the manner of
> an Indic virama) as a joiner and it might wrongly assume that the glyph
> for the character that precedes U+0F84 is conjoined into a single
> ligature with the glyph for the character that follows the U+0F84.  Due
> to these erroneous assumptions, the process might expect (e.g., when
> determining cursor movement/placement and text selection) a display
> width that does not correspond with the actual display width of the
> characters in question.
>
>
> Robert Chilton
> Technical Director, Asian Classics Input Project (USA)
> UCA & ISO-14651 Specialist, DDC Dzongkha Computing Project (Bhutan)
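
The claims in the message above about the Tibetan Block can be spot-checked
with Python's standard unicodedata module. This is a sketch only: the range
is the one quoted in the message (U+0F00-U+0FCF), the names TIBETAN_RANGE,
chars_changed_by_nfc and combining_class are made up for this example, and a
per-character comparison is a sanity check, not a proof for arbitrary strings.

```python
import unicodedata

# Block range as quoted in the message above (an assumption of this sketch).
TIBETAN_RANGE = range(0x0F00, 0x0FD0)


def chars_changed_by_nfc(code_points):
    """Return the code points whose NFC form differs from their NFD form."""
    changed = []
    for cp in code_points:
        ch = chr(cp)
        nfc = unicodedata.normalize("NFC", ch)
        nfd = unicodedata.normalize("NFD", ch)
        if nfc != nfd:
            changed.append(cp)
    return changed


def combining_class(cp):
    """Return the canonical combining class of a single code point."""
    return unicodedata.combining(chr(cp))


if __name__ == "__main__":
    # An empty list means NFC and NFD agree for every character in the range.
    print([hex(cp) for cp in chars_changed_by_nfc(TIBETAN_RANGE)])
    # Combining classes of the marks discussed in points 1 to 3.
    for cp in (0x0F7E, 0x0F82, 0x0F83, 0x0F84, 0x094D):
        print(f"U+{cp:04X}", combining_class(cp))
```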

----------
http://www.macchiato.com