L2/05-169

Author: Ken Whistler
Date:   2005-July-11
Title:  Proposal to Obsolete the Grapheme_Link Property


With reference to the email note I sent to unicore today
(see appended), it occurred to me that not only should
the documentation in UCD.html be updated to reflect the
reality of non-usage of the Grapheme_Link property in
UAX #29 for the next release, but that the appropriate
action might be to simply retire the unused Grapheme_Link
property altogether from PropList.txt.

As far as I can tell, Grapheme_Link is a property that was
DOA. Its only use in the UCD is:

PropList.txt: defines the property and the list of characters
    for it.
    
PropertyAliases.txt: which gives "Gr_Link" and "Grapheme_Link"
    as short and long names.
    
UCD.html: which erroneously claims that the property is used
    in UAX #29, Text Boundaries.
    
It is not used in UAX #29, and was only ever mentioned in the
*proposed* draft UTR #29 -- which I don't believe is sufficient
reason for leaving it documented in the UCD, particularly since
the current membership of the property is inconsistent with the
membership claimed for it in that proposed draft.

Grapheme_Link is not used in the derivation of any other
property. And, if anything, it should be considered a derived
property, in the first place (if it even had any meaning),
because its intent was to consist of all the "viramas" plus
the combining grapheme joiner -- but for some reason it has
omitted the Tibetan and Hanunoo viramas. There *could* be some
principled reason for that, but I don't actually think it is
anything other than things getting out of synch. The better
definition would be the derivation:

   {ccc=9} + \u034F
   
which would have a better chance of staying consistent, at least.
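
As a rough illustration of that derivation (a sketch only, not part of
the UCD tooling), the set can be computed from canonical combining
classes using Python's standard unicodedata module. The function name
derived_grapheme_link is hypothetical, and the result depends on the
Unicode version bundled with the Python interpreter in use:

```python
import sys
import unicodedata

CGJ = 0x034F  # U+034F COMBINING GRAPHEME JOINER


def derived_grapheme_link():
    """Return the code points matching {ccc=9} + U+034F.

    Hypothetical helper: ccc=9 is the Virama combining class, so this
    collects every character with that class and then adds the CGJ.
    """
    members = {
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.combining(chr(cp)) == 9
    }
    members.add(CGJ)
    return members


if __name__ == "__main__":
    # List the derived set with character names for inspection.
    for cp in sorted(derived_grapheme_link()):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Because the set is recomputed from the combining classes, the Tibetan
and Hanunoo viramas (U+0F84 and U+1734, both ccc=9) are picked up
automatically rather than depending on a hand-maintained list.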

Furthermore, the property as defined has been made obsolete by the
current definition of the COMBINING GRAPHEME JOINER in
the standard -- which is leading to the kinds of confusion
documented below in the email. I don't think it is a good
idea to leave a property officially defined in the UCD which
implicitly claims that the CGJ and viramas have anything
principled in common in terms of cluster or syllable structure,
when what we are saying is that the CGJ is either:

   A. An invisible character which can be weighted to distinguish
      sequences for collation.
      
or

   B. A hack, which because of its own properties, can be used
      to prevent canonical reordering of sequences of diacritics,
      and thereby preserve distinctions for rendering which would
      otherwise be subject to loss under normalization.
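
The behavior described in B can be seen in a small sketch using
Python's standard unicodedata module. The variable names and the choice
of diacritics are illustrative assumptions, not taken from the
standard's own examples:

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, ccc=230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, ccc=220
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, ccc=0


def nfd(text):
    """Apply canonical decomposition, which includes canonical reordering."""
    return unicodedata.normalize("NFD", text)


# Without the CGJ, the two marks are sorted by combining class,
# so the dot below (220) moves ahead of the acute (230).
plain = "a" + ACUTE + DOT_BELOW
reordered = nfd(plain)

# With the CGJ between them, the marks are no longer adjacent
# non-starters, so the original order survives normalization.
guarded = "a" + ACUTE + CGJ + DOT_BELOW
preserved = nfd(guarded)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in preserved])
```

The CGJ works here only because its own combining class is 0; nothing
about that mechanism involves cluster or syllable structure.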
      
Because of these considerations, I would like the UTC to debate this
issue and decide what to do about Grapheme_Link for the next
release of the standard.

I am pressing this issue because I think we need a kind of a test
case for the constitutional law, as it were, regarding properties.
Up until the introduction of the PropertyAliases and PropertyValueAliases
files, there were willy-nilly instances of dropping properties from
the data files. We may, however, be at the point where *no* property,
once defined, can *ever* be dropped from the standard, no matter
how wrong-headed or inconsistent it may eventually turn out to be.
I want that issue decided, so we know how to proceed in managing
the ever-expanding list of character properties in the standard,
and so we know the consequences of defining any new ones.

My proposed alternatives are numbered below, to make it a little
easier to discuss them:

Option 1

Simply remove the Grapheme_Link property from PropList.txt,
PropertyAliases.txt and UCD.html. Property termination "with
extreme prejudice." :-)

Option 2

Document the Grapheme_Link property as stabilized and deprecated,
both in PropList.txt and UCD.html, and stop maintaining it.

Option 3

Remove the Grapheme_Link property from PropList.txt, but add it
as a *derived* property in DerivedCoreProperties.txt. Update the
documentation in UCD.html and deprecate the property.

Option 4

Leave the Grapheme_Link property as is, but document why it cannot
be derived from {ccc=9}. Update the documentation in UCD.html to
remove the erroneous claim that it is used in UAX #29, and
explain what Grapheme_Link actually is (whatever that may be).

Option 5

Leave the Grapheme_Link property in *name* only, but in
PropList.txt remove all the characters defined to have it,
so that it becomes an empty set for Unicode 5.0. Document this
change also in UCD.html.

[I include this option as another logical possibility, although I
don't consider it a good one -- but I think the UTC should
debate it to help set the limits on what can reasonably be
changed in character properties.]

    
------------- Begin Forwarded Message -------------

Date: Mon, 11 Jul 2005 13:03:49 -0700 (PDT)
From: Kenneth Whistler <kenw@sybase.com>
Subject: Re: Grapheme Clusters
To: FSusi@zebra.com
Cc: unicore@unicode.org, kenw@sybase.com

Fred Susi asked:


>> We are working on an implementation that requires the 
>> identification of grapheme clusters and seem to have 
>> come across some inconsistencies in the standard.  In 
>> particular we are concerned with the application of the 
>> GraphemeLink property to the determination of Grapheme Boundaries.
>> 
>> In looking over the characters that have the GraphemeLink 
>> property = true, most are Indic Viramas, and the remaining 
>> is the U+034F COMBINING GRAPHEME JOINER.


Correct.


>>  It appears that 
>> the purpose of this property is to allow the joining of two 
>> grapheme bases into a single grapheme cluster. 


What UCD.html currently says is:

"Used in determining default grapheme cluster boundaries."

and then refers you to UAX #29. As you have noted, Grapheme_Extend
is used there (in the table of Boundary Property Values), but
not Grapheme_Link. In point of fact, I think you have to track
back through the document links all the way to the Proposed
Draft UTR #29, tr29-1.html, to find explicit reference to
Grapheme_Link in the document.

What happened is that after extensive discussion by the UTC,
and before UAX #29 was finally approved and released for
Unicode 4.0, the attempt to have "default grapheme cluster
boundaries" apply to Indic aksaras was abandoned as way too
complicated and problematical. Default grapheme cluster
boundaries were simplified to the statement in the currently
approved UAX #29.
 

>> In fact, this is even demonstrated in UTR #28 Section 3.9.

                                        ^^^
                                        UAX  

>> In this section, the following example of Enclosing Combining 
>> Marks is given:
>> 
>> 	U+0915 DEVANAGARI KA
>> 	U+094D DEVANAGARI SIGN VIRAMA
>> 	U+0922 DEVANAGARI LETTER DDHA
>> 	U+20DD COMBINING ENCLOSING CIRCLE
>> 
>> The section states that the Combining Enclosing Circle should 
>> enclose the entire conjunct described above.  This is true 
>> because it is composed of elements linked by a character 
>> with the property Grapheme_Link.


UAX #28 is the delta document defining Unicode 3.2. That
text has been superseded and updated by Unicode 4.0.

If you go looking for the corresponding text as edited and
approved for publication in Unicode 4.0, you'll find it
on p. 83, where the Korean examples survived, but the
Devanagari example was *explicitly* omitted because of the
decisions taken by the UTC on this issue.


>> 
>> However, in Table 1 - Default Grapheme Cluster Boundaries of 
>> UAX #29 Text Boundaries, there is no mention of the 
>> Grapheme_Link property. 


Correct.


>> It would seem to us that the Boundary Property Values should 
>> include a  LINK: Grapheme_Link = True entry.  Also, it would 
>> seem as if the Boundary Rules should contain a rule between 
>> 9 and 10 such that:
>> 	9A) Do not break after link characters:   Link x Any


No. The table is correct as published.


>> 
>> I'm completely misunderstanding something?
>> 
>> I appreciate any feedback this group can give.


If you are working on an implementation that needs to provide
appropriate breaks in Indic syllables, then you need to be
going beyond what UAX #29 defines for default grapheme cluster
boundaries. The UAX #29 definition does not cover that now.

Also, the whole business of application of combining marks,
particularly enclosing combining marks, *continues* to be
confusing in the standard -- even *after* the publication
of Unicode 4.0. The text in Chapter 3 on this topic is
undergoing substantial review right now in an effort to try
(once again) to do better for the next version -- Unicode 5.0.

Regards,

--Ken Whistler