L2/05-169
Author: Ken Whistler
Date: 2005-July-11
Title: Proposal to Obsolete the Grapheme_Link Property

With reference to the email note I sent to unicore today (see appended), it occurred to me that not only should the documentation in UCD.html be updated to reflect the reality of non-usage of the Grapheme_Link property in UAX #29 for the next release, but that the appropriate action might be to simply retire the unused Grapheme_Link property altogether from PropList.txt.

As far as I can tell, Grapheme_Link is a property that was DOA. Its only use in the UCD is:

PropList.txt: defines the property and the list of characters for it.

PropertyAliases.txt: which gives "Gr_Link" and "Grapheme_Link" as short and long names.

UCD.html: which erroneously claims that the property is used in UAX #29, Text Boundaries.

It is not used in UAX #29, and was only ever mentioned in the *proposed* draft UTR #29 -- which I don't believe is sufficient reason for leaving it documented in the UCD, particularly since the current membership of the property is inconsistent with the claim regarding its membership in that proposed draft, in the first place.

Grapheme_Link is not used in the derivation of any other property. And, if anything, it should be considered a derived property, in the first place (if it even had any meaning), because its intent was to consist of all the "viramas" plus the combining grapheme joiner -- but for some reason it has omitted the Tibetan and Hanunoo viramas. There *could* be some principled reason for that, but I don't actually think it is anything other than things getting out of synch. The better definition would be the derivation:

  {ccc=9} + \u034F

which would have a better chance of staying consistent, at least.

Furthermore, the property as defined is obsolete by the current definition of the COMBINING GRAPHEME JOINER in the standard -- which is leading to the kinds of confusion documented below in the email.
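As a rough illustration of the derivation {ccc=9} + \u034F mentioned above, the following Python sketch enumerates that set using the standard library's unicodedata module. It is only a sketch under stated assumptions: the result reflects whatever UCD version ships with the Python interpreter, not necessarily the version under discussion here, and the function name derived_grapheme_link is hypothetical.

```python
import sys
import unicodedata

CGJ = 0x034F  # U+034F COMBINING GRAPHEME JOINER (its own ccc is 0)


def derived_grapheme_link():
    """Return the sorted code points of {ccc=9} + U+034F.

    Hypothetical helper: membership comes from the UCD version bundled
    with this Python build (see unicodedata.unidata_version).
    """
    members = {
        cp
        for cp in range(sys.maxunicode + 1)
        # combining() returns the canonical combining class (ccc)
        if unicodedata.combining(chr(cp)) == 9
    }
    members.add(CGJ)  # the CGJ is not ccc=9, so it is added explicitly
    return sorted(members)


if __name__ == "__main__":
    print("UCD version:", unicodedata.unidata_version)
    for cp in derived_grapheme_link():
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Because the set is computed from the combining class rather than listed by hand, any character later assigned ccc=9 is picked up automatically, which is the consistency argument made above.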
I don't think it is a good idea to leave a property officially defined in the UCD which implicitly is claiming that the CGJ and viramas have anything principled in common in terms of cluster or syllable structure, when what we are saying is that the CGJ is either:

A. An invisible character which can be weighted to distinguish sequences for collation.

or

B. A hack, which because of its own properties, can be used to prevent canonical reordering of sequences of diacritics, and thereby preserve distinctions for rendering which would otherwise be subject to loss under normalization.

Because of these considerations, I would like the UTC to debate this issue and decide what to do about Grapheme_Link for the next release of the standard.

I am pressing this issue because I think we need a kind of a test case for the constitutional law, as it were, regarding properties. Up until the introduction of the PropertyAliases and PropertyValueAliases files, there were willy-nilly instances of dropping properties from the data files. We may, however, be at the point where *no* property, once defined, can *ever* be dropped from the standard, no matter how wrong-headed or inconsistent it may eventually turn out to be. I want that issue decided, so we know how to proceed in managing the ever-expanding list of character properties in the standard, and so we know the consequences of defining any new ones.

My proposed alternatives are numbered below, to make it a little easier to discuss them:

Option 1

Simply remove the Grapheme_Link property from PropList.txt, PropertyAliases.txt and UCD.html. Property termination "with extreme prejudice." :-)

Option 2

Document the Grapheme_Link property as stabilized and deprecated, both in PropList.txt and UCD.html, and stop maintaining it.

Option 3

Remove the Grapheme_Link property from PropList.txt, but add it as a *derived* property in DerivedCoreProperties.txt. Update the documentation in UCD.html and deprecate the property.
Option 4

Leave the Grapheme_Link property as is, but document why it cannot be derived from {ccc=9}. Update the documentation in UCD.html to remove the erroneous claim that it is used in UAX #29, and explain what Grapheme_Link actually is (whatever that may be).

Option 5

Leave the Grapheme_Link property in *name* only, but in PropList.txt remove all the characters defined to have it, so that it becomes an empty set for Unicode 5.0. Document this change also in UCD.html.

[I include this option as another logical possibility, although I don't consider it a good one -- but I think the UTC should debate it to help set the limits on what can reasonably be changed in character properties.]

------------- Begin Forwarded Message -------------

Date: Mon, 11 Jul 2005 13:03:49 -0700 (PDT)
From: Kenneth Whistler
Subject: Re: Grapheme Clusters
To: FSusi@zebra.com
Cc: unicore@unicode.org, kenw@sybase.com

Fred Susi asked:

>> We are working on an implementation that requires the
>> identification of grapheme clusters and seem to have
>> come across some inconsistencies in the standard. In
>> particular we are concerned with the application of the
>> GraphemeLink property to the determination of Grapheme Boundaries.
>>
>> In looking over the characters that have the GraphemeLink
>> property = true, most are Indic Viramas, and the remaining
>> is the U+034F COMBINING GRAPHEME JOINER.

Correct.

>> It appears that
>> the purpose of this property is to allow the joining of two
>> grapheme bases into a single grapheme cluster.

What UCD.html currently says is:

"Used in determining default grapheme cluster boundaries."

and then refers you to UAX #29. As you have noted, Grapheme_Extend is used there (in the table of Boundary Property Values), but not Grapheme_Link.

In point of fact, I think you have to track back through the document links all the way to the Proposed Draft UTR #29, tr29-1.html, to find explicit reference to Grapheme_Link in the document.
What happened is that after extensive discussion by the UTC, and before UAX #29 was finally approved and released for Unicode 4.0, the attempt to have "default grapheme cluster boundaries" apply to Indic aksaras was abandoned as way too complicated and problematical. Default grapheme cluster boundaries were simplified to the statement in the currently approved UAX #29.

>> In fact, this is even demonstrated in UTR #28 Section 3.9.
                                         ^^^ UAX
>> In this section, the following example of Enclosing Combining
>> Marks is given:
>>
>> U+0915 DEVANAGARI KA
>> U+094D DEVANAGARI SIGN VIRAMA
>> U+0922 DEVANAGARI LETTER DDHA
>> U+20DD COMBINING ENCLOSING CIRCLE
>>
>> The section states that the Combining Enclosing Circle should
>> enclose the entire conjunct described above. This is true
>> because it is composed of elements linked by a character
>> with the property Grapheme_Link.

UAX #28 is the delta document defining Unicode 3.2. That text has been superseded and updated by Unicode 4.0. If you go looking for the corresponding text as edited and approved for publication in Unicode 4.0, you'll find it on p. 83, where the Korean examples survived, but the Devanagari example was *explicitly* omitted because of the decisions taken by the UTC on this issue.

>>
>> However, in Table 1 - Default Grapheme Cluster Boundaries of
>> UAX #29 Text Boundaries, there is no mention of the
>> Grapheme_Link property.

Correct.

>> It would seem to us that the Boundary Property Values should
>> include a LINK: Grapheme_Link = True entry. Also, it would
>> seem as if the Boundary Rules should contain a rule between
>> 9 and 10 such that:
>> 9A) Do not break after link characters: Link x Any

No. The table is correct as published.

>>
>> I'm completely misunderstanding something?
>>
>> I appreciate any feedback this group can give.
If you are working on an implementation that needs to provide appropriate breaks in Indic syllables, then you need to be going beyond what UAX #29 defines for default grapheme cluster boundaries. The UAX #29 definition does not cover that now.

Also, the whole business of application of combining marks, particularly enclosing combining marks, *continues* to be confusing in the standard -- even *after* the publication of Unicode 4.0. The text in Chapter 3 on this topic is undergoing substantial review right now in an effort to try (once again) to do better for the next version -- Unicode 5.0.

Regards,

--Ken Whistler