L2/07-202

Date: June 14, 2007
Title: Script Consistency Issue for Compatibility Greek Accents
Source: Ken Whistler
Action: For consideration by the UTC
References: L2/07-071

Background

Mark Davis and I had an action item (110-A092) to update a number of character properties, based on the canonical checking results reported in L2/07-071. While double-checking the results of the various changes and derivations, I turned up a countervailing consistency issue in the compatibility Greek accents which is troublesome, and which I would like the UTC to discuss.

The issue arises because of the proposed change (for Unicode 5.1.0) of the script property for five of the Greek compatibility accents from Script=Greek to Script=Common. The five characters in question are 1FC1, 1FED..1FEF, and 1FFD.

The reason for proposing this change in the first place is that those five characters have canonical equivalences to other spacing accents that have Script=Common. In particular:

  1FC1 = <00A8, 0342>
  1FED = <00A8, 0300>
  1FEE = 0385 = <00A8, 0301>
  1FEF = 0060
  1FFD = 00B4

where 00A8, 0060, and 00B4 all have Script=Common.

The problem is that this change then leaves behind other compatibility Greek accents which *don't* have canonical equivalences to spacing accents with Script=Common, leading to an inconsistency in script handling for this batch of characters. In particular, some of the relevant entries from Scripts.txt are shown here.
Unicode 5.0.0 (current standard)

  1FBF..1FC1 ; Greek  # Sk [3] GREEK PSILI..GREEK DIALYTIKA AND PERISPOMENI
  1FED..1FEF ; Greek  # Sk [3] GREEK DIALYTIKA AND VARIA..GREEK VARIA
  1FFD..1FFE ; Greek  # Sk [2] GREEK OXIA..GREEK DASIA

Unicode 5.1.0 (proposed on the basis of L2/07-071)

  1FC1       ; Common # Sk     GREEK DIALYTIKA AND PERISPOMENI
  1FED..1FEF ; Common # Sk [3] GREEK DIALYTIKA AND VARIA..GREEK VARIA
  1FFD       ; Common # Sk     GREEK OXIA

  1FBF..1FC0 ; Greek  # Sk [2] GREEK PSILI..GREEK PERISPOMENI
  1FFE       ; Greek  # Sk     GREEK DASIA

I don't think that the proposed resolution, based strictly on the canonical equivalence relationships for these spacing compatibility characters, makes much sense when considered in the context of the Greek Extended block itself. It will be seen as arbitrarily removing 5 Greek accent marks from the Greek script, while leaving 3 others as Greek. And it will be nearly impossible for implementers of the standard to track down and make sense of the reasoning for the differentiation.

The distinction makes even less sense when one considers the results of doing an NFKC (or NFKD) normalization on these characters. Because all of the spacing accent marks expand in NFKC to sequences with SPACE (except U+0060, which was excepted because it is an ASCII character), you end up with:

  1FC1 --> 0020 0308 0342 ; Common # GREEK DIALYTIKA AND PERISPOMENI
  1FED --> 0020 0308 0300 ; Common # GREEK DIALYTIKA AND VARIA
  1FEE --> 0020 0308 0301 ; Common # GREEK DIALYTIKA AND OXIA
  1FEF --> 0060           ; Common # GREEK VARIA
  1FFD --> 0020 0301      ; Common # GREEK OXIA

  1FBF --> 0020 0313      ; Greek  # GREEK PSILI
  1FC0 --> 0020 0342      ; Greek  # GREEK PERISPOMENI
  1FFE --> 0020 0314      ; Greek  # GREEK DASIA

where in the NFKC (or NFKD) forms SPACE and U+0060 are Script=Common, the combining marks are all Script=Inherited, and none of the characters are Script=Greek.
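The decompositions listed above can be reproduced with a short script. The following is only a minimal sketch, assuming Python and its standard unicodedata module; the helper names (decompose, report) and the SCRIPT_OF table are hypothetical and introduced purely for illustration. Python's standard library does not expose the Script property, so the script values are hand-copied from the statement above (SPACE and U+0060 are Common, the combining marks are Inherited) rather than read from Scripts.txt.

```python
import unicodedata

# Script values for the code points that appear in the NFKD results,
# hand-copied from the text of this document (an assumption of this
# sketch; the standard library has no Script property lookup).
SCRIPT_OF = {
    0x0020: "Common", 0x0060: "Common",
    0x0300: "Inherited", 0x0301: "Inherited", 0x0308: "Inherited",
    0x0313: "Inherited", 0x0314: "Inherited", 0x0342: "Inherited",
}

# The eight compatibility Greek accents discussed in this document.
ACCENTS = [0x1FC1, 0x1FED, 0x1FEE, 0x1FEF, 0x1FFD, 0x1FBF, 0x1FC0, 0x1FFE]


def decompose(cp, form):
    """Return the code points of chr(cp) normalized to the given form."""
    return [ord(ch) for ch in unicodedata.normalize(form, chr(cp))]


def report():
    """Collect (code point, NFD, NFKD, scripts seen in NFKD) per accent."""
    rows = []
    for cp in ACCENTS:
        nfd = decompose(cp, "NFD")
        nfkd = decompose(cp, "NFKD")
        scripts = sorted({SCRIPT_OF[c] for c in nfkd})
        rows.append((cp, nfd, nfkd, scripts))
    return rows


def hexes(cps):
    """Format a list of code points as space-separated 4-digit hex."""
    return " ".join(f"{c:04X}" for c in cps)


if __name__ == "__main__":
    for cp, nfd, nfkd, scripts in report():
        print(f"{cp:04X}  NFD: {hexes(nfd):<10} NFKD: {hexes(nfkd):<15} "
              f"scripts in NFKD: {', '.join(scripts)}")
```

In the NFD column, the five characters proposed for Script=Common decompose canonically, while 1FBF, 1FC0, and 1FFE are left unchanged because their decompositions are compatibility-only. In the NFKD column, all eight characters reduce to sequences in which no code point is Script=Greek.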
While there is no compelling reason why a compatibility-equivalent sequence must have the same script property value, this particular inconsistency is troubling to me, and I suspect that the proposed change in script property values will be met with general head-scratching and confusion by outside users of the standard.

Furthermore, I am concerned because there are possible unintended consequences of a change like this, not all of which are easy to envision. One that can be cited, however, is the current draft-faltstrom-idnabis-tables-02.txt being discussed in the context of revision of IDNA. That document would treat these characters differently for IDN resolution, based on whether they were Script=Greek or Script=Common. We might be in the position of having a specification based on Unicode 5.0 that would run into what it would have claimed was a prohibited status change if updated to the Unicode 5.1 values.

Regardless of whether that particular specification stands as currently drafted, this is precisely the kind of change we need to be watching out for, because there is a good chance that such changes would impact other specifications negatively and would be chalked up as another instance of "instability" in the Unicode Standard causing trouble for implementers.

Decision To Make

The UTC should determine whether the proposed change is desirable, despite the issues I have noted for it, or whether the Script property values for the 5 Greek compatibility accents in question should be left as is, despite the canonical equivalence inconsistency noted in L2/07-071.