L2/07-202
                                          
Date:   June 14, 2007

Title:  Script Consistency Issue for Compatibility Greek Accents

Source: Ken Whistler

Action: For consideration by the UTC


References: L2/07-071


Background

Mark Davis and I had an action item (110-A092) to update
a number of character properties, based on the canonical
checking results reported in L2/07-071.

In the process of double-checking the results of the various
changes and derivations, I turned up a countervailing
consistency issue in the compatibility Greek accents
which is troublesome, and which I would like the UTC
to discuss.

The issue arises because of the proposed change (for
Unicode 5.1.0) of the script property for five of the Greek
compatibility accents from Script=Greek to Script=Common.
The five characters in question are 1FC1, 1FED..1FEF, and 1FFD.

The reason for proposing this change in the first place is
that those five characters have canonical equivalences to
other spacing accents that have Script=Common. In
particular:

1FC1 = <00A8, 0342>
1FED = <00A8, 0300>
1FEE = 0385 = <00A8, 0301>
1FEF = 0060
1FFD = 00B4

where 00A8, 0060, and 00B4 all have Script=Common.
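
The equivalences listed above can be checked mechanically. The
following is a minimal sketch, assuming Python's standard unicodedata
module; the names ACCENTS and canonical_form are illustrative only and
do not come from any Unicode tool.

```python
import unicodedata

# The five compatibility Greek accents whose Script value is proposed to change.
ACCENTS = [0x1FC1, 0x1FED, 0x1FEE, 0x1FEF, 0x1FFD]

def canonical_form(cp):
    """Return the full canonical decomposition (NFD) of cp as a list of code points."""
    return [ord(c) for c in unicodedata.normalize("NFD", chr(cp))]

for cp in ACCENTS:
    seq = ", ".join(f"{c:04X}" for c in canonical_form(cp))
    print(f"{cp:04X} = <{seq}>")
```

Each sequence printed should begin with 00A8, 0060, or 00B4, which
are the Script=Common spacing accents behind the proposed change.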

The problem is that this change then leaves behind other
compatibility Greek accents which *don't* have canonical
equivalences to spacing accents with Script=Common, leading
to an inconsistency in script handling for this batch of
characters.

In particular, some of the relevant entries from Scripts.txt
are shown here.

Unicode 5.0.0 (current standard)

1FBF..1FC1    ; Greek # Sk   [3] GREEK PSILI..GREEK DIALYTIKA AND PERISPOMENI
1FED..1FEF    ; Greek # Sk   [3] GREEK DIALYTIKA AND VARIA..GREEK VARIA
1FFD..1FFE    ; Greek # Sk   [2] GREEK OXIA..GREEK DASIA

Unicode 5.1.0 (proposed on the basis of L2/07-071)

1FC1          ; Common # Sk       GREEK DIALYTIKA AND PERISPOMENI
1FED..1FEF    ; Common # Sk   [3] GREEK DIALYTIKA AND VARIA..GREEK VARIA
1FFD          ; Common # Sk       GREEK OXIA
1FBF..1FC0    ; Greek # Sk   [2] GREEK PSILI..GREEK PERISPOMENI
1FFE          ; Greek # Sk       GREEK DASIA
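
To see exactly which code points move and which stay, the two
excerpts can be expanded and compared. This is only a sketch: the
regular expression handles the simple line shapes quoted in this
document, not the full Scripts.txt syntax, and the names
parse_scripts, UNICODE_5_0, and PROPOSED_5_1 are hypothetical.

```python
import re

UNICODE_5_0 = """\
1FBF..1FC1    ; Greek # Sk   [3] GREEK PSILI..GREEK DIALYTIKA AND PERISPOMENI
1FED..1FEF    ; Greek # Sk   [3] GREEK DIALYTIKA AND VARIA..GREEK VARIA
1FFD..1FFE    ; Greek # Sk   [2] GREEK OXIA..GREEK DASIA
"""

PROPOSED_5_1 = """\
1FC1          ; Common # Sk       GREEK DIALYTIKA AND PERISPOMENI
1FED..1FEF    ; Common # Sk   [3] GREEK DIALYTIKA AND VARIA..GREEK VARIA
1FFD          ; Common # Sk       GREEK OXIA
1FBF..1FC0    ; Greek # Sk   [2] GREEK PSILI..GREEK PERISPOMENI
1FFE          ; Greek # Sk       GREEK DASIA
"""

# Matches "XXXX" or "XXXX..YYYY", then ";", then the script value.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(text):
    """Map each code point in a Scripts.txt-style excerpt to its script value."""
    table = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip blank lines and comment-only lines
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        for cp in range(first, last + 1):
            table[cp] = m.group(3)
    return table

old = parse_scripts(UNICODE_5_0)
new = parse_scripts(PROPOSED_5_1)

changed = sorted(cp for cp in old if new.get(cp) != old[cp])
unchanged = sorted(cp for cp in old if new.get(cp) == old[cp])

print("moved to Common:", " ".join(f"{cp:04X}" for cp in changed))
print("still Greek:    ", " ".join(f"{cp:04X}" for cp in unchanged))
```

The comparison splits the eight spacing accents into the five that
would become Script=Common and the three that would remain
Script=Greek.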


I don't think that the proposed resolution, based strictly on
the canonical equivalence relationships for these spacing compatibility
characters, makes much sense when considered in the context of
the Greek Extended block itself. It will be seen as arbitrarily
removing 5 Greek accent marks from the Greek script, while leaving
3 others as Greek. And it will be nearly impossible for implementers
of the standard to track down and make sense of the reasoning
for the differentiation.

The distinction makes even less sense when one considers the results
of doing an NFKC (or NFKD) normalization on these characters. Because
all of the spacing accent marks expand in NFKC to sequences with SPACE
(except U+0060, which was excepted because it is an ASCII character),
you end up with:

1FC1 --> 0020 0308 0342   ; Common # GREEK DIALYTIKA AND PERISPOMENI
1FED --> 0020 0308 0300   ; Common # GREEK DIALYTIKA AND VARIA
1FEE --> 0020 0308 0301   ; Common # GREEK DIALYTIKA AND OXIA
1FEF --> 0060             ; Common # GREEK VARIA
1FFD --> 0020 0301        ; Common # GREEK OXIA
1FBF --> 0020 0313        ; Greek  # GREEK PSILI
1FC0 --> 0020 0342        ; Greek  # GREEK PERISPOMENI
1FFE --> 0020 0314        ; Greek  # GREEK DASIA

where in the NFKC (or NFKD) forms SPACE and U+0060 are Script=Common,
the combining marks are all Script=Inherited
and none of the characters are Script=Greek.
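
The decomposition column above can be reproduced with a short script.
This is a sketch, assuming Python's standard unicodedata module; the
helper compat_form is a hypothetical name. The standard library does
not expose the Script property, so only the normalized sequences are
shown, not their script values.

```python
import unicodedata

# All eight spacing compatibility Greek accents discussed here.
ACCENTS = [0x1FC1, 0x1FED, 0x1FEE, 0x1FEF, 0x1FFD, 0x1FBF, 0x1FC0, 0x1FFE]

def compat_form(cp, form="NFKD"):
    """Return the NFKD (or NFKC) form of cp as a list of code points."""
    return [ord(c) for c in unicodedata.normalize(form, chr(cp))]

for cp in ACCENTS:
    seq = " ".join(f"{c:04X}" for c in compat_form(cp))
    # Flag any case where NFKC and NFKD would differ for this character.
    same = compat_form(cp, "NFKC") == compat_form(cp, "NFKD")
    print(f"{cp:04X} --> {seq:<16}; {unicodedata.name(chr(cp))}"
          f"{'' if same else '  (NFKC differs)'}")
```

For these characters the NFKC and NFKD forms are expected to be
identical, since SPACE does not recompose with the following
combining marks.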

While there is no compelling reason why a compatibility-equivalent
sequence must have the same script property value, this particular
inconsistency is troubling to me, and I suspect that the proposed
change in script property values will be met with general
headscratching and confusion by outside users of the standard.

Furthermore, I am concerned because there are possible unintended
consequences of a change like this, not all of which are easy
to envision. One that can be cited, however, is the current
draft-faltstrom-idnabis-tables-02.txt being discussed in the
context of revision of IDNA. That document would treat these
characters differently for IDN resolution, based on whether
they were Script=Greek or Script=Common. We might be in the
position of having a specification based on Unicode 5.0 that
would run into what it would have claimed was a prohibited
status change if updated to the Unicode 5.1 values.

Regardless of whether that particular specification stands as
currently drafted, this is precisely the kind of change we need
to be watching out for, because there is a good chance that
such changes would impact other specifications negatively and
would be chalked up as another instance of "instability" in
the Unicode Standard causing trouble for implementers.

 
Decision To Make

The UTC should determine whether the proposed change is
desirable, despite the issues I have noted for it, or
whether the Script property values for the 5 Greek
compatibility accents in question should be left as is,
despite the canonical equivalence inconsistency noted
in L2/07-071.