Problematic Change in Final_Cased Condition
Markus Scherer, Mark Davis
The condition for lowercasing a final sigma changed, and the change exposes a number of problems.
Between Unicode 3.2/UAX #21 (tr21-5) and Unicode 4 Table 3-13 (p. 89 [see the online version of chapter 3]), the condition called Final_Sigma changed to a substantially different condition called Final_Cased. While Final_Sigma could be implemented with simple property lookups, Final_Cased requires the instantiation of a higher-level word boundary implementation.
There are several problems with this:
- The change was significant, but implementers were not told what was changing. The new Final_Cased condition is defined only in the PDF of the book, not in the data file or a UAX, and it was not listed as a significant change. Such substantial changes to the standard should always be listed; without a list, implementers will not realize that they need to change anything.
- The new Final_Cased is harder to implement than the old Final_Sigma. Checking for word boundaries is much more complex than checking for general categories and combining class values of individual characters.
- The new Final_Cased condition is not clear: it mixes character tests with word boundary tests and applies a regular expression operator (*) to the boundary test. It does not specify, for example, how a * operator is to be applied to a word boundary.
- The Unicode 4 data file SpecialCasing-4.0.0.txt shows the condition as Final_Sigma, which is technically not defined in Unicode 4.
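To illustrate what "simple property lookups" means in the comparison above, here is a minimal Python sketch of a Final_Sigma-style test. It is not the normative tr21-5 text: the approximation of "cased" by the general categories Lu, Ll and Lt, the set of characters that are skipped, and all function names are assumptions made for this illustration. The point is only that each step is a per-character lookup and that no word boundary analysis is involved.

```python
import unicodedata

CAPITAL_SIGMA = "\u03A3"      # GREEK CAPITAL LETTER SIGMA
SMALL_FINAL_SIGMA = "\u03C2"  # GREEK SMALL LETTER FINAL SIGMA


def is_cased(ch):
    # Assumption: approximate "cased" by general category alone.
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def is_skippable(ch):
    # Assumption: characters with a non-zero combining class, and marks or
    # format characters, are looked through when searching for a neighbor.
    return (unicodedata.combining(ch) != 0
            or unicodedata.category(ch) in ("Mn", "Me", "Cf"))


def nearest_is_cased(text, start, step):
    # Walk from `start` in direction `step` (+1 or -1), skipping skippable
    # characters, and report whether the first remaining one is cased.
    i = start
    while 0 <= i < len(text):
        if not is_skippable(text[i]):
            return is_cased(text[i])
        i += step
    return False


def is_final_sigma(text, index):
    # A capital sigma that has a cased letter before it and none after it.
    return (text[index] == CAPITAL_SIGMA
            and nearest_is_cased(text, index - 1, -1)
            and not nearest_is_cased(text, index + 1, +1))


def lowercase_greek(text):
    # Lowercase character by character, applying the sigma rule in context.
    out = []
    for i, ch in enumerate(text):
        out.append(SMALL_FINAL_SIGMA if is_final_sigma(text, i) else ch.lower())
    return "".join(out)
```

Under these assumptions, lowercasing ΟΔΥΣΣΕΥΣ maps the two medial sigmas to σ and the last one to ς, while an isolated Σ maps to σ because no cased letter precedes it.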
We recommend the following:
- Revert to the tr21-5 Final_Sigma condition.
While it may not be optimally aligned with other parts of the Unicode standard, it appears to work equally well for normal Greek text. We advise against yet another, different definition because that would cause implementations to diverge even more, without any benefit for normal text.
- Document the case mapping conditions in comments in SpecialCasing.txt.
Implementers diff UCD files when they update to a new version of the standard. While comments are not machine-readable, they show up prominently in a diff and thus serve as a very visible marker for an implementer.
- For each future version of the Unicode standard, provide a detailed list of all substantial changes that cannot be determined from diffing the UCD files.
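The point about comments being visible in a diff can be sketched with Python's standard difflib module. The two excerpts below are made up for illustration: the data line is modeled on the sigma entry in SpecialCasing.txt, and the added comment is hypothetical wording, not text from any published file. The helper name added_lines is likewise an assumption.

```python
import difflib

# Hypothetical excerpt of an older file version.
old_version = [
    "# Conditional mappings",
    "03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA",
]

# Hypothetical excerpt of a newer version: the data line is unchanged,
# but a comment announces that the condition's definition changed.
new_version = [
    "# Conditional mappings",
    "# HYPOTHETICAL NOTE: the definition of the Final_Sigma condition changed in this version.",
    "03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA",
]


def added_lines(old_lines, new_lines):
    # Return the lines a unified diff reports as added, without the "+" prefix
    # and without the "+++" file header.
    diff = difflib.unified_diff(old_lines, new_lines,
                                fromfile="old", tofile="new", lineterm="")
    return [line[1:] for line in diff
            if line.startswith("+") and not line.startswith("+++")]


if __name__ == "__main__":
    for line in added_lines(old_version, new_version):
        print(line)
```

In this sketch the data line is identical in both versions, so without the comment the diff would be empty and the change in semantics would go unnoticed.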