L2/07-241

Source: Jonathan Kew
Date: 2007-07-31
Subject: undo mirroring of quote marks

- - - - - -

The Unicode standard has had a long-standing practice of encoding "curly quotes" according to their visual appearance as "left" or "right" (where these terms are based on English-style usage; looking at , we see that an English "left quotation mark" would be considered a "right quotation mark" in Czech, for example). Because of the diversity of usage, even within left-to-right Latin-script text, it was impractical to encode them semantically as "opening" and "closing" codes.

Many other characters that come in "left/right" pairs, such as parentheses and brackets, are regarded as really being "opening/closing" (despite their names), and are mirrored when used in right-to-left text, but this was explicitly not done with curly quotes. This situation existed until Unicode 4.1, and was clearly intentional; the non-mirroring nature of the curly quotes was mentioned specifically in the Standard. (See TUS 5.0, pp. 203, 209; the text was not updated for 5.0!)

For Unicode 5.0, however, the bidi mirroring property of the curly quotes U+2018..201F was changed in UnicodeData.txt so that these characters are now mirrored in right-to-left text.

This change has introduced a significant incompatibility between Unicode 5.0 and earlier versions, such that right-to-left text encoded with versions up through 4.1 will display incorrectly on a Unicode 5.0 system; and text encoded according to version 5.0 (but not using any newly-introduced characters) will display incorrectly under earlier versions. It breaks significant amounts of existing data (any right-to-left text that uses curly quotes), and makes it impossible to reliably render text in scripts such as Arabic or Hebrew without knowing the version of Unicode that was intended by the creator of that data.
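
The property at issue can be inspected on any system that exposes the Unicode Character Database. The following is a minimal sketch, not part of the original proposal: it uses Python's standard unicodedata module, whose mirrored() function reports the Bidi_Mirrored flag from whichever Unicode version the interpreter bundles. The helper names is_mirrored and report are illustrative assumptions. Because the result for the curly quotes depends on the bundled version, the script prints that version alongside the flags rather than assuming an outcome.

```python
import unicodedata

# Curly quotation marks discussed in the proposal: U+2018..U+201F.
CURLY_QUOTES = [chr(cp) for cp in range(0x2018, 0x2020)]

# Paired punctuation that is treated as "opening/closing" and is mirrored.
BRACKETS = ["(", ")", "[", "]"]


def is_mirrored(ch: str) -> bool:
    """Return True if the runtime's Unicode database marks ch as Bidi_Mirrored."""
    return unicodedata.mirrored(ch) == 1


def report(chars):
    """Build (code point, character name, mirrored flag) rows for the given characters."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), is_mirrored(ch))
        for ch in chars
    ]


if __name__ == "__main__":
    # The answer for the quotes depends on this version, so show it first.
    print("Unicode database version:", unicodedata.unidata_version)
    for row in report(BRACKETS + CURLY_QUOTES):
        print(*row, sep="\t")
```

Running the script on two systems built against different Unicode versions is one way to see the incompatibility described here: the rows for the brackets should agree, while the rows for U+2018..201F may not.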
This is an unacceptable situation, and the effects are just beginning to be seen as Unicode 5.0-based systems are deployed more widely.

Even under the new definition of the curly quote characters, it is still impossible to know whether a given character represents an "opening" or "closing" quote, because of the diversity of usage patterns. Thus, it is difficult to see what benefit has been gained by the change, and the cost -- in broken data and incompatible implementations -- will go on being paid for years to come.

I cannot see any way to reconcile the pre-5.0 and 5.0 versions of the curly quote characters; indeed, from a right-to-left script user's point of view, it is arguable that the property change has significantly redefined the identities of these characters, in violation of the Unicode stability policy. If we consider a run of Arabic text containing a word in curly double quotes, Unicode 4.1 would identify the opening (right-hand) quote as U+201D, and the closing (left-hand) quote as U+201C. But looking at the same text in the light of Unicode 5.0, these two codes are exchanged; the characters have swapped their identities.

Of course, it is not as simple as this in the overall picture: when used in left-to-right text, U+201C and U+201D have NOT exchanged identities. What has really happened is a re-analysis of the set of curly quote marks, so that the collection of encoded text elements is no longer the same as it was -- nor is it even a simple sub-partitioning of the previous repertoire, as would be the case with a normal disunification.

I suggest that this property change was a mistake, and has created intractable problems of incompatibility for the Standard. As there is already a great deal of right-to-left data encoded according to pre-5.0 versions of Unicode, I request that this change be reverted as a matter of urgency.
This would allow implementers to begin undoing the damage as soon as possible, confident that all future versions of the standard will follow the 1.0-4.1 definition of the curly quote characters.

So I suggest that Unicode 5.1 should revert to the non-mirroring definition of curly quotes. Issuing a corrigendum even before this would help further to contain the damage, as some vendors may be able to patch systems before the full 5.1 release, or change not-yet-deployed systems.

It is unfortunately impossible to undo the fact that there are already incompatible implementations in existence, and incompatible data being created. But if the problem is acknowledged and corrected quickly, the amount of "5.0-style" data can be minimized, and users encouraged to update that data before it is widely disseminated and becomes deeply buried in archives, etc., where there will forever be doubt as to the proper interpretation of the curly quote characters.

Jonathan Kew