L2/07-241

Source: Jonathan Kew
Date: 2007-07-31
Subject: undo mirroring of quote marks

- - - - - -

The Unicode standard has had a long-standing practice of encoding
"curly quotes" according to their visual appearance as "left" or
"right" (where these terms are based on English-style usage; looking
at <http://en.wikipedia.org/wiki/Quotation_mark%2C_non-English_usage>,
we see that an English "left quotation mark" would be
considered a "right quotation mark" in Czech, for example). Because
of the diversity of usage, even within left-to-right Latin-script
text, it was impractical to encode them semantically as "opening" and
"closing" codes. Many other characters that come in "left/right"
pairs, such as parentheses and brackets, are regarded as really being
"opening/closing" (despite their names), and are mirrored when used
in right-to-left text, but this was explicitly not done with curly
quotes.

This situation held through Unicode 4.1, and was clearly
intentional; the non-mirroring nature of the curly quotes was
mentioned specifically in the Standard. (See TUS 5.0, pp. 203, 209;
the text was not updated for 5.0!)

For Unicode 5.0, however, the bidi mirroring property of the curly
quotes U+2018..201F was changed in UnicodeData.txt so that these
characters are now mirrored in right-to-left text.
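
For readers who want to check the property in a given data file, the
following is a minimal Python sketch. It relies only on the
UnicodeData.txt layout (semicolon-separated fields, with the mirrored
flag in field 9, counting from 0). The two sample records are
illustrative, written to reflect the "N" and "Y" states described
above; they are not copied from any particular release file, and the
function name is my own.

```python
def bidi_mirrored(unicode_data_line: str) -> bool:
    """Return the Bidi_Mirrored flag from one UnicodeData.txt record."""
    fields = unicode_data_line.strip().split(";")
    # Field 0 is the code point; field 9 holds "Y" or "N" for mirroring.
    return fields[9] == "Y"


# Illustrative records for U+201C (assumed, see the note above).
PRE_5_0 = ("201C;LEFT DOUBLE QUOTATION MARK;Pi;0;ON;;;;;N;"
           "DOUBLE TURNED COMMA QUOTATION MARK;;;;")
AS_5_0 = ("201C;LEFT DOUBLE QUOTATION MARK;Pi;0;ON;;;;;Y;"
          "DOUBLE TURNED COMMA QUOTATION MARK;;;;")

if __name__ == "__main__":
    print("pre-5.0 mirrored:", bidi_mirrored(PRE_5_0))
    print("5.0 mirrored:    ", bidi_mirrored(AS_5_0))
```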

This change has introduced a significant incompatibility between
Unicode 5.0 and earlier versions, such that right-to-left text
encoded with versions up through 4.1 will display incorrectly on a
Unicode 5.0 system; and text encoded according to version 5.0 (but
not using any newly-introduced characters) will display incorrectly
under earlier versions. It breaks significant amounts of existing
data (any right-to-left text that uses curly quotes), and makes it
impossible to reliably render text in scripts such as Arabic or
Hebrew without knowing the version of Unicode that was intended by
the creator of that data. This is an unacceptable situation, and the
effects are just beginning to be seen as Unicode 5.0-based systems
are deployed more widely.

Even under the new definition of the curly quote characters, it is
still impossible to know whether a given character represents an
"opening" or "closing" quote, because of the diversity of usage
patterns. Thus, it is difficult to see what benefit has been gained
by the change, and the cost -- in broken data and incompatible
implementations -- will go on being paid for years to come.

I cannot see any way to reconcile the pre-5.0 and 5.0 versions of the
curly quote characters; indeed, from a right-to-left script user's
point of view, it is arguable that the property change has
significantly redefined the identities of these characters, in
violation of the Unicode stability policy. If we consider a run of
Arabic text, containing a word in curly double quotes, Unicode 4.1
would identify the opening (right-hand) quote as U+201D, and the
closing (left-hand) quote as U+201C. But looking at the same text in
the light of Unicode 5.0, these two codes are exchanged; the
characters have swapped their identities.
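
The exchange can be made concrete with a small Python sketch. It is
a deliberate simplification, not an implementation of the bidi
algorithm: it assumes a purely right-to-left run, and it assumes (as
the paragraph above does) that mirroring U+201C yields the U+201D
glyph and vice versa. The table and function names are hypothetical.

```python
# Assumed glyph pairing for the two curly double quotes.
MIRROR = {"\u201C": "\u201D", "\u201D": "\u201C"}


def displayed_glyphs(rtl_text: str, quotes_mirrored: bool) -> str:
    """Glyphs shown for a right-to-left run, kept in logical order."""
    if not quotes_mirrored:
        return rtl_text  # pre-5.0 behaviour: quotes shown as encoded
    # 5.0 behaviour: each quote is replaced by its mirror partner
    return "".join(MIRROR.get(ch, ch) for ch in rtl_text)


# Logical order: opening quote, an Arabic word, closing quote.
WORD = "\u0643\u0644\u0645\u0629"
TEXT_4_1 = "\u201D" + WORD + "\u201C"  # encoded per pre-5.0 practice
TEXT_5_0 = "\u201C" + WORD + "\u201D"  # same appearance, per 5.0

if __name__ == "__main__":
    # Old data on a 5.0 system shows the quotes the wrong way round.
    print(displayed_glyphs(TEXT_4_1, True) == TEXT_4_1)
    # The same appearance now requires the opposite code points.
    print(displayed_glyphs(TEXT_5_0, True) == TEXT_4_1)
```

Under these assumptions, the intended appearance is reproduced only
when the data and the renderer agree on the version, which is the
incompatibility described above.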

Of course, it is not as simple as this in the overall picture: when
used in left-to-right text, U+201C and U+201D have NOT exchanged
identities. What has really happened is a re-analysis of the set of
curly quote marks, so that the collection of encoded text elements is
no longer the same as it was -- nor is it even a simple
sub-partitioning of the previous repertoire, as would be the case
with a normal disunification.

I suggest that this property change was a mistake, and has created
intractable problems of incompatibility for the Standard.

As there is already a great deal of right-to-left data encoded
according to pre-5.0 versions of Unicode, I request that this change
be reverted as a matter of urgency. This would allow implementers to
begin undoing the damage as soon as possible, confident that all
future versions of the standard will follow the 1.0-4.1 definition of
the curly quote characters.

So I suggest that Unicode 5.1 should revert to the non-mirroring
definition of curly quotes. Issuing a corrigendum even before this would
help further to contain the damage, as some vendors may be able to
patch systems before the full 5.1 release, or change not-yet-deployed
systems. It is unfortunately impossible to undo the fact that there
are already incompatible implementations in existence, and
incompatible data being created. But if the problem is acknowledged
and corrected quickly, the amount of "5.0-style" data can be
minimized, and users encouraged to update that data before it is
widely disseminated and becomes deeply buried in archives, etc.,
where there will forever be doubt as to the proper interpretation of
the curly quote characters.


Jonathan Kew