From: Kenneth Whistler (
Date: Fri Dec 07 2007 - 15:09:23 CST

    Karl Pentzlin asked:

    > This leads to my questions:
    > a.) Why U+FD3E has GC property Ps and U+FD3F has Pe, and not vice
    > versa?
    > b.) Why U+FD3E and U+FD3F have the Bidi_mirroring property not set?

    There is lots of history for this -- some of which was
    hinted at by the other responses.

    The most recent round on this dates to PRI #80, which was
    closed in February 2006. PRI #80 requested feedback on a
    proposal to change the Bidi_Mirrored property for a bunch
    of characters that had until then been Bidi_Mirrored=False,
    including a number of directional quotation marks and
    these two ornate parentheses.

    Part of the feedback received on PRI #80 was a specific
    request *not* to change the Bidi_Mirrored property for the two
    parentheses -- feedback received from Roozbeh Pournader on
    behalf of the High Council of Informatics of Iran -- because
    doing so would invalidate all existing usage of them
    in Persian. It would put them in conflict with the Iranian
    standard, which had been deliberately aligned with Unicode
    for bidirectional behavior.

    As a result of that feedback, in the resolutions dealing
    with PRI #80, the UTC explicitly decided to exempt the
    two parentheses from the change proposed for the other
    characters (mostly the quotation marks):

    "106-M3 Motion: Drop U+FD3E ORNATE LEFT PARENTHESIS and
    U+FD3F ORNATE RIGHT PARENTHESIS from the list of
    characters with Bidi_Mirrored property proposed in Public
    Review Issue 80."

    That motion carried. And as a result those characters were
    not changed to Bidi_Mirrored=True in Unicode 5.0.

    The fact that (controversially) some quotation marks were
    changed to Bidi_Mirrored=True as a result of PRI #80, and
    that that decision was itself later reversed is related to the
    question of the Bidi_Mirrored property for the two ornate
    parentheses, of course, but the actual decision paths taken
    for the two different sets of characters forked as of
    that February, 2006 motion.

    As to the *ancient* history here, the fact is that the ornate
    parentheses have never changed property values in the standard.

    UnicodeData-2.0.14.txt (July 1996)


    UnicodeData-5.0.0.txt (July 2006)


    The assignment of gc=Ps and gc=Pe was not a mistake. All
    left/right parenthesis and bracket pairs are given gc=Ps for
    the "LEFT" of the pair and gc=Pe for the "RIGHT" of the pair.
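
    For readers who want to check these two properties against a
    current implementation, here is a minimal Python sketch using the
    standard unicodedata module. It is not part of the original
    message: the helper name describe is made up for illustration,
    and the values printed depend on the Unicode version bundled with
    the Python build, which may differ from the Unicode 5.0 data
    discussed here.

```python
import unicodedata

def describe(ch):
    """Return (name, General_Category, Bidi_Mirrored) for one character.

    'describe' is a hypothetical helper for this illustration only.
    """
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        bool(unicodedata.mirrored(ch)),
    )

# The ASCII parentheses next to the two ornate parentheses.
# Results reflect the Unicode version shipped with this Python build.
for ch in "()\uFD3E\uFD3F":
    name, gc, mirrored = describe(ch)
    print(f"U+{ord(ch):04X} {name}: gc={gc}, Bidi_Mirrored={mirrored}")
```

    Putting the ASCII pair next to the ornate pair makes the contrast
    in Bidi_Mirrored easy to see in the output.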

    The reason why these two were not given the Bidi_Mirrored
    property in the first place is pretty simple: it was a
    bit of a catch-22 in late 1995 when data files for Unicode 2.0
    were being developed. UnicodeData-2.0.14.txt was synched
    with what became Table 4-7, Mirrored Characters, in the printed text of
    Unicode 2.0, pp. 4-22..4-25. Significantly, *that* also
    had to be synchronized with Annex C in ISO/IEC 10646-1:1993,
    which *also* provided an explicit list of "Mirrored characters
    in Arabic bi-directional context". And *that* corresponded
    to the list in Appendix G, "Symmetric Swapping Characters"
    in Unicode 1.1 (1993), which was deliberately checked to
    assure it was the same as Annex C in 10646-1:1993.

    The machine-readable form of this goes back to June, 1994, when
    the availability of what is now archived as UnicodeData-1.1.5.txt
    was announced:

    "We would like to announce the availability of a new updated and comprehensive
    data file for Unicode characters. This file not only includes all the
    information in previous names list, but also incorporates the Unicode 1.1 data
    for each character: canonical ordering priority, bidirectional categories,
    character decomposition, numeric values, symmetric swapping. It also adds new
    information on character categorization and upper/lower/title case mappings."
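
    To make the shape of such a data file concrete, here is a small
    Python sketch that splits one semicolon-delimited record into
    named fields. It assumes the 15-field layout of present-day
    UnicodeData.txt, which may not match the 1994 file exactly. The
    field labels, the parse_record helper, and the choice of sample
    record (LEFT PARENTHESIS, as it appears in the current file) are
    all illustrative.

```python
# Field labels are informal names chosen for this sketch, following
# the field order of present-day UnicodeData.txt (an assumption for
# the 1994 file).
FIELDS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "bidi_mirrored",
    "unicode_1_name", "iso_comment", "uppercase", "lowercase", "titlecase",
]

def parse_record(line):
    """Split one UnicodeData-style record into a dict of named fields."""
    values = line.strip().split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

# Sample record for U+0028, used here only to show the layout.
sample = "0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;"
record = parse_record(sample)
print(record["code"], record["general_category"], record["bidi_mirrored"])
```

    The "symmetric swapping" information mentioned in the announcement
    corresponds to the Y/N mirrored field in this layout.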

    But the list goes back way further. The list in 10646-1:1993
    was published in DIS 10646-1.2:1992 (26 December 1991), so
    it was actually developed in 1991, during the period of
    hectic and shall we say "stimulating" ;-) merger discussions
    for Unicode and 10646.

    The existence of FD3E and FD3F as a pair of parentheses in
    the then-draft standard was simply overlooked by the people
    working on finding all the mirrored characters
    for the 10646-1 Annex listing them. In fact the *parenthesis* and
    *bracket* characters were also explicitly listed in the normative
    Clause 20 of the 10646 standard at the time, as well. This
    was being done at a time when there were still no machine-readable
    files for character properties, nor any explicit written
    out specification of the Bidirectional Algorithm.

    The whole issue was murkified at the time because there was
    an argument going on whether compatibility with
    implementations that controlled glyph swapping with explicit
    control codes was also required.

    DIS 10646-1.2:1992 did not contain symmetric swapping controls,
    but I note the following comment in minutes from the
    UTC New Scripts Committee meeting of May 2, 1992:

    "A lively discussion of mirrored characters in Unicode BIDI ensued. The
    subcommittee decided that although we believe codes to control mirroring are
    at the wrong level, we would agree to support adding them if required in
    order to make the merger with 10646 work. These additions were proposed by
    Israel, as amended by Isai Scheinberg."

    And, in fact that was the origin of the following two characters
    which ended up in 10646-1:1993:


    Of course those characters were always deprecated from the point
    of view of the Unicode Bidirectional Algorithm, but the fact
    that people were still arguing about including them as of 1992
    indicates that at the time the architecture for bidirectional
    behavior was still somewhat in play, and people tended to focus
    on that to the detriment of detailed examination of consistency
    of property assignments.

    Another thing to note is that as of 1993, characters higher
    than U+DFFF were still officially designated as the "R-zone"
    (or "Restricted Use zone") -- a kind of toxic zone where
    private use, presentation forms, and compatibility characters
    lived their stunted lives. It would have been difficult at the
    time to get people to take those characters as serious
    participants in property lists, particularly ones listed
    as "presentation forms", which were taken just to be glyphs,
    basically. It was only later that implementations started
    cherry-picking the R-zone for "real" characters, and eventually
    the R-zone designation itself was dropped and many other
    characters started getting encoded in the high parts of the BMP.

    So, to make a very, very long story somewhat shorter, there
    was an early convergence of effects that resulted in the
    situation where FD3E/FD3F did not have Bidi_Mirrored=True
    as of Unicode 2.0:

    1. Oversight by those drafting the initial mirrored list for 10646.
    2. The fact that these were encoded among Arabic presentation
       forms in the R-zone, which was a strike against them even
       if they had been noticed.
    3. Concern by the UTC in the development of Unicode 2.0 to
       focus on consolidation and correct synchronization, rather
       than putting too many issues in play by trying to "fix" things
       all at once, particularly for Bidi, which was difficult
       and painful at the time. (When isn't it? *hehe*)

    There was a window of opportunity between Unicode 2.0 and
    Unicode 3.0, when the UTC *did* focus on updating and correcting
    a lot of character properties, and adding lots of new ones.
    And during that time, the process for synchronizing work between
    WG2 and UTC started to get ironed out, so it probably would have
    been possible to fix the Bidi_Mirrored property for FD3E/FD3F
    then, had anybody actually noticed.

    But after the publication of Unicode 3.0, implementation
    stability started to matter more and more with the passing
    years. Normalization was already locked down, and bidirectional
    properties started to get hard to modify. By now, implementations
    are widespread enough, and there is enough data that essentially
    nobody wants to "fix" things any more. It is far less damaging
    to live with past mistakes in properties than to try to "fix" them now --
    which tends to have the paradoxical effect of breaking
    implementations and invalidating data.

    Or to put it very shortly, this saga is another illustration
    of the Mick Jagger Principle of Character Encoding:

      You can't always get what you want!
