Re: Confusion about weak and strong disunification

From: Asmus Freytag
Date: Sun Aug 16 2009 - 14:05:49 CDT

    On 8/16/2009 5:18 AM, Shriramana Sharma wrote:
    > Hello. I am new to the Unicode list. Please be patient. I promise to
    > do my homework by googling at least once before asking questions here.
    > This is with reference to the following text from the P&P document N3452.
    > <quote>A strong case of disunification occurs where there is prevalent
    > practice of using the existing character. A weak case of
    > disunification occurs where there is little or no use of the existing
    > character for the purpose for which the new character is intended.
    > Example: Adding a period in a new script is a weak disunification if
    > we assume that nobody has an existing implementation of that script
    > using the regular period. Adding a clone of a Latin letter for use
    > with Cyrillic script is a strong disunification as mixed
    > Latin/Cyrillic character sets exist and have been used for encoding
    > the languages that the new characters are intended for.</quote>
    > I would like to know what exactly the adjectives "strong" and "weak"
    > are supposed to mean. Does "strong case" mean that the case is highly
    > supportive of disunification or that strong reasons need to be
    > supplied before the disunification is accepted? Similarly for "weak
    > case".
    A "strong case" is not the same as a "strong disunification".
    "Implementation" in the quoted passage loosely means the existence and
    use of a "character set".

    For Cyrillic, many character sets exist (and have existed for a long
    time, even prior to Unicode) that contain _both_ the Latin alphabet and
    the Cyrillic alphabet. The shape "a" occurs in both alphabets, and has
    been encoded using two character codes. On the other hand the shape "z"
    is thought to occur only in one alphabet (the Latin) and is coded only
    once. If some not-so-well-known language has been written in Cyrillic,
    but using the "z" shape, every digitally encoded document in that
    language would have had to use the character code for "z" from the
    Latin alphabet section of those character sets.
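
    To make the "one shape, two codes" point concrete, here is a minimal
    Python sketch. It uses only the standard unicodedata module. The
    helper names (describe, latin_letters_in_cyrillic_word) are made up
    for illustration, and classifying letters by the prefix of their
    character name is a simplifying assumption, not a robust script test.

```python
import unicodedata

LATIN_A = "\u0061"     # LATIN SMALL LETTER A
CYRILLIC_A = "\u0430"  # CYRILLIC SMALL LETTER A, same shape, separate code


def describe(ch):
    """Return the code point and character name, e.g. 'U+0061 LATIN SMALL LETTER A'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def latin_letters_in_cyrillic_word(word):
    """Return the Latin letters found in a word that also contains Cyrillic letters.

    Illustrative helper (an assumption of this sketch): the script is guessed
    from the prefix of the character name.
    """
    names = [unicodedata.name(ch, "") for ch in word]
    if not any(name.startswith("CYRILLIC") for name in names):
        return []
    return [ch for ch, name in zip(word, names) if name.startswith("LATIN")]


print(describe(LATIN_A))
print(describe(CYRILLIC_A))
print(LATIN_A == CYRILLIC_A)  # the two "a" shapes are different characters

# A made-up Cyrillic word that borrows the Latin "z":
print(latin_letters_in_cyrillic_word("\u0430z\u0431"))
```

    A scan like this over legacy data is one rough way to see how much
    existing practice a proposed clone would disturb.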

    If, after many years of such practice, someone proposes a *new*
    character code for the same shape "z", to be used only when that shape
    occurs in the Cyrillic context, the P&P document calls that a "strong
    disunification" because suddenly there is a choice for users, and old
    documents have, of necessity, made the "wrong" choice. As a result, the
    status of the "z" in the Latin alphabet section has changed, and in some
    contexts (like searching) one now needs to treat *two* characters as
    identical. "Strong" disunification means breaking apart the uses of an
    existing character in a potentially disruptive way and spreading them
    over two characters, one of them new.
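
    The searching problem can be sketched in a few lines of Python. This
    is only an illustration under stated assumptions: the "Cyrillic z"
    is hypothetical, so a Private Use Area code point (U+E000) stands in
    for it, and the names SEARCH_FOLD, fold_for_search and find_all are
    invented for this example.

```python
# Hypothetical stand-in: U+E000 (Private Use Area) plays the role of a newly
# encoded "Cyrillic z" clone. No such assignment is implied.
HYPOTHETICAL_CYRILLIC_Z = "\ue000"

# Fold the clone back to the existing Latin "z" so both spellings match.
SEARCH_FOLD = str.maketrans({HYPOTHETICAL_CYRILLIC_Z: "z"})


def fold_for_search(text):
    """Map the hypothetical clone to Latin 'z'; all other characters pass through."""
    return text.translate(SEARCH_FOLD)


def find_all(haystack, needle):
    """Return every index where needle occurs, treating both 'z' codes as identical.

    The fold is one-to-one in length, so indices into the folded text are
    also valid indices into the original text.
    """
    folded_haystack = fold_for_search(haystack)
    folded_needle = fold_for_search(needle)
    if not folded_needle:
        return []
    positions = []
    start = folded_haystack.find(folded_needle)
    while start != -1:
        positions.append(start)
        start = folded_haystack.find(folded_needle, start + 1)
    return positions


old_doc = "\u0430z\u0431"       # written before the clone existed
new_doc = "\u0430\ue000\u0431"  # written with the hypothetical clone

print(new_doc.find("z"))                           # plain search misses the new spelling
print(find_all(old_doc, HYPOTHETICAL_CYRILLIC_Z))  # folded search finds the old one
print(find_all(new_doc, "z"))                      # and the new one
```

    Every search, sort or comparison routine that touches such text would
    need an equivalence of this kind, which is part of the disruption.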

    If a script is first computerized at the time it is encoded in Unicode,
    then adding a clone of the period does not disrupt users of U+002E
    (the standard period) in the same way: there are no old (digitally
    encoded) documents to worry about, and users of the new script know
    from day one to use the new character. This is a "weak" disunification
    (in the sense of the P&P document) because U+002E had never been used
    in the context of the new script before.

    A proposal that results in a strong disunification requires a very
    strong case in favor.

    However, even a proposal that results in a weak disunification still
    requires a justification - for example, there are many scripts that use
    Western punctuation "as is" and therefore those code points don't
    necessarily need to be duplicated.

