Re: Ambiguity and disunification

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Mar 03 2005 - 15:02:10 CST

    > >Dean Snyder wrote:
    > >> Wrong - if you encode only one of the disambiguated usages you have
    > >> actually INCREASED the ambiguity of the original character; it now has
    > >> not only its original ambiguous significance but ALSO a new context-bound
    > >> unambiguous significance opposite the newly encoded character's
    > >> significance. In addition, there is no way to represent all three usages
    > >> (one ambiguous, two unambiguous) in the same plain text passage.
    > >>

    And Gregg Reynolds asked:

    > >You lost me there. If I have <hyphen/minus> and <hyphen>, for example,
    > >there's nothing ambiguous about the fact that the former is ambiguous
    > >(bi-semous?) and the latter not. How does adding <hyphen> to the
    > >repertoire change the meaning of <hyphen/minus>? Have I misunderstood
    > >something?

    Actually, I don't think you have misunderstood anything. Adding
    <hyphen> to the repertoire does not change the meaning of <hyphen/minus>,
    nor does it change the interpretation of text which may have used
    a <hyphen/minus> character before a distinct <hyphen> was encoded.

    What we have ended up with is 3 characters. One of those is the
    legacy character that represented an encoding compromise very early
    on (itself derived from a typewriter keyboard design limitation)
    which reflected a willingness to put up with ambiguous usage that
    didn't reflect actual typographical practice, in order to gain
    the manifest benefits of typewriters (and later computers and
    digital text representation).

    These 3 characters are now distinct in Unicode and have distinct
    interpretations and properties:

    U+002D HYPHEN-MINUS gc=Pd, bc=ES, lb=HY

    U+2010 HYPHEN gc=Pd, bc=ON, lb=BA

    U+2212 MINUS SIGN gc=Sm, bc=ES, lb=PR

    The bidirectional category reflects the role of U+002D and U+2212
    in formatting numbers. The line break values reflect the differences
    between U+2010, which presents a break after opportunity, and
    U+2212, which tends to stick with any number that follows it, if
    not separated by a space. Line break value lb=HY represents the
    ambiguous status of U+002D: "Some additional context analysis is
    required to distinguish usage of this character as a hyphen from
    the use as minus sign (or indicator of numerical range). If used
    as hyphen, it acts like HYPHEN." -- from UAX #14.

    The properties for U+002D are a compromise for processing the
    character, since its ambiguity can result in different behavior,
    depending on context. That ambiguity inheres in the character
    itself, and is not dependent on the fact that U+2010 and U+2212
    also exist with somewhat different behaviors.

    Actually, there is even a 3-way ambiguity, because U+002D is
    also rather indifferently used for an en-dash, which is properly:

    U+2013 EN DASH gc=Pd, bc=ON, lb=BA

    This character is very similar to HYPHEN in terms of properties
    and processing, but typically has a distinct glyph from HYPHEN.
    It is preferred for use in range notations, like this: 1972-1994,
    whereas HYPHEN is preferred for actual hyphens, like this:
    time-tested. (Both conveyed in this 8859-1 e-mail as 0x2D = U+002D.)
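
    The gc and bc values listed above can be checked against whatever
    version of the Unicode Character Database a given runtime ships.
    What follows is a minimal sketch using Python's standard
    unicodedata module; the helper name describe() is made up for
    illustration. As far as I know that module does not expose the
    Line_Break property, so the lb values are not covered here.

```python
import unicodedata

# The four "dash" characters discussed in this message.
DASHES = ["\u002D", "\u2010", "\u2212", "\u2013"]

def describe(ch):
    """Return (code point, name, general category, bidi class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # gc
        unicodedata.bidirectional(ch),  # bc
    )

for ch in DASHES:
    print(*describe(ch))
```

    The values printed depend on the database version bundled with the
    interpreter, so they are worth comparing against the list above
    rather than taking on faith.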

    Dean responded:

    > If only hyphen/minus and hyphen have been encoded, and you have in a
    > single plain text passage a hyphen/minus along with a hyphen the hyphen/
    > minus has two possible interpretations - one as the original ambiguous
    > hyphen/minus or the other as an unambiguous minus contrasting with the
    > hyphen character. Thus its possible interpretations are controlled by the
    > presence or absence of a hyphen character in the same plain text passage.
    > That is context-bound and frankly stateful.

    This is an incorrect analysis.

    Consider the following plain text snippet:

         "Both conveyed in this 8859-1 e-mail as 0x2D = U+002D."
                                    ^002D
                                        ^2010

    (where I am presuming that the text is encoded in Unicode and I have
    indicated one "dash" as being represented with U+002D and one "dash"
    as being represented with U+2010)

    This piece of text meets Dean's premise. However, the first dash is
    *un*ambiguously identified as U+002D HYPHEN-MINUS and the second
    dash is *un*ambiguously identified as U+2010 HYPHEN. (I can determine
    that simply by examining the actual content of codes in the backing
    store.)
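
    Examining the backing store can be sketched in a few lines. This
    is only an illustration, assuming the snippet is held as a Python
    string with the two dashes encoded as described; the function
    name dash_code_points() and the set of characters it looks for
    are choices made for this example.

```python
# The snippet from above, with the first dash stored as U+002D
# and the second as U+2010.
text = "Both conveyed in this 8859\u002D1 e\u2010mail as 0x2D = U+002D."

def dash_code_points(s):
    """List (index, code point) for each dash-like character in s."""
    dashes = {"\u002D", "\u2010", "\u2013", "\u2212"}
    return [(i, f"U+{ord(c):04X}") for i, c in enumerate(s) if c in dashes]

for index, code_point in dash_code_points(text):
    print(index, code_point)
```

    Each position reports exactly one code point, whatever other
    characters happen to occur in the same passage.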

    Their existence in that piece of text says nothing about the continuing
    status of the encoded characters U+002D HYPHEN-MINUS and U+2010 HYPHEN
    in the standard, and their relationship to each other. U+002D continues
    to be present in the standard, ambiguously used for any of several
    rather distinct usages, whereas U+2010 HYPHEN is intended for only
    one of those.

    Furthermore, the first dash is U+002D being used for an en dash,
    rather than a minus sign. And there is nothing preventing me from
    continuing to elaborate on the multiple uses:

       "Conveyed in this time-tested 8859-1 e-mail on 3-2-2005 at 17:20GMT -8."
       
    Now those happen to all be 0x2D in this e-mail, but in Unicode I could
    either convert all those 0x2D's to U+002D, or I could choose to
    sort them all out, using en dashes, minus signs, and hyphens, as
    appropriate.
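
    Both options can be sketched as follows. Decoding the 8859-1
    bytes turns every 0x2D into U+002D; sorting them out then takes
    a human decision per occurrence. The helper sort_out() and the
    particular choices below are assumptions made for illustration
    (the date separators are simply left as U+002D), not a
    recommendation.

```python
HYPHEN, EN_DASH, MINUS = "\u2010", "\u2013", "\u2212"

raw = b"Conveyed in this time-tested 8859-1 e-mail on 3-2-2005 at 17:20GMT -8."

# Option 1: straight conversion; every 0x2D becomes U+002D.
text = raw.decode("iso-8859-1")

def sort_out(s, choices):
    """Replace the k-th U+002D in s with choices[k]; None keeps U+002D."""
    remaining = iter(choices)
    out = []
    for c in s:
        if c == "\u002D":
            replacement = next(remaining, None)
            out.append(replacement if replacement is not None else c)
        else:
            out.append(c)
    return "".join(out)

# Option 2: one editorial choice per dash, in order of appearance:
# time-tested, 8859-1, e-mail, 3-2, 2-2005, -8.
sorted_text = sort_out(text, [HYPHEN, EN_DASH, HYPHEN, None, None, MINUS])
print([f"U+{ord(c):04X}" for c in sorted_text if c in "\u002D\u2010\u2013\u2212"])
```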

    And users being who they are, anyone, including me, could end up
    making incorrect choices as well, so that I might end up with
    an EN DASH in "time-tested" and a MINUS SIGN in "8859-1", for example.

    The interpretation and identity of the *encoded characters* in any
    of these instances depends on the standard itself, and is one
    issue.

    The interpretation of the function of the punctuation and symbols
    in textual context is quite another.

    So I can say things like, "The '-' in the word 'time-tested' is a
    hyphen." And that can be correct, even if I happened to represent
    the text in a Unicode data store as U+2212 MINUS SIGN.

    Ambiguity in the interpretation of text in context is quite a
    different thing than ambiguity in the interpretation of an
    encoded character in the standard.

    > >Lost me again. Be patient. Are you saying that the presence of e.g.
    > ><hyphen> in a string of text somehow affects the meaning of
    > ><hyphen/minus>?

    It does not.

    >
    > It affects the number of choices available - the presence of a hyphen
      ^^
      
    The presence of a U+2010 HYPHEN in a string of text does not
    affect the number of choices available.

    The historical fact of the disunification of U+002D HYPHEN-MINUS
    by the additional encodings of U+2010 HYPHEN and U+2212 MINUS SIGN
    (and U+2013 EN DASH) affects the number of felicitous representations
    of a particular piece of text with a '-' in it.

    > adds one more choice for interpretation; its absence leaves you guessing
    > as to its usage (is hyphen/minus being used contrastively with hyphen or
    > not?).

    And this confuses the issue. It isn't the presence of the alternatives
    in the standard that controls this -- it is the usage in text which
    controls it.

    --Ken
    ^^
    A sequence of two <002D, 002D>, not used contrastively with em dash.
    (Note that Dean himself, in his e-mail, uses a single U+002D as
    an em dash.)

    ;-)
     ^ oops, there's another one -- the HYPHEN-MINUS SMILEY NOSE
     



    This archive was generated by hypermail 2.1.5 : Thu Mar 03 2005 - 15:03:23 CST