Re: Ambiguity and disunification

From: Dean Snyder
Date: Fri Mar 04 2005 - 07:30:11 CST


    Kenneth Whistler wrote at 1:02 PM on Thursday, March 3, 2005:

    ><hyphen> to the repertoire does not change the meaning of <hyphen/minus>,
    >nor does it change the interpretation of text which may have used
    >a <hyphen/minus> character before a distinct <hyphen> was encoded.
    >What we have ended up with is 3 characters. One of those is the
    >legacy character that represented an encoding compromise very early
    >on (itself derived from a typewriter keyboard design limitation)
    >which reflected a willingness to put up with ambiguous usage that
    >didn't reflect actual typographical practice, in order to gain
    >the manifest benefits of typewriters (and later computers and
    >digital text representation).
    >These 3 characters are now distinct in Unicode and have distinct
    >interpretations and properties:
    >U+002D HYPHEN-MINUS gc=Pd, bc=ES, lb=HY
    >U+2010 HYPHEN gc=Pd, bc=ON, lb=BA
    >U+2212 MINUS SIGN gc=Sm, bc=ES, lb=PR
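
    The gc and bc values quoted above can be checked mechanically. The
    following is only a minimal sketch, assuming Python's standard
    unicodedata module; the helper name dash_properties is made up for
    illustration, and the lb (line break) property is left out because that
    module does not expose it.

```python
import unicodedata

# The three characters under discussion, keyed by their Unicode names.
DASHES = {
    "HYPHEN-MINUS": "\u002D",
    "HYPHEN": "\u2010",
    "MINUS SIGN": "\u2212",
}

def dash_properties(ch):
    """Return (general category, bidi class) for a single character."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)

if __name__ == "__main__":
    for name, ch in DASHES.items():
        gc, bc = dash_properties(ch)
        print(f"U+{ord(ch):04X} {name} gc={gc}, bc={bc}")
```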

    [Here Ken follows with several playful examples of multiple, ambiguous
    uses of various dashes:]

    All of which either misses or ignores the point of my example, where I
    stated very clearly, "If only hyphen/minus and hyphen have been encoded".
    That hypothetical scenario unhappily matches precisely what Unicode is
    actually, and inconsistently, doing in some instances (Hebrew and
    cuneiform), but not in others (the dashes).

    It is, of course, very significant that Unicode, correctly, DID NOT do an
    incomplete disambiguation of the dashes (keeping as they did the original
    ambiguous one while adding the disambiguated usages), but sadly they are
    proceeding with incomplete disambiguations in Hebrew and cuneiform. And
    any number of playful or whimsical dash examples do not negate this sober

    By the way, I have been using hypothetical dash examples only because
    others started with them, and because many here are not familiar with the
    arcana of the actual (soon-to-be) encoded Hebrew and cuneiform examples
    that fit the scenario under discussion.

    Nevertheless, despite all of Ken's supposed counterexamples of dash
    usage, my point still holds. I will try to spell it out (still using dash
    examples) so explicitly that one cannot miss the point.

    In a single plain text passage, presuming an incomplete encoding where
    only hyphen/minus and hyphen are encoded, if an author meticulously uses
    hyphen/minus for minus but hyphen everywhere else, one presumes he is
    using them contrastively, thereby following the raison d'être for the
    new, partial-disambiguation encoding model itself.

    If, however, you cut from that text the phrase "2-3", i.e. "2 hyphen/minus
    3", and place it into a context where only hyphen/minus is used (a
    context that ignores the new encoding model), you have now lost the
    original author's contrast and will not know in the new text, without a
    context-bound analysis, whether this phrase should be interpreted as
    "2 to 3" or "minus 1". What will you do now if you need to interpret
    that phrase before entering it as a value in a spreadsheet or database?
    You will need to do a context-bound analysis, perhaps even a human one at

    But it will be argued that, at least in the original document, the
    correct values can be determined programmatically, unambiguously, and by
    context-free processes.
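
    The context-free reading being claimed here could be sketched roughly as
    follows. This is only an illustration under the hypothetical partial
    model (hyphen/minus always means minus, hyphen means everything else,
    here taken to be a range); the function name interpret and the returned
    labels are invented for the example.

```python
import re

HYPHEN_MINUS = "\u002D"
HYPHEN = "\u2010"

# Matches "<digits><dash><digits>" where the dash is one of the two
# characters encoded in the hypothetical partial model.
_PATTERN = re.compile(
    rf"([0-9]+)({re.escape(HYPHEN_MINUS)}|{re.escape(HYPHEN)})([0-9]+)"
)

def interpret(expr):
    """Read expr with no context, trusting the partial model completely."""
    m = _PATTERN.fullmatch(expr)
    if m is None:
        raise ValueError(f"not of the form <int><dash><int>: {expr!r}")
    left, dash, right = int(m.group(1)), m.group(2), int(m.group(3))
    if dash == HYPHEN_MINUS:
        # Assumption: the author reserved hyphen/minus for minus.
        return ("difference", left - right)
    # Assumption: a hyphen between two numbers denotes a range.
    return ("range", (left, right))
```

    The result is only as good as the assumption that the text really
    follows the model, which nothing in the expression itself can confirm.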

    Not so fast. This is where the insidiousness of the partial
    disambiguation encoding model rears its ugly head.

    The problem comes when we have an author who, although he meticulously
    follows the new, incomplete disambiguation encoding model and always uses
    hyphen/minus for minus and hyphen for everything else, happens to produce
    a text in which there is no hyphen. Unless you know the author and are
    familiar with his practices, or are informed by someone, or do a context-
    bound analysis of the text itself, you will not know that the author has
    meticulously followed the new model and there are indeed no hyphens in
    this text.

    In a sense that makes the partial disambiguation style of encoding even
    worse than just leaving the original ambiguities in place, because you
    have to live with the uncertainty as to the trustworthiness of a
    particular text unless there is at least ONE contrastive usage in place.
    That's what I mean when I say that the ambiguity is compounded by an
    incomplete disambiguation encoding model.
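
    To make the "at least ONE contrastive usage" condition concrete, here is
    a small sketch, again under the hypothetical two-character model. The
    function name dash_evidence and its three labels are assumptions made up
    for illustration; it inspects only which characters occur, not how they
    are used.

```python
HYPHEN_MINUS = "\u002D"
HYPHEN = "\u2010"

def dash_evidence(text):
    """Report what the text itself reveals about the author's convention."""
    has_hyphen_minus = HYPHEN_MINUS in text
    has_hyphen = HYPHEN in text
    if has_hyphen_minus and has_hyphen:
        # Both characters occur, so the author is at least distinguishing them.
        return "contrastive"
    if has_hyphen_minus:
        # Could be a careful author with no hyphens to write, or a legacy
        # author using hyphen/minus for everything: the text cannot say.
        return "undetermined"
    return "no-hyphen-minus"
```

    A text containing only "2-3" and one containing only "well-known" both
    come back as "undetermined", which is exactly the uncertainty described
    above.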

    Happily and wisely, though, Unicode did not do this with the dashes -
    they should follow their own precedent and not incompletely disambiguate
    Hebrew, cuneiform, etc.


    Dean A. Snyder

    Assistant Research Scholar
    Manager, Digital Hammurabi Project
    Computer Science Department
    Whiting School of Engineering
    218C New Engineering Building
    3400 North Charles Street
    Johns Hopkins University
    Baltimore, Maryland, USA 21218

    office: 410 516-6850
    cell: 717 817-4897
