Ambiguity and disunification (was: a commodious vicus of Hebrew recirculation from: Re: Unicode Stability)

From: Kenneth Whistler (
Date: Wed Mar 02 2005 - 14:29:31 CST

    Peter Kirk said:

    > The character is indeed ambiguous in Unicode 4.0. But it must no longer
    > be ambiguous in Unicode 4.1 or 5.0, because otherwise that would leave
    > two equally valid ways of spelling the same word.

    This is a fundamental misunderstanding. The standard does not work
    that way, and disunifications create no such effect or requirement.

    This is why Peter Constable has been claiming (correctly) that
    the particular disunification(s) in question do not invalidate
    any existing data. Failure to understand that seems to be what
    drives the utterly nonconverging argument that plagues the list,
    revisiting the same claims over and over and over again.

    I'll illustrate with an example that is much more accessible to
    most of the list participants than the arcana of Holam Haser and
    Holam Male.

    U+002D HYPHEN-MINUS is fundamentally ambiguous. It was used
    ambiguously in ASCII for years before it ever made it into
    ISO 8859-1 and hence into Unicode. The Unicode Standard introduced
    a disambiguating disunification, by adding U+2010 HYPHEN (a dash
    punctuation) and U+2212 MINUS SIGN (a math symbol and operator).

    The existence of U+2010 and U+2212 in the Unicode Standard does
    not invalidate existing ASCII-based data using U+002D. It does
    not create any obligation on users to spell things unambiguously
    with the new characters and to eschew the ambiguous U+002D. It
    merely allows them to do so, if they choose to and have the
    proper contexts and tools to make use of the new characters.
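
    For readers who want to look at the three characters directly, here
    is a minimal sketch using Python's standard unicodedata module. The
    helper name describe() is made up for this illustration. It lists
    each character's name and General Category, and then checks that no
    normalization form folds any one of the three into another.

```python
import unicodedata

# The ambiguous ASCII character and the two disambiguated additions.
CHARS = ["\u002D", "\u2010", "\u2212"]

def describe(ch):
    """Return (code point, character name, General Category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in CHARS:
    print(describe(ch))

# None of the three has a decomposition mapping, so each one should survive
# every normalization form unchanged.
stable = all(
    unicodedata.normalize(form, ch) == ch
    for form in ("NFC", "NFD", "NFKC", "NFKD")
    for ch in CHARS
)
print("unchanged by normalization:", stable)
```

    In other words, data spelled with U+002D stays spelled with U+002D
    unless somebody deliberately converts it.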

    The existence of U+2212 MINUS SIGN results in "two equally valid
    ways of spelling" "-2", for instance. In some contexts, as in
    the C programming language, for example, only one of those
    can actually be used, namely <002D, 0032>, and that is the one
    using the *ambiguous* character, because only U+002D is allowed
    in the formal syntax for expressions, and not U+2212. Some other
    contexts, such as formal algebraic systems and mathematical
    formula layout engines, may support both and distinguish clearly
    between them, even to the point of using different layout rules
    and/or glyphs for them. That is an implementation decision.
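
    The same split can be sketched in Python instead of C, since
    Python's own expression syntax likewise accepts only U+002D as the
    minus operator. The two function names below are hypothetical, and
    lenient_int() stands in for any application that *chooses* to
    accept both spellings; it is an illustration, not a recommendation.

```python
import ast

ASCII_SPELLING = "\u002D2"  # <002D, 0032>
MINUS_SPELLING = "\u22122"  # <2212, 0032>

def parses_as_python_literal(text):
    """True if Python's own syntax accepts the text as a literal expression."""
    try:
        ast.literal_eval(text)
        return True
    except (SyntaxError, ValueError):
        return False

def lenient_int(text):
    """Hypothetical application-level parser that accepts either spelling."""
    return int(text.replace("\u2212", "\u002D"))

print(parses_as_python_literal(ASCII_SPELLING))  # the formal syntax allows U+002D
print(parses_as_python_literal(MINUS_SPELLING))  # U+2212 is rejected by the tokenizer
print(lenient_int(ASCII_SPELLING), lenient_int(MINUS_SPELLING))
```

    Neither behavior is mandated by the Unicode Standard. The strict
    parser and the lenient one are both conformant.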

    What the Unicode Standard does not do, and never *will* do,
    is deprecate U+002D HYPHEN-MINUS for its intended (ambiguous)
    usage, despite the fact that there are separately encoded
    characters for HYPHEN and for MINUS SIGN. It should be utterly
    obvious why the Unicode Standard will not do that: such a
    change would be completely destabilizing.

    Now take the discussion about Holams and plug them in at appropriate
    places for HYPHEN-MINUS and MINUS SIGN, and you end up with essentially
    the same situation -- only involving a much more obscure case in
    Hebrew instead of an obvious case in ASCII.

    And no, Dean, this is not an invitation to come re-argue the
    case that any disambiguating disunification should (or must)
    encode *both* of the disambiguated usages. It matters not
    whether a disunification proceeds as:

    X (:: A or B) ==> Y (:: A)

    or as:

    X (:: A or B) ==> Y (:: A)
                  ==> Z (:: B)

    In *either* case, you *still* are left with an X encoded, ambiguous
    between meaning "A" or "B". And data that makes use of that X,
    whether generated before *or* after the historical point that
    the disambiguating disunification decision was taken, may still
    be ambiguous in exactly the same way it was before such additions.
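
    The point can be made concrete with a toy model. Everything in it
    is hypothetical: a repertoire is just a dictionary from an encoded
    character to the set of usages it may denote, and disunify() only
    ever adds entries. Under those assumptions, X comes out of either
    kind of disunification exactly as ambiguous as it went in.

```python
# Toy model: each encoded character maps to the set of usages it may denote.
before = {"X": {"A", "B"}}

def disunify(repertoire, additions):
    """Return a new repertoire with characters added; existing entries are never altered."""
    updated = {name: set(usages) for name, usages in repertoire.items()}
    for name, usages in additions.items():
        if name in updated:
            raise ValueError(f"{name} is already encoded")
        updated[name] = set(usages)
    return updated

def is_ambiguous(repertoire, name):
    """A character is ambiguous here if it may denote more than one usage."""
    return len(repertoire[name]) > 1

one_sided = disunify(before, {"Y": {"A"}})              # X ==> Y
two_sided = disunify(before, {"Y": {"A"}, "Z": {"B"}})  # X ==> Y, Z

for label, repertoire in (("one-sided", one_sided), ("two-sided", two_sided)):
    print(label, "X still ambiguous:", is_ambiguous(repertoire, "X"))
```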

    > At the very least the
    > old representation, in this case Holam Haser on VAV represented as
    > HOLAM, should be clearly deprecated as an obsolete spelling no longer to
    > be supported.

    No, it should not. Claiming that it should be is precisely what
    keeps leading Jony to come back to the list claiming that
    this change represents a destabilization of existing Hebrew data.
