Re: Ambiguity and disunification

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Mar 03 2005 - 15:02:10 CST

Next message: Kenneth Whistler: "Re: Unicode Stability (Was: Re: E0000 Language Tags for Some Obscure Languages)"

Previous message: Dean Snyder: "Re: Ambiguity and disunification"
Maybe in reply to: Dean Snyder: "Re: Ambiguity and disunification"
Next in thread: Dean Snyder: "Re: Ambiguity and disunification"
Reply: Dean Snyder: "Re: Ambiguity and disunification"
Reply: Patrick Andries: "double hyphen"

> >Dean Snyder wrote:
> >> Wrong - if you encode only one of the disambiguated usages you have
> >> actually INCREASED the ambiguity of the original character; it now has
> >> not only its original ambiguous significance but ALSO a new context-bound
> >> unambiguous significance opposite the newly encoded character's
> >> significance. In addition, there is no way to represent all three usages
> >> (one ambiguous, two unambiguous) in the same plain text passage.
> >>

And Gregg Reynolds asked:

> >You lost me there. If I have <hyphen/minus> and <hyphen>, for example,
> >there's nothing ambiguous about the fact that the former is ambiguous
> >(bi-semous?) and the latter not. How does adding <hyphen> to the
> >repertoire change the meaning of <hyphen/minus>? Have I misunderstood
> >something?

Actually, I don't think you have misunderstood anything. Adding
<hyphen> to the repertoire does not change the meaning of <hyphen/minus>,
nor does it change the interpretation of text which may have used
a <hyphen/minus> character before a distinct <hyphen> was encoded.

What we have ended up with is 3 characters. One of those is the
legacy character that represented an encoding compromise very early
on (itself derived from a typewriter keyboard design limitation)
which reflected a willingness to put up with ambiguous usage that
didn't reflect actual typographical practice, in order to gain
the manifest benefits of typewriters (and later computers and
digital text representation).

These 3 characters are now distinct in Unicode and have distinct
interpretations and properties:

U+002D HYPHEN-MINUS gc=Pd, bc=ES, lb=HY

U+2010 HYPHEN gc=Pd, bc=ON, lb=BA

U+2212 MINUS SIGN gc=Sm, bc=ES, lb=PR

The bidirectional category reflects the role of U+002D and U+2212
in formatting numbers. The line break values reflect the differences
between U+2010, which presents a break after opportunity, and
U+2212, which tends to stick with any number that follows it, if
not separated by a space. Line break value lb=HY represents the
ambiguous status of U+002D: "Some additional context analysis is
required to distinguish usage of this character as a hyphen from
the use as minus sign (or indicator of numerical range). If used
as hyphen, it acts like HYPHEN." -- from UAX #14.

The properties for U+002D are a compromise for processing the
character, since its ambiguity can result in different behavior,
depending on context. That ambiguity inheres in the character
itself, and is not dependent on the fact that U+2010 and U+2212
also exist with somewhat different behaviors.

Actually, there is even a 3-way ambiguity, because U+002D is
also rather indifferently used for an en-dash, which is properly:

U+2013 EN DASH gc=Pd, bc=ON, lb=BA

This character is very similar to HYPHEN in terms of properties
and processing, but typically has a distinct glyph from HYPHEN.
It is preferred for use in range notations, like this: 1972-1994,
whereas HYPHEN is preferred for actual hyphens, like this:
time-tested. (Both conveyed in this 8859-1 e-mail as 0x2D = U+002D.)

Dean responded:

> If only hyphen/minus and hyphen have been encoded, and you have in a
> single plain text passage a hyphen/minus along with a hyphen the hyphen/
> minus has two possible interpretations - one as the original ambiguous
> hyphen/minus or the other as an unambiguous minus contrasting with the
> hyphen character. Thus its possible interpretations are controlled by the
> presence or absence of a hyphen character in the same plain text passage.
> That is context-bound and frankly stateful.

This is an incorrect analysis.

If I have the following plain text snippet:

     "Both conveyed in this 8859-1 e-mail as 0x2D = U+002D."
                                ^002D
                                    ^2010

(where I am presuming that the text is encoded in Unicode and I have
indicated one "dash" as being represented with U+002D and one "dash"
as being represented with U+2010)

This piece of text meets Dean's premise. However, the first dash is
*un*ambiguously identified as U+002D HYPHEN-MINUS and the second
dash is *un*ambiguously identified as U+2010 HYPHEN. (I can determine
that simply by examining the actual content of codes in the backing
store.)
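To illustrate what "examining the backing store" might look like, here is a small Python sketch. It assumes the snippet is held as a Unicode string with the two dashes encoded as marked above. The names `DASH_LIKE` and `dash_report` are hypothetical helpers, not anything from the standard.

```python
import unicodedata

# The snippet, with the first dash stored as U+002D and the second as U+2010.
snippet = "Both conveyed in this 8859\u002D1 e\u2010mail as 0x2D = U+002D."

# Characters treated as "dashes" for this illustration only.
DASH_LIKE = {"\u002D", "\u2010", "\u2013", "\u2212"}

def dash_report(text):
    """List (index, code point, character name) for each dash-like character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in DASH_LIKE
    ]

for entry in dash_report(snippet):
    print(entry)
```

The identity of each encoded character comes straight from the stored code point. Nothing about the neighboring characters enters into it.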

Their existence in that piece of text says nothing about the continuing
status of the encoded characters U+002D HYPHEN-MINUS and U+2010 HYPHEN
in the standard, and their relationship to each other. U+002D continues
to be present in the standard, ambiguously used for any of several
rather distinct usages, whereas U+2010 HYPHEN is intended for only
one of those.

Furthermore, the first dash is U+002D being used for an en dash,
rather than a minus sign. And there is nothing preventing me from
continuing to elaborate on the multiple uses:

"Conveyed in this time-tested 8859-1 e-mail on 3-2-2005 at 17:20GMT -8."

Now those happen to all be 0x2D in this e-mail, but in Unicode I could
either convert all those 0x2D's to U+002D, or I could choose to
sort them all out, using en dashes, minus signs, and hyphens, as
appropriate.
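Both options can be sketched in Python. The first step is a plain 8859-1 decode, which maps every 0x2D byte to U+002D. The second step is one possible hand-sorting of just two of the dashes; which dash gets which character is a choice made here for illustration, not something the conversion can decide by itself.

```python
raw = b"Conveyed in this time-tested 8859-1 e-mail on 3-2-2005 at 17:20GMT -8."

# Option 1: straight conversion. Every 0x2D becomes U+002D HYPHEN-MINUS.
converted = raw.decode("latin-1")

# Option 2: sort some of them out by hand (illustrative choices only):
# the hyphen in "time-tested" becomes U+2010 HYPHEN, and the sign in
# "GMT -8" becomes U+2212 MINUS SIGN. The rest stay as U+002D.
sorted_out = (
    converted
    .replace("time-tested", "time\u2010tested")
    .replace("GMT -8", "GMT \u22128")
)

print(converted.count("\u002D"))   # HYPHEN-MINUS count after option 1
print(sorted_out.count("\u002D"))  # HYPHEN-MINUS count after option 2
```

Either result is valid Unicode text. The second simply records more of the writer's intent in the encoding.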

And users being who they are, anyone, including me, could end up
making incorrect choices as well, so that I might end up with
an EN DASH in "time-tested" and a MINUS SIGN in "8859-1", for example.

The interpretation and identity of the *encoded characters* in any
of these instances depends on the standard itself, and is one
issue.

The interpretation of the function of the punctuation and symbols
in textual context is quite another.

So I can say things like, "The '-' in the word 'time-tested' is a
hyphen." And that can be correct, even if I happened to represent
the text in a Unicode data store as U+2212 MINUS SIGN.

Ambiguity in the interpretation of text in context is quite a
different thing than ambiguity in the interpretation of an
encoded character in the standard.

> >Lost me again. Be patient. Are you saying that the presence of e.g.
> ><hyphen> in a string of text somehow affects the meaning of
> ><hyphen/minus>?

It does not.

>
> It affects the number of choices available - the presence of a hyphen
                                            ^^

The presence of a U+2010 HYPHEN in a string of text does not
affect the number of choices available.

The historical fact of the disunification of U+002D HYPHEN-MINUS
by the additional encodings of U+2010 HYPHEN and U+2212 MINUS SIGN
(and U+2013 EN DASH) affects the number of felicitous representations
of a particular piece of text with a '-' in it.

> adds one more choice for interpretation; its absence leaves you guessing
> as to its usage (is hyphen/minus being used contrastively with hyphen or
> not?).

And this confuses the issue. It isn't the presence of the alternatives
in the standard that controls this -- it is the usage in text which
controls it.

--Ken
^^
A sequence of two <002D, 002D>, not used contrastively with em dash.
(Note that Dean himself, in his e-mail, uses a single U+002D as
an em dash.)

;-)
 ^ oops, there's another one -- the HYPHEN-MINUS SMILEY NOSE


This archive was generated by hypermail 2.1.5 : Thu Mar 03 2005 - 15:03:23 CST