Re: Ambiguity and disunification

From: Dean Snyder (dean.snyder@jhu.edu)
Date: Fri Mar 04 2005 - 07:30:11 CST

Next message: Arcane Jill: "Re: Small Java implementation of NFC"

Previous message: Markus Scherer: "Re: Bad Content-type headers on Unicode web site?"
In reply to: Kenneth Whistler: "Re: Ambiguity and disunification"
Next in thread: Patrick Andries: "double hyphen"

Kenneth Whistler wrote at 1:02 PM on Thursday, March 3, 2005:

>Adding
><hyphen> to the repertoire does not change the meaning of <hyphen/minus>,
>nor does it change the interpretation of text which may have used
>a <hyphen/minus> character before a distinct <hyphen> was encoded.
>
>What we have ended up with is 3 characters. One of those is the
>legacy character that represented an encoding compromise very early
>on (itself derived from a typewriter keyboard design limitation)
>which reflected a willingness to put up with ambiguous usage that
>didn't reflect actual typographical practice, in order to gain
>the manifest benefits of typewriters (and later computers and
>digital text representation).
>
>These 3 characters are now distinct in Unicode and have distinct
>interpretations and properties:
>
>U+002D HYPHEN-MINUS gc=Pd, bc=ES, lb=HY
>
>U+2010 HYPHEN gc=Pd, bc=ON, lb=BA
>
>U+2212 MINUS SIGN gc=Sm, bc=ES, lb=PR
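
[For readers who want to check two of the properties Ken lists, here is a
minimal sketch using Python's standard unicodedata module. It reports
only the general category (gc) and bidi class (bc); the module does not
expose the line-break property (lb), so that column is not covered. The
helper name dash_properties is mine, for illustration only, and the
values returned depend on the Unicode version bundled with your Python.]

```python
import unicodedata

# The three characters under discussion.
DASHES = ["\u002D", "\u2010", "\u2212"]


def dash_properties():
    """Map each character's Unicode name to (general category, bidi class)."""
    return {
        unicodedata.name(ch): (
            unicodedata.category(ch),
            unicodedata.bidirectional(ch),
        )
        for ch in DASHES
    }


if __name__ == "__main__":
    for name, (gc, bc) in dash_properties().items():
        print(f"{name}: gc={gc}, bc={bc}")
```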

[Here Ken follows with several playful examples of multiple, ambiguous
uses of various dashes:]

All of which either misses or ignores the point of my example, where I
stated very clearly, "If only hyphen/minus and hyphen have been encoded".
That hypothetical scenario, unhappily, matches precisely what Unicode is
actually, and inconsistently, doing in some instances (Hebrew and
cuneiform), but not in others (the dashes).

It is, of course, very significant that Unicode, correctly, DID NOT do an
incomplete disambiguation of the dashes (keeping as they did the original
ambiguous one while adding the disambiguated usages), but sadly they are
proceeding with incomplete disambiguations in Hebrew and cuneiform. And
no number of playful or whimsical dash examples negates this sober
reality.

By the way, I have only been using hypothetical dash examples because
others started with them first and many here are not familiar with the
arcana of the actual (soon-to-be) encoded Hebrew and cuneiform examples
that fit the scenario under discussion.

Nevertheless, despite all of Ken's supposed counterexamples of dash
usage, my point still holds. I will try to spell it out (still using dash
examples) so explicitly that one cannot miss the point.

In a single plain text passage, presuming an incomplete encoding where
only hyphen/minus and hyphen are encoded, if an author meticulously uses
hyphen/minus for minus but hyphen everywhere else, one presumes he is
using them contrastively, thereby following the raison d'être for the
new, partial-disambiguation encoding model itself.

If, however, you cut from that text the phrase "2-3", i.e. "2 hyphen/minus
3", and paste it into a context where only hyphen/minus is used (a
context that ignores the new encoding model), you have lost the original
author's contrast. Without a context-bound analysis you will not know, in
the new text, whether this phrase should be interpreted as "2 to 3" or as
"minus 1" (that is, 2 minus 3). What will you do now if you need to
interpret that phrase before entering it as a value in a spreadsheet or
database? You will need to do a context-bound analysis, perhaps even a
human one at that.
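
To make the cut-and-paste problem concrete, here is a minimal sketch in
Python, still under the hypothetical where only hyphen/minus and hyphen
are encoded. The function name interpret_dash and its flag are my own
inventions for illustration. The flag stands for knowledge that lives
outside the text, namely whether the surrounding document is known to
follow the partial-disambiguation model.

```python
HYPHEN_MINUS = "\u002D"  # the legacy, ambiguous character
HYPHEN = "\u2010"        # the disambiguated hyphen


def interpret_dash(ch, context_follows_partial_model):
    """Return the reading of a dash character under the hypothetical model.

    The flag is out-of-band knowledge about the surrounding document;
    nothing in the character itself carries it.
    """
    if ch == HYPHEN:
        return "hyphen"
    if ch == HYPHEN_MINUS:
        # Only a document known to follow the model licenses "minus".
        return "minus" if context_follows_partial_model else "ambiguous"
    raise ValueError(f"not a dash under discussion: {ch!r}")


phrase = "2" + HYPHEN_MINUS + "3"

# In the meticulous author's document the dash can be read as minus ...
in_original = interpret_dash(phrase[1], context_follows_partial_model=True)

# ... but the very same code points, pasted elsewhere, are ambiguous again.
after_paste = interpret_dash(phrase[1], context_follows_partial_model=False)
```

The phrase is byte-for-byte identical in both cases; only the flag
differs, and the flag is exactly what the pasted text does not carry.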

But it will be argued that, at least in the original document, the
correct values can be determined programmatically, unambiguously, and by
context-free processes.

Not so fast. This is where the insidiousness of the partial
disambiguation encoding model rears its ugly head.

The problem comes when we have an author who, although he meticulously
follows the new, incomplete disambiguation encoding model and always uses
hyphen/minus for minus and hyphen for everything else, happens to produce
a text in which there is no hyphen. Unless you know the author and are
familiar with his practices, or are informed by someone, or do a
context-bound analysis of the text itself, you will not know that the
author has meticulously followed the new model and that there are indeed
no hyphens in this text, so that every hyphen/minus in it is a minus.

In a sense, that makes the partial disambiguation style of encoding even
worse than just leaving the original ambiguities in place, because you
have to live with the uncertainty as to the trustworthiness of a
particular text unless there is at least ONE contrastive usage in place.
That's what I mean when I say that the ambiguity is compounded by an
incomplete disambiguation encoding model.
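
The "at least ONE contrastive usage" condition can be sketched as a
simple check, again in Python and again under the hypothetical where
only hyphen/minus and hyphen exist. The function name contrast_evidence
and its three labels are mine, for illustration only. The most a
context-free process can report is whether both characters occur.

```python
HYPHEN_MINUS = "\u002D"
HYPHEN = "\u2010"


def contrast_evidence(text):
    """Report what the text alone reveals about the author's practice."""
    has_hyphen_minus = HYPHEN_MINUS in text
    has_hyphen = HYPHEN in text
    if has_hyphen_minus and has_hyphen:
        # Both occur, so the author appears to use them contrastively.
        return "contrastive"
    if has_hyphen_minus:
        # Only the legacy character occurs: a meticulous author with no
        # hyphens to write, or an author ignoring the model? Unknown.
        return "undetermined"
    return "no hyphen/minus"


# A meticulous author's text that happens to contain no hyphen ...
meticulous = "The difference 2" + HYPHEN_MINUS + "3 is negative."
# ... and a legacy text using hyphen/minus for a range, "2 to 3".
legacy = "See pages 2" + HYPHEN_MINUS + "3 for details."
# A text with one contrastive usage in place.
contrastive = "A well" + HYPHEN + "known result: 2" + HYPHEN_MINUS + "3."
```

Both meticulous and legacy come back "undetermined", although the first
means minus and the second means a range; only the third text gives a
context-free process anything to trust.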

Happily and wisely, though, Unicode did not do this with the dashes -
they should follow their own precedent and not incompletely disambiguate
Hebrew, cuneiform, etc.

Respectfully,

Dean A. Snyder

Assistant Research Scholar
Manager, Digital Hammurabi Project
Computer Science Department
Whiting School of Engineering
218C New Engineering Building
3400 North Charles Street
Johns Hopkins University
Baltimore, Maryland, USA 21218

office: 410 516-6850
cell: 717 817-4897
www.jhu.edu/digitalhammurabi/
http://users.adelphia.net/~deansnyder/

This archive was generated by hypermail 2.1.5 : Fri Mar 04 2005 - 11:12:35 CST