From: Hans Aberg (firstname.lastname@example.org)
Date: Fri Feb 18 2005 - 12:22:14 CST
At 14:49 -0800 2005/02/17, D. Starner wrote:
>> Now mix capitalization in the bag: In natural languages, capitalization
>> typically does not alter the semantics of the word.
>That's not invariable; there are rare cases where capitalization alters
>semantics. For example, Poles and poles are two different things. More
>importantly, capitalization alters the meaning of the sentence and paragraph.
It depends on whether people can parse the sentence "he was a pole", or the
same sentence in all-caps, "HE WAS A POLE".
It is just an illustration that human writing relies on several principles,
and it can be difficult to design a logical model around it.
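A minimal sketch of the Poles/poles point above (illustrative, not a proposed design): Unicode-aware case folding, as exposed by Python's str.casefold(), deliberately erases the case distinction, so whatever semantics the capital carried is lost.

```python
# Case folding collapses the Poles/poles distinction, so any
# semantics carried by the capital letter is lost after folding.
pairs = [("Poles", "poles"), ("pH", "ph")]

for upper, lower in pairs:
    # str.casefold() is Python's Unicode-aware case folding
    folded_equal = upper.casefold() == lower.casefold()
    print(upper, lower, folded_equal)
```

So a process that folds case can no longer tell nationals of Poland from wooden poles; the distinction lives only in the original capitalization.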
>> capitalization can be used to communicate certain semantic information:
>> Start of sentence, proper noun, (in German) noun, abbreviation, etc. If one
>> sticks to the semantic approach, then one should add abstract characters
>> "start of sentence", "proper noun", etc., zip out say the uppercase letters,
>> and let the rendering machine make a correct presentation.
>But that's not the complete list. There is a practically unlimited variety
>of things that capital letters have been used for; any such list of characters
>would be insufficient. Do you put the "noun" character before all nouns, just
>in case we want to render this in an early modern English font that capitalizes
>all nouns? What about pH? To try and handle capitalization like this may be
>suitable for an English professor marking something up in TEI-Lite, but it's
>sheer madness for a character encoding standard. For all intents and purposes,
>treating capital letters and small letters as different is the only sane way
>to go, and I know of no character set that has done otherwise (and presumed
>to support both.)
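This is indeed how Unicode works today, and a quick sketch makes it concrete: capital and small letters get distinct code points, and the case relation is a separate character property rather than an identity.

```python
# Capital and small letters are distinct code points in Unicode;
# the case relation is a mapping between them, not sameness.
print(ord("A"), ord("a"))   # two different code points
print("a".upper() == "A")   # related via an uppercase mapping
print("A" == "a")           # but not the same character
```

So the "graphic" distinction is primary in the encoding, and any semantic relation between the two cases is recovered through case-mapping operations.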
I just wanted to illustrate the underlying principles and how complicated it
can be when one tries to draw them to a logical end. Unicode is a pragmatic
tradeoff between different principles. If one becomes aware of these
principles, it might be easier to understand the tradeoffs as well. Most of
the characters were probably added to Unicode on such pragmatic grounds.
There seem to be at least two principles involved: the semantic and the
graphic. If drawn to their logical end, one should perhaps have at least two
character sets: one for the correct semantic representation, and another for
enabling a correct graphic representation.
In such a logical model, one would probably have to add abstract characters,
enabling names like "pH". But if analyzed more carefully, the "H" is the
chemical name for hydrogen, and "p", I think, is the mathematical name for
the negative logarithm of the concentration of what follows. So the exact
semantic character sequence would perhaps be (in pseudocode):
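One way to sketch that semantic reading (the function name `p` and the sample concentration are purely illustrative, not proposed abstract characters):

```python
import math

def p(concentration):
    # "p" read semantically: the negative base-10 logarithm
    # of the concentration that follows it.
    return -math.log10(concentration)

# "pH" then reads as p applied to the hydrogen-ion concentration.
hydrogen_ion_concentration = 1e-7   # mol/L, neutral water (illustrative)
print(p(hydrogen_ion_concentration))
```

The point is that "pH" is semantically an operator applied to an operand, not two letters that happen to differ in case.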
One would want to keep the uppercase/lowercase combination, not because it
is an English representation, but because it follows the math/chem
notational rules. If writing text in a program like TeX, or perhaps a
better future successor, such semantic values would of course be added in
the input, so that the rendering output can be computed correctly. So your
example in fact gives a good hint of how much extra work would be needed if
one were to strive for a correct semantic representation.
This archive was generated by hypermail 2.1.5 : Fri Feb 18 2005 - 12:45:42 CST