character semiotics (was RE: Mixed up priorities)

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Thu Oct 21 1999 - 18:36:24 EDT


> -----Original Message-----
> From: Michael Everson [mailto:everson@indigo.ie]
> Sent: Thursday, October 21, 1999 3:07 PM
> To: Unicode List
> Subject: Re: Mixed up priorities
>
>
> At 12:44 -0700 1999-10-21, G. Adam Stanislav wrote:
>
> >I find it ridiculous that when I suggested to treat 'CH' as a character -
> >which it is in Slovak, Czech, and several other languages - I was swamped
> >with the reasons why that should not be the case, but at the same time it
> >is apparently OK to encode fictional "alphabets" such as Klingon in Unicode.
>
> But you are wrong. CH is not a _character_ in any language. It is a set of
> strings of characters (C-H, C-h, c-h) used (sorted etc.) as a _letter_ in
> languages like Slovak, Czech, Welsh, and traditional Spanish.
>

_character_: ah, there's the rub. For the longest time I labored under the
mistaken impression that Unicode's _character_ meant the _semantics_, and
not an abstract shape. (After all, the definitions in the text imply that
pretty strongly.) But recently in a private communication a prominent
member of the Unicode community clarified this for me, and Michael's note
confirms it; it seems that when Unicode says "character" it means "abstract
shape", a kind of Platonic Idea that somehow unifies the shape or shapes of
a "character". Not to be confused with the meaning of those shapes.

This is the only way I can make sense of Michael's explanation here. I
don't know any Slovak, Czech, etc., so I have a question for those who do:
is the denotation of 'CH' a single thing? If so, does the thing have a
name?

This is also the only way I can make sense of the notion of "part of a
character" as described in
http://www-4.ibm.com/software/developer/library/utfencodingforms/index.html:

"In other cases a single character must correspond to two glyphs, because
those two glyphs are positioned around other letters. If one of those glyphs
forms a ligature with other characters, then we have a situation where part
of a character corresponds to part of a glyph. If a character (or any part
of it) corresponds to a glyph (or any part of it), then we say that the
character contributes to the glyph." (followed by a small table showing a
Tamil vowel "character" that in use splits into two "graphemes" bracketing a
consonant)

So is anything wrong with such an approach? Well, in my opinion, yes, very
wrong. I don't think it captures the semiotics of written language very
well at all. I confess I am unable to imagine any member of any literate
community who would agree that characters are divisible, except on the
understanding that "character" is a purely graphical notion. And it seems
to me that if a literary tradition says that the semantics of "dgwej$#^ag;j"
is a single grammatical/cultural semiotic unit (for lack of a better term),
then that is how it should be encoded.

Why should not the pattern of shapes "CH" (we don't have an appropriate
metalanguage; I hope you see what I mean) be construed as a glyph variant
for the cultural unit whose name is "CH"?

Note that such questions are distinct from the question of whether that
cultural unit should be encoded as a single integer or a sequence of two or
more integers; that question is not relevant to the more fundamental
ontological question of what things should be identified for encoding.
Maybe what is needed is a new Unicode table that maps sequences of existing
characters to names, thereby providing an official encoding for semiotic
units without adding to the repertoire of precomposed "characters" which
seems to put certain knickers in a twist. We could say CZECH CH == U+0063 +
U+0068. Everybody wins: Czechs get the encoding of a semantic unit
meaningful in their culture, and Unicode remains minimal.

In the case of characters like the so-called "split characters" in Tamil and
other languages, would it not make more sense to describe the semiotic
system as composed of a graphic pattern with certain structural
characteristics whose denotation is always a single grammatical/cultural
unit whose name is (foobar)? That way we avoid the (to me anyway)
exceedingly ugly notion of parts of characters.

I guess in a word I'd argue that we still have a fundamental ambiguity
regarding the meanings of terms like "character" and "glyph"; the
metalanguage of Unicode, rather than clarifying things, actually rather
muddies them. IMHO.

Sincerely,

Gregg

P.S. I'm scratching my way through Eco's "A Theory of Semiotics" which is
where I get the term "cultural unit" as a kind of fundamental unit of
meaning. I don't have a better term at the moment. And Unicode is about
semiotics, not about "text processes".


