RE: character semiotics (was RE: Mixed up priorities)

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Thu Oct 21 1999 - 21:07:00 EDT


Hi Andrea,

> -----Original Message-----
> From: A. Vine [mailto:avine@eng.sun.com]
> Sent: Thursday, October 21, 1999 6:16 PM
> "Reynolds, Gregg" wrote:

> But, Gregg, what is meaning, after all? Is 'f' a semiotic
> unit to you? Is
> 'I'? Does 'I' hold a greater significance than 'f' because
> it has another
> meaning? Why is 'f' encoded and not 'if'? If 'if' were what
> the average
> English speaker would identify as a single letter, is it
> sufficient to say it's
> encoded as 'i' + 'f'?
>

Good questions, for which I think we can come up with workable (if not
theologically and cosmically "true") answers. By workable I mean something
along the lines of "a system of terms, definitions, etc. that serves to
closely model the 'real' semiotics of written language (and thus answer to
the expectations of literate communities) for the purposes of formal
language design and software specification (thus answering to the
expectations of software vendors)". I think it is possible to agree on a
fairly precise set of formal definitions for modeling written language by
drawing on linguistics, semiotics, mathematical logic, etc. Stuff that's
been around for quite some time, actually.

The first thing I would note is that our ordinary means of discourse is
incredibly impoverished when it comes to talking about written language. We
have quote marks and various typographic conventions and that's about
it. Very flexible, but not very precise. So take for example your question
"is 'f' a semiotic unit": you could be referring to the graphical thingee,
or the phonological thingee, or a third thingee, or maybe even the
sign-function thingee that ties some or all of these other thingees
together. Etc. (I would answer yes in each case.) Great fun, actually.
And I'm not picking on your usage; examine almost any piece of writing by
any specialist that discusses grammatology, and you'll find it shot through
with informal usage that relies on the reader to figure out which register
to use in interpreting things like 'f' (or should that be "'f'"?). Watch
how often it happens on this list.

In any case, to answer your questions, I would start by positing that we
need to model two things at least, one being the visual aspect of written
language (graphemes, visual syntax, etc.) (i.e. the signifiers), and the
other being the things denoted by such forms. I think Unicode works on the
former, not the latter. I don't have a good term for the latter yet, but
for now let's call them "grammemes". ("Cultural unit" is a tad too general
and would cover just about everything. I guess we could go for the TLA:
GCU = grammatical cultural unit. Wheee!)

Grammemes are not phonemes. Research has shown that reading does not
necessarily involve phonological activity in the brain. (If you're
interested, I can supply the references.) The set of grammemes associated
with a particular written language amounts to a theory of language. They
represent the cognitive categories literates use to think about language,
and don't necessarily follow modern linguistic analysis. "Grammeme" because
the line between basic units such as "letters" in the traditional sense and
higher-level grammatical concepts is blurry in some languages. Arabic
provides several examples, ta marbuta being the most obvious. Either a
medial ta form or a final dotted heh form may represent ta marbuta in
Arabic, but the name "ta marbuta" itself denotes a complex packaging of
rules relating phonology, morphology, and syntax. It is not considered an
element of the traditional Arabic alphabet, but it is definitely part of
basic Arabic orthography and literacy - one should be able to search on it,
for example. So it's a grammeme.

I seem to have slipped into dissertation mode again. Sorry 'bout that. To
get back to your questions, I would say that by 'f' we designate a pairing
of graphic form and grammeme - a sign-function, in semiotic terms. 'I' is
another; the fact that it can enter into other semiotic (lexical) relations
can be disregarded, since our guide is the set of 'letter' grammemes
associated with (pick your language). 'if' is not encoded because the
community of literates doesn't think of the graphic form as denoting a
single irreducible grammeme - if it did, then it would merit a code point,
as 'ch' in some languages surely does.
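
To make the 'ch' point concrete, here is a rough sketch (not a proposal for
how any standard does or should work) of segmenting a string into letter
units according to a language-specific inventory. The function name
`segment` and the two inventories are hypothetical, invented purely for
illustration; the only assumption about real languages is that Czech
treats 'ch' as a single letter of its alphabet while English does not.

```python
def segment(text, multigraphs):
    """Split text into letter units by greedy longest match.

    `multigraphs` is a hypothetical per-language set of multi-character
    sequences that the literate community treats as one letter.
    """
    # Try longer sequences first; ignore empty strings so the loop always advances.
    ordered = sorted((m for m in multigraphs if m), key=len, reverse=True)
    units = []
    i = 0
    while i < len(text):
        for m in ordered:
            if text.startswith(m, i):
                units.append(m)
                i += len(m)
                break
        else:
            # No multigraph matched here: fall back to a single character.
            units.append(text[i])
            i += 1
    return units


# Illustrative inventories only, not complete descriptions of either language.
CZECH_MULTIGRAPHS = {"ch"}
ENGLISH_MULTIGRAPHS = set()

print(segment("chata", CZECH_MULTIGRAPHS))   # ['ch', 'a', 't', 'a']
print(segment("chat", ENGLISH_MULTIGRAPHS))  # ['c', 'h', 'a', 't']
```

The same two graphic forms 'c' and 'h' come out as one unit or two
depending only on which inventory is consulted, which is the sense in
which the unit belongs to the language rather than to the script.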

This does not mean that the graphical form used to represent it cannot be
analyzed into constituent parts that are themselves encoded. It would not be
problematic to say that the grammeme 'if' may be represented visually by the
sequence of two _graphemes_ 'i' and 'f'. But "grammeme i" plus "grammeme f"
does not equal "grammeme if" though they might equal "lexeme if" - that
would be for higher level protocols to decide. U+0BCA TAMIL VOWEL SIGN O, I
am willing to bet, is considered by Tamil literates a single form denoting a
single grammeme. But it would be entirely reasonable to analyze the form
used to denote that grammeme into its constituent parts and encode them
separately _qua graphic forms_ without a corresponding grammeme denotatum.
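
As a small illustration of the Tamil case, the sketch below uses Python's
standard `unicodedata` module to show that U+0BCA carries a canonical
decomposition into two separately encoded vowel signs. The variable names
are made up for the example, and it says nothing about how Tamil literates
actually think of the sign; it only shows the single form and its
constituent parts coexisting in the encoding.

```python
import unicodedata

# U+0BCA TAMIL VOWEL SIGN O, the single encoded form discussed above.
TAMIL_VOWEL_SIGN_O = "\u0BCA"

# Canonical decomposition (NFD) splits it into its constituent graphic parts.
decomposed = unicodedata.normalize("NFD", TAMIL_VOWEL_SIGN_O)

for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0BC6 TAMIL VOWEL SIGN E
# U+0BBE TAMIL VOWEL SIGN AA

# Canonical composition (NFC) puts the parts back together as one code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == TAMIL_VOWEL_SIGN_O)  # True
```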

> Is Unicode's lack of capturing the semiotics of written
> language a by-product of
> its philosophy of characters,

I think so. Also of its notion of plain text, and the whole underlying
notion of "script without language". It's not the worst idea in the world,
but it comes at a cost, and I've never seen a real careful analysis of what
we (well, not me and my pals but certainly others) give up by adopting
Unicode's modeling strategy.

> or a result of the restrictions
> imposed on it by
> existing computer systems and software?

Must have had a lot to do with it. But on the other hand, I don't think a
more balanced approach would necessarily mean software designs incompatible
with today's software.

If it were a question of standardizing widget interfaces it wouldn't matter
much, but we're talking about standardizing a model of language, which is
pretty close to home for everybody.

Add another possible cause: specialization. Very few people are insane
enough to try to master the disparate fields (computer science, mathematical
logic, linguistics, textual theory, psycholinguistics, etc., etc.) that
converge here. Most of the people in the humanities with whom I've
discussed Unicode have almost no clue as to what plain text is, let alone
how formal modeling works. I don't mean that the people involved are not
qualified, only that the pool is pretty small.

Cheers,

Gregg


