RE: Mixed up priorities

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Sat Oct 23 1999 - 15:51:08 EDT


> -----Original Message-----
> From: arno [mailto:arno@zedat.fu-berlin.de]
> Sent: Saturday, October 23, 1999 1:44 AM
> SEMANTICS?
>
> Another example that Unicode is NOT about semantics, rather
> about abstract shapes (not to be confused with glyphs)
> is U+017F "long s". long-s is clearly a form, a shape of a letter.
> The semantic name _would_ be "non final s". Both in German Fraktur
> and Antiqua (and in English for some years before and after 1900)
> a "non final s" was rendered by some typographers (in some fonts)
> as round-s, in other fonts as long-s -- final-s being always
> rendered by round-s.
>

I hadn't realized this was encoded. It's clearly an alternate graphic form
with [[s]] semantics, whose use is governed by a higher level syntactic
protocol. But its compatibility decomposition is marked as U+0073, s, so
I'm guessing it was in a previous standard and had to be included for that
reason. Good (I need it for a project. ;)
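
For anyone who wants to check the decomposition claim, here is a minimal
sketch using Python's standard unicodedata module. Python is only an
assumption made for illustration; nothing in this thread depends on it. It
looks up the character name and decomposition field of U+017F and compares
canonical (NFC) with compatibility (NFKC) normalization.

```python
import unicodedata

LONG_S = "\u017f"  # the code point discussed above

# Character name and raw decomposition field from the character database.
name = unicodedata.name(LONG_S)
decomp = unicodedata.decomposition(LONG_S)

# Canonical normalization leaves long s alone; compatibility
# normalization folds it to the ordinary letter s (U+0073).
nfc = unicodedata.normalize("NFC", LONG_S)
nfkc = unicodedata.normalize("NFKC", LONG_S)

print(name)
print(decomp)
print("NFC keeps long s:", nfc == LONG_S)
print("NFKC folds to s:", nfkc == "s")
```

The "<compat>" tag in the decomposition field is what marks the mapping as
a compatibility one rather than a canonical equivalence.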

The cases that show this most clearly usually (always?) involve
non-graphical information supplied by the reader. This is the case for
Arabic ta marbuta just as it is for Slovak "ch". In both cases, there is a
distinct semantic unit that has no graphic marker - the reader must have
sufficient grammatical knowledge to decipher the text in order to identify
the sign-function at work. If it were at the level of morphemic or lexical
analysis - e.g. "that's an active participle" then it would clearly be
beyond the scope of a "character" encoding. But in these two cases (no
doubt there are many others) the semantic unit is at a very low level, on
the same plane as the explicitly graphed semantic units. To me that is a
very strong argument for encoding semantics; but such an encoding would
undeniably run counter to Unicode's (un)stated philosophy, at least as I
understand it. Defenders of the faith need not jump to its defense; I can
live with Unicode as it is, I'm merely interested in some areas where it's
not clear that it works very well.

Another suggestion to help disambiguate Unicode's official terminology: use
"alloglyph" where Unicode uses "glyph". Use "glypheme" to denote the set of
all alloglyphs with the same identity; this is essentially the meaning of
Unicode's term "character". In another note I suggested // // as "graphemic
brackets" to denote the form of a sign. This can be refined to distinguish
between glypheme and alloglyph. Use /A/ to denote an alloglyph; use single
squiggle brackets (set notation) to denote glypheme: {A}. Double square
brackets to denote the meaning of the glypheme (which by functional
composition is the meaning of each alloglyph in the glypheme): [[A]].

Summary:

        [[A]] = a semantic unit in the grammar of a language, corresponding
        to the name 'A'. "Grammeme" is the best (or least yucky) name I've
        thought of so far.

        /A/ = an alloglyph; some specific (but unspecified) and unique
        graphical form that a reader would recognize as an instance of a
        glyph named 'A'.

        {A} = a glypheme; the set of all alloglyphs identified by the name
        'A'; the abstract concept or magic that allows us to talk about a
        multiplicity of unique shapes as having the same identity (or
        something like that).
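
The three notations above can be sketched as a small data model. This is
only one possible reading, written in Python for illustration: the class
names, the "variant" label, and the MEANING table are hypothetical and not
part of any standard. It treats round s and long s as two alloglyphs of
{s}, and derives the meaning of an alloglyph by composing "which glypheme
does it belong to" with "what does that glypheme mean".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alloglyph:
    """/A/: one specific graphical form."""
    name: str     # name of the glyph it is identified with
    variant: str  # hypothetical label for the particular shape


@dataclass(frozen=True)
class Glypheme:
    """{A}: the set of all alloglyphs identified by the same name."""
    name: str
    members: frozenset


# Hypothetical inventory: two shapes, one identity.
round_s = Alloglyph("s", "round")
long_s = Alloglyph("s", "long")
s_glypheme = Glypheme("s", frozenset({round_s, long_s}))

# [[A]]: hypothetical mapping from glypheme name to its semantic unit.
MEANING = {"s": "[[s]]"}


def glypheme_of(allo, inventory):
    """Return the glypheme whose member set contains this alloglyph."""
    for glypheme in inventory:
        if allo in glypheme.members:
            return glypheme
    raise LookupError(f"no glypheme contains {allo!r}")


def meaning_of_glypheme(glypheme):
    """Look up the semantic unit for a glypheme."""
    return MEANING[glypheme.name]


def meaning_of_alloglyph(allo, inventory):
    """Functional composition: alloglyph -> glypheme -> meaning."""
    return meaning_of_glypheme(glypheme_of(allo, inventory))


inventory = [s_glypheme]
print(meaning_of_alloglyph(round_s, inventory))
print(meaning_of_alloglyph(long_s, inventory))
```

Both calls go through the same glypheme, so both shapes end up with the
same meaning, which is the point of the composition remark above.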

I've been using something like this set of terms and notations to think about
how writing works for some time now, and it seems to work pretty well, at
least for me. I propose it just in case somebody else out there might find
it useful. Or maybe somebody can point me to a similar or better apparatus
- I've been looking in the library and haven't found one yet.

-gregg


