From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Mar 02 2005 - 15:11:40 CST
> ... Note also that teh marbuta is not
> traditionally considered a first-class letter in the abjadia; instead it
> is a clever solution to the problem that a single character (in the deep
> orthography, if that's the right term) takes two completely different
> pronunciations depending on context. I suppose the linguists have a
> word for this sort of thing;
Yes. Morphophonology. In this particular case you are apparently
talking about an underlying unit of a morpheme which takes
one phonemic representation in one morphological context and
another phonemic representation in another morphological context.
> to me it looks like teh marbuta makes
> explicit a feature of deep orthography, or morphology, or in any case
> its semiotics (can you tell I'm grasping here?) differ from those of
> the "normal" letters. This is in Arabic; I dunno about Persian, etc.
But how you analyze the phonology and morphology of Arabic (or any
other language which happens to use the Arabic script for writing)
is basically irrelevant to the character encoding. The character
encoding encodes the visible units of the writing system (the
graphemes and occasionally subgraphemic pieces, allographs, and
such). Whether U+0629 ARABIC LETTER TEH MARBUTA is right-joining
or dual-joining depends on how *that* letter is connected
cursively in the script, as traditionally treated in the legacy
Arabic encodings. It does not depend on whether one can argue
that teh marbuta can be identified as some morphophoneme that
at a deeper level "really is" a teh that can be connected on
either side.
Arabic joining in ArabicShaping.txt is about visible cursive
joining rules for the writing -- not about morphophonological rules.
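To make the point concrete, here is a minimal sketch of looking up a
joining type from data laid out the way ArabicShaping.txt lays it out
(code point; schematic name; joining type; joining group). The four
sample lines are hand-copied for illustration and should be checked
against the real file, and parse_joining_types is a made-up helper
name, not part of any library. R means right-joining, D means
dual-joining.

```python
# Hand-copied illustrative excerpt in the ArabicShaping.txt field layout:
# code point; schematic name; joining type; joining group
# (Assumption: verify these lines against the actual data file.)
SAMPLE = """\
0627; ALEF; R; ALEF
0628; BEH; D; BEH
0629; TEH MARBUTA; R; TEH MARBUTA
062A; TEH; D; BEH
"""


def parse_joining_types(text):
    """Map each code point to its joining type (hypothetical helper)."""
    table = {}
    for line in text.splitlines():
        # Drop trailing comments and skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        table[int(fields[0], 16)] = fields[2]
    return table


JOINING = parse_joining_types(SAMPLE)

# The joining type is a property of the visible letter, looked up by
# code point; nothing here knows or cares about morphology.
print(JOINING[0x0629])  # teh marbuta
print(JOINING[0x062A])  # teh
```

The lookup is keyed purely on the encoded letter, so the data can give
teh marbuta and teh different joining types no matter how the two are
related in the grammar.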
>
> In other words, it would be useful to encode the *character* teh
> marbuta, as understood in Arabic tradition. So, e.g., a search for
> risala# should match risalat*kum, and when the -kum is deleted in an
> editor the software knows the shape of the # should revert to the
> heh-like shape.
Would it also be "useful" to encode a *character* for the Latin
script for English that captures the following significant
morphophonological alternation?
/@#tejp/ "a tape"
/@#nejp/ "a nape"
/@n#ejp/ "an ape"
(Where "@" is a schwa, and "#" is a morphological juncture marker.)
... so that a search for the indefinite article in English finds
both "a" and "an" simply by matching on the characters?
This kind of issue is beyond what a character encoding should be
concerned with. The *characters* here are simply "a" and "n",
used in the Latin writing system. The identity of the morphological
and phonological units for English (or Arabic) is instead an
issue for morphological analysis and stemming systems -- not
something to be resolved via character encoding.
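As a rough sketch of what handling this in a layer above the encoding
could look like: the matching logic lives in code that knows about the
language, and the characters stay plain "a", "n", teh and teh marbuta.
Everything named here (find_indefinite_articles, search_key, and the
choice to fold teh marbuta to teh for matching) is an assumption made
for illustration, not an established algorithm, and a real stemmer
would have to do far more.

```python
import re

# English: one pattern covers both surface forms of the indefinite article.
INDEFINITE_ARTICLE = re.compile(r"\b(?:a|an)\b", re.IGNORECASE)


def find_indefinite_articles(text):
    """Return every 'a' or 'an' that stands as a separate word."""
    return [match.group(0) for match in INDEFINITE_ARTICLE.finditer(text)]


# Arabic: fold teh marbuta to teh when building a search key, so the
# bare noun and its suffixed form compare equal at the stem.
# (Assumption: a deliberately crude fold, for illustration only.)
TEH_MARBUTA = "\u0629"
TEH = "\u062A"


def search_key(word):
    """Hypothetical search-time fold; the stored text is left untouched."""
    return word.replace(TEH_MARBUTA, TEH)


# reh, seen, alef, lam, teh marbuta (unvocalized "risala")
RISALA = "\u0631\u0633\u0627\u0644\u0629"
# reh, seen, alef, lam, teh, kaf, meem (unvocalized "risalatkum")
RISALATKUM = "\u0631\u0633\u0627\u0644\u062A\u0643\u0645"

print(find_indefinite_articles("a tape, a nape, an ape"))
print(search_key(RISALATKUM).startswith(search_key(RISALA)))
```

The fold is applied only to the search keys; what is stored and
displayed is still the ordinary encoded text.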
--Ken