Re: teh marbuta

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Mar 02 2005 - 15:11:40 CST

    > ... Note also that teh marbuta is not
    > traditionally considered a first-class letter in the abjadia; instead it
    > is a clever solution to the problem that a single character (in the deep
    > orthography, if that's the right term) takes two completely different
    > pronunciations depending on context. I suppose the linguists have a
    > word for this sort of thing;

    Yes. Morphophonology. In this particular case you are apparently
    talking about an underlying unit of a morpheme which takes
    one phonemic representation in one morphological context and
    another phonemic representation in another morphological context.

    > to me it looks like teh marbuta makes
    > explicit a feature of deep orthography, or morphology, or in any case
    > its semiotics (can you tell I'm grasping here?) differ from those of
    > the "normal" letters. This is in Arabic; I dunno about Persian, etc.

    But how you analyze the phonology and morphology of Arabic (or any
    other language which happens to use the Arabic script for writing)
    is basically irrelevant to the character encoding. The character
    encoding encodes the visible units of the writing system (the
    graphemes and occasionally subgraphemic pieces, allographs, and
    such). Whether U+0629 ARABIC LETTER TEH MARBUTA is right-joining
    or dual-joining depends on how *that* letter is connected
    cursively in the script, as traditionally treated in the legacy
    Arabic encodings. It does not depend on whether one can argue
    that teh marbuta can be identified as some morphophoneme that
    at a deeper level "really is" a teh that can be connected on
    either side.

    Arabic joining in ArabicShaping.txt is about visible cursive
    joining rules for the writing -- not about morphophonological rules.
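
    As a rough illustration of what that file records, here is a minimal
    sketch that reads joining types from a few lines in the
    ArabicShaping.txt field layout (code point; short name; joining type;
    joining group). The three sample lines are a hand-copied excerpt and
    the helper name parse_joining_types is hypothetical; check the values
    against the version of the Unicode Character Database you actually use.

```python
# Hand-copied excerpt in the ArabicShaping.txt field layout:
#   code point; short name; joining type; joining group
# (Assumption: verify these lines against your copy of the UCD.)
SAMPLE = """\
# code; name; joining type; joining group
0628; BEH; D; BEH
0629; TEH MARBUTA; R; TEH MARBUTA
062A; TEH; D; BEH
"""


def parse_joining_types(text):
    """Map each code point to its joining type (R = right-joining,
    D = dual-joining, and so on)."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        table[int(fields[0], 16)] = fields[2]
    return table


JOINING = parse_joining_types(SAMPLE)
print(JOINING[0x0629])  # R
print(JOINING[0x062A])  # D
```

    The lookup is keyed purely on the encoded character. Nothing in it
    knows or cares that teh marbuta alternates with teh in the morphology.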

    >
    > In other words, it would be useful to encode the *character* teh
    > marbuta, as understood in Arabic tradition. So e.g. a search for
    > risala# should match risalat*kum, and when the -kum is deleted in an
    > editor the software knows the shape of the # should revert to the
    > heh-like shape.

    Would it also be "useful" to encode a *character* for the Latin
    script for English that captures the following significant
    morphophonological alternation?

    /@#tejp/ "a tape"
    /@#nejp/ "a nape"
    /@n#ejp/ "an ape"

    (Where "@" is a schwa, and "#" is a morphological juncture marker.)

    ... so that a search for the indefinite article in English finds
    both "a" and "an" simply by matching on the characters?

    This kind of issue is beyond what a character encoding should be
    concerned with. The *characters* here are simply "a" and "n",
    used in the Latin writing system. The identity of the morphological
    and phonological units for English (or Arabic) is instead an
    issue for morphological analysis and stemming systems -- not
    something to be resolved via character encoding.
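
    To make that division of labor concrete, here is a minimal sketch of
    handling the risala example in a search layer, leaving the encoded
    characters alone. The names fold_teh and stem_matches are hypothetical,
    and folding U+0629 onto U+062A is only a crude stand-in for real
    morphological analysis: it will also produce false matches that a
    proper stemmer would reject.

```python
TEH_MARBUTA = "\u0629"  # ARABIC LETTER TEH MARBUTA
TEH = "\u062A"          # ARABIC LETTER TEH


def fold_teh(text):
    """Build a search key in which teh marbuta and teh compare equal.

    This changes only the key used for matching, never the stored text.
    """
    return text.replace(TEH_MARBUTA, TEH)


def stem_matches(query, word):
    """Return True if `word` begins with `query` once both are folded."""
    return fold_teh(word).startswith(fold_teh(query))


# "risala" written with a final teh marbuta
risala = "\u0631\u0633\u0627\u0644\u0629"
# the same noun with the suffix -kum, where the letter is written as teh
risalatukum = "\u0631\u0633\u0627\u0644\u062A\u0643\u0645"

print(risalatukum.startswith(risala))     # False: the code points differ
print(stem_matches(risala, risalatukum))  # True: the folded keys match
```

    The stored text keeps U+0629 and U+062A as distinct characters; only
    the matching step treats them as related.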

    --Ken


