> Forgive me if you've already addressed this
It was a while ago!:
| Date: 10 Nov 95 14:41:36 PST (Friday)
| Subject: Re: More Arabic
| In the case of TEH MARBUTA, as a phenomenon it too is solely final.
If a word which would have ended with TEH MARBUTA is extended with
grammatical endings, the typist must replace TEH MARBUTA by an ordinary TEH;
i.e. the encoding is designed to require such replacement, rather than
having the TEH MARBUTA mutate into the *appearance* of a TEH. This
preserves the uniqueness of the correct spelling.
The point is that the encoding is chosen to *model* (I like your term) the
linguistic / orthographic realities in one way or another, with design
choices being made to meet encoding constraints. In this case, the
convenience of providing an (automatic) medial-TEH form for TEH MARBUTA
would be balanced by the cost of creating duplicate encodings for all forms
like "risAla#uhu / risAlatuhu".
This tradeoff could have been made the other way, but ultimately all that
matters is that a single encoding *convention* become adopted in practice.
My understanding is that TEH MARBUTA is treated in this manner by the ASMO
449 and ISO 8859-6 standards that were the source for the Unicode Arabic
set, i.e. that "risAla#uhu" is conventionally encoded as "risAlatuhu" in
on-line Arabic text.
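
To make the convention concrete, here is a minimal sketch of the replacement the typist (or an input method) performs. The helper name attach_suffix is hypothetical, and the words are spelled without vowel marks, so "risAla#" is REH SEEN ALEF LAM TEH MARBUTA and "-hu" is a bare HEH.

```python
TEH_MARBUTA = "\u0629"  # ARABIC LETTER TEH MARBUTA
TEH = "\u062A"          # ARABIC LETTER TEH


def attach_suffix(word: str, suffix: str) -> str:
    """Hypothetical helper: append a grammatical ending to a word.

    If the word ends in TEH MARBUTA and a non-empty suffix follows,
    the TEH MARBUTA is respelled as an ordinary TEH, as the encoding
    convention described above requires.
    """
    if suffix and word.endswith(TEH_MARBUTA):
        word = word[:-1] + TEH
    return word + suffix


risala = "\u0631\u0633\u0627\u0644\u0629"  # risAla# (unvocalized)
hu = "\u0647"                              # -hu (unvocalized)
risalatuhu = attach_suffix(risala, hu)     # stored with TEH, not TEH MARBUTA
```

The result contains no TEH MARBUTA at all, so there is exactly one encoded spelling of the extended word.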
> a search for a word like risAla# should return all forms of the word ...
> Unicode would not support this
You correctly noted the analogy to upper/lower case letters, so the
analogous statement would be
    a search for a word like cat should return all forms of the word (incl.
    Cat and CAT) ...
    ASCII would not support this
Both statements are category errors: an encoding only fixes how text is
represented, and of course supports any text processing you care to program
on top of it.
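
As a rough illustration of that point, the sketch below programs such a search on top of the encoding, in the same spirit as a case-insensitive search over ASCII. The function name search_forms is hypothetical and the matching rule is deliberately crude: a query ending in TEH MARBUTA also matches the same stem followed by TEH plus an ending, which would over-match a word whose stem happens to be followed by a radical TEH.

```python
import re
from typing import List

TEH_MARBUTA = "\u0629"  # ARABIC LETTER TEH MARBUTA
TEH = "\u062A"          # ARABIC LETTER TEH


def search_forms(query: str, text: str) -> List[str]:
    """Hypothetical search: return the whitespace-separated words of
    `text` that are forms of `query`.

    A final TEH MARBUTA in the query is allowed to appear in the text
    either as itself (word-final) or as TEH followed by an ending.
    """
    if query.endswith(TEH_MARBUTA):
        stem = re.escape(query[:-1])
        pattern = re.compile(f"{stem}(?:{TEH_MARBUTA}|{TEH}\\w+)")
    else:
        pattern = re.compile(re.escape(query))
    return [word for word in text.split() if pattern.fullmatch(word)]


def search_any_case(query: str, text: str) -> List[str]:
    """The ASCII analogue: match words regardless of letter case."""
    key = query.casefold()
    return [word for word in text.split() if word.casefold() == key]


risala = "\u0631\u0633\u0627\u0644\u0629"            # risAla#
risalatuhu = "\u0631\u0633\u0627\u0644\u062A\u0647"  # risAlatuhu
kitab = "\u0643\u062A\u0627\u0628"                   # an unrelated word
arabic_hits = search_forms(risala, f"{risala} {kitab} {risalatuhu}")
ascii_hits = search_any_case("cat", "cat dog Cat CAT")
```

In both cases the folding lives in the search code, not in the character set.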