Re: ta' marbuta

From: Mark Davis (mark@macchiato.com)
Date: Wed Aug 25 1999 - 12:28:44 EDT


If you look at
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeCharacterDatabase-3.0.0.beta.html,
you will find that "These properties are normative for minimal shaping of
Arabic and Syriac."

They are not normative for more than minimal shaping. More sophisticated
typography can vary the shape, ligatures and joining properties of characters.
However, essentially every other system and application treats ta marbuta as
having no medial form. If you treat it as if it had one, your text will look
like, and be, a misspelling once you transmit your data to those systems.
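
To make the joining behavior concrete, here is a rough sketch of minimal
shaping in Python. It is an illustration only, not a real shaping engine. The
table JOINING_TYPE and the function positional_forms are hypothetical names
introduced for this example; the joining types are copied by hand for a few
letters, and the sketch assumes unvocalized text (no harakat or other
transparent characters).

```python
# Joining types for a handful of letters, copied by hand for this sketch.
# "D" = dual-joining (connects on both sides),
# "R" = right-joining (connects only to the preceding letter).
JOINING_TYPE = {
    "\u0627": "R",  # ALEF
    "\u0631": "R",  # REH
    "\u0633": "D",  # SEEN
    "\u0644": "D",  # LAM
    "\u0629": "R",  # TEH MARBUTA
    "\u062A": "D",  # TEH
    "\u0647": "D",  # HEH
}


def positional_forms(word):
    """Return the positional form of each letter under minimal shaping.

    Assumes every character of `word` is a key of JOINING_TYPE.
    """
    forms = []
    for i, ch in enumerate(word):
        # A letter connects backwards only if the previous letter is dual-joining.
        joins_prev = i > 0 and JOINING_TYPE[word[i - 1]] == "D"
        # A letter connects forwards only if it is itself dual-joining
        # and another letter follows.
        joins_next = JOINING_TYPE[ch] == "D" and i + 1 < len(word)
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# "risAlatuhu" without vowel marks: REH SEEN ALEF LAM TEH HEH
WITH_TEH = "\u0631\u0633\u0627\u0644\u062A\u0647"
# The same word spelled with TEH MARBUTA in place of TEH
WITH_TEH_MARBUTA = "\u0631\u0633\u0627\u0644\u0629\u0647"

print(positional_forms(WITH_TEH))
print(positional_forms(WITH_TEH_MARBUTA))
```

With TEH the fifth letter takes a medial form and the HEH joins to it. With
TEH MARBUTA the fifth letter takes a final form and the HEH is left isolated,
which is why the second spelling reads as a misspelling on systems that apply
only minimal shaping.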

While there are features of Unicode that we would have done differently--had we
to do them over again--the structure of the encoding must provide a balance
among many different factors: implementation requirements, the behavior of
characters when mapped back and forth between legacy character sets, linguistic
requirements, and so on.

Mark

"Reynolds, Gregg" wrote:

> Dear Joseph,
>
> Thanks for your response. Alas, I don't think it addresses the issues I
> would like to raise: I think the passage you quote misinterprets the
> semantics of ta marbuta. However, as a courtesy to those on the list who
> consider such discussions more noise than signal, I'll put the full
> explanation of it on a web page, along with the text of a standard grammar,
> and send the url to the list.
>
> You advance one argument that I would like to address, however, and that is
> the CAT/cAt/cat/etc distinction, the handling of which you seem to claim is
> an implementation issue. I would point out that case distinctions are
> normative in Unicode, so we know the semantics regardless of implementation
> behavior. It's not clear to me if Joining Class is normative or not:
> presumably it is, though it is not listed in the table at the beginning of
> Chapter 4. But in this case, it looks to me like Unicode positively
> prohibits proper interpretation of ta marbuta in Arabic language texts!
> Have I misunderstood something about Unicode here?
>
> -gregg
>
> > -----Original Message-----
> > From: Becker, Joseph [mailto:Joseph.Becker@pahv.xerox.com]
> > Sent: Monday, August 23, 1999 3:06 PM
> > To: Unicode List
> > Cc: Unicode List
> > Subject: RE: ta' marbuta
> >
> >
> >
> > > Forgive me if you've already addressed this
> >
> > It was a while ago!:
> >
> > | Date: 10 Nov 95 14:41:36 PST (Friday)
> > | Subject: Re: More Arabic
> > |
> > | ...
> > |
> > | In the case of TEH MARBUTA, as a phenomenon it too is solely final.
> > | If a word which would have ended with TEH MARBUTA is extended with
> > | grammatical endings, the typist must replace TEH MARBUTA by an ordinary
> > | TEH; i.e. the encoding is designed to require such replacement, rather
> > | than having the TEH MARBUTA mutate into the *appearance* of a TEH. This
> > | preserves the uniqueness of the correct spelling.
> > |
> > | ...
> > |
> >
> > The point is that the encoding is chosen to *model* (I like your term) the
> > linguistic / orthographic realities in one way or another, with design
> > choices being made to meet encoding constraints. In this case, the
> > convenience of providing an (automatic) medial-TEH form for TEH MARBUTA
> > would be balanced by the cost of creating duplicate encodings for all forms
> > like "risAla#uhu / risAlatuhu".
> >
> > This tradeoff could have been made the other way, but ultimately all that
> > matters is that a single encoding *convention* become adopted in practice.
> > My understanding is that TEH MARBUTA is treated in this manner by the ASMO
> > 449 and ISO 8859-6 standards that were the source for the Unicode Arabic
> > set, i.e. that "risAla#uhu" is conventionally encoded as "risAlatuhu" in
> > on-line Arabic text.
> >
> > > a search for a word like risAla# should return all forms of the word ...
> > > Unicode would not support this
> >
> > You correctly noted the analogy to upper/lower case letters, so the
> > analogous statement would be
> >
> > a search for a word like cat should return all forms of the word (incl.
> > Cat and CAT) ...
> > ASCII would not support this
> >
> > These statements are category errors: the encodings of course support any
> > text processing you care to program.
> >
> > Joe
> >
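
As an illustration of the point that a search across word forms is a matter
of programming rather than of encoding, here is a crude sketch in Python. It
folds TEH MARBUTA to TEH before comparing, in the same spirit as case folding
for cat / Cat / CAT. The names fold_teh and find_word_forms are hypothetical,
and the plain prefix match is an assumption made to keep the example short; a
real search would need proper morphological handling and would also have to
deal with vowel marks.

```python
TEH_MARBUTA = "\u0629"
TEH = "\u062A"


def fold_teh(text):
    """Map TEH MARBUTA to TEH so that both spellings compare equal."""
    return text.replace(TEH_MARBUTA, TEH)


def find_word_forms(stem, words):
    """Return the words that start with `stem` once both sides are folded.

    A deliberately naive prefix match: it will over-match unrelated words
    that merely share the same leading letters.
    """
    key = fold_teh(stem)
    return [w for w in words if fold_teh(w).startswith(key)]


# "risAla#" without vowel marks: REH SEEN ALEF LAM TEH MARBUTA
RISALA = "\u0631\u0633\u0627\u0644\u0629"
# "risAlatuhu" without vowel marks: REH SEEN ALEF LAM TEH HEH
RISALATUHU = "\u0631\u0633\u0627\u0644\u062A\u0647"
# An unrelated word: KAF TEH ALEF BEH
KITAB = "\u0643\u062A\u0627\u0628"

matches = find_word_forms(RISALA, [RISALA, RISALATUHU, KITAB])
print(matches == [RISALA, RISALATUHU])
```

Searching for the form ending in TEH MARBUTA finds both the bare word and the
suffixed form spelled with TEH, while the stored text keeps a single
conventional spelling for each form.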


