Re: Arabic - Alef Maqsurah

From: Peter_Constable@sil.org
Date: Thu Jul 15 1999 - 16:02:17 EDT


>But this begs the question. They don't encounter particular morphologies; they
>encounter particular encodings. Encodings, natural and artificial, always
>reflect some theory of language. Change the encoding and you change the
>problem.

[...]

>On the contrary, you cannot *not* build morphological structure into an
>encoding. Unicode already does: lexemes are built by concatenating text atoms.
>Works great for English, not so great for e.g. Arabic. How else can one explain
>the space "character" as a positive element? Even for Arabic, Unicode
>accommodates some level of morphological intelligence: "contextual shaping"
>encodes morphology (prosodic word boundary). Every "natural" encoding of
>language into visual form does the same to some extent. It's not a question of
>whether, but of how much.

I think this fails to recognise, in the general case, the distinction between
writing as a representation of language and language itself. While they are
related (Richard Sproat is working on a book in which he makes specific claims
about the relationship between a writing system for a language and the phonology
of the language), they are clearly different, and a writing system can have
behaviours that are completely independent of the language being represented.
Some examples:

- line direction is a purely visual phenomenon with no connection to language
- non-linearity (e.g. Indic scripts, Pahawh Hmong - these demonstrate that it's
possible for writing to reflect more than phonemes, but they also demonstrate an
independence between the spatial sequence of characters and the temporal
sequence of the phonemes represented; see the sketch after this list)
- Arabic contextual forms: while the typical behaviour is that the contextual
shaping reflects prosodic word boundaries, the fact that this is not done
consistently is a clear indication that the script is independent of the language
- "word" spacing in Kayah Li: if I understand correctly, the spaces between
written words *do not* correspond to morphological or phonological words
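
To make the first two points concrete, here is a minimal Python sketch (my
illustration, assuming Python 3 and its standard unicodedata module, not
anything from the original discussion). The encoded sequence is in logical
order; line direction and the visual placement of the Indic vowel sign are
display-time matters independent of that order:

    import unicodedata

    # Mixed English/Hebrew text, stored in logical (reading) order;
    # the bidi categories below are consulted only at display time.
    s = "abc \u05D0\u05D1\u05D2"          # "abc " then alef, bet, gimel
    for ch in s:
        print("U+%04X" % ord(ch), unicodedata.bidirectional(ch))

    # Devanagari KA + vowel sign I: stored consonant-first, yet the
    # vowel sign is drawn to the LEFT of the consonant when rendered.
    ki = "\u0915\u093F"
    print([unicodedata.name(ch) for ch in ki])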

In the history of computing, text encodings have always been encodings of
writing. The fact that these encodings can be used to support operations of
linguistic interest is due to the degree to which a given writing system for a
particular language reflects the linguistic structure of that language, and is
not due to the encoding being an encoding of linguistic information. This can be
clearly seen from the use of ISO 8859-1, for example, for encoding English and
Spanish texts: the encoded text alone gives a reasonably close approximation of
the pronunciation for Spanish, but not for English. This is because of the
relationship between writing and phonology in each of these two languages, and
is independent of the encoding.
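
A small hedged illustration (my example, in Python): the word "general" is
spelled identically in English and Spanish but pronounced quite differently;
the encoding records only the written form, which is the same for both:

    # ISO 8859-1 records only the written form; the same bytes serve
    # English and Spanish, whatever their pronunciations.
    english = "general"   # English pronunciation
    spanish = "general"   # Spanish pronunciation, quite different
    print(english.encode("iso-8859-1") == spanish.encode("iso-8859-1"))  # True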

It would certainly be possible to design a text encoding for a language (but not
a script) which has a strong basis in the linguistic structure of that language,
but this might have to be done in a conscious departure from traditional
encoding practices. One alternative, though, would be to enrich the script-based
encoding for a given language with additional characters or meta-text markup
that provide linguistic information beyond that contained in the writing. This
would almost certainly not be appropriate for general use in encoding text in
that language, but it may be very useful and appropriate in certain contexts.
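
A hedged sketch of what such enrichment might look like (in Python; the field
names and structure are invented purely for illustration and are not a proposal
of any kind):

    # Hypothetical enriched representation: the orthographic text paired
    # with linguistic information the writing itself does not record.
    annotated = {
        "text": "\u0643\u062A\u0628",   # Arabic kaf, teh, beh, as written
        "root": "k-t-b",                # morphological root, absent from the text
        "vocalization": "kataba",       # pronunciation, likewise absent
    }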

>Understood, but my view is that this is where Unicode itself gets a little
>confused. Does it or does it not encode presentational (visual) form? Arabic
>presentational forms (by which I mean all letterforms used in writing) are
>indeed used in many languages from different families, but do these
>presentational forms share the same character semantics across languages? I
>sincerely doubt it. So an encoding that works across languages must sharply
>distinguish between character semantics and presentational form. Which gets us
>back to grammatical encoding.

It should be pointed out that encoding based upon writing is not identical with
encoding based upon visual form, in the sense of presentation forms. Encodings
can be based upon presentation forms, but they need not be: they can be based
upon an abstract view of the writing in which there is a notion of "character"
distinct from presentation forms, which is precisely the approach of Unicode.
Forgive me for stating the obvious, but I needed to be sure this point wasn't
overlooked.
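
A small Python sketch of that distinction (again my illustration, assuming
Python 3's unicodedata module): Unicode's Arabic presentation forms are
compatibility characters that normalise back to a single abstract character:

    import unicodedata

    # Arabic presentation forms map back to the abstract letter
    # under NFKC normalisation.
    isolated = "\uFE8F"   # ARABIC LETTER BEH ISOLATED FORM
    initial = "\uFE91"    # ARABIC LETTER BEH INITIAL FORM
    for form in (isolated, initial):
        abstract = unicodedata.normalize("NFKC", form)
        print("U+%04X -> U+%04X" % (ord(form), ord(abstract)))
    # Both map to U+0628, ARABIC LETTER BEH: one character, many visual forms.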

There is a point in what Greg is saying here, though: the behaviour of a
script may not be precisely the same in each of the languages whose writing
systems are based upon that script. For example, Thai itself is written without
spaces between words, but some minority languages in Thailand that use the Thai
script *are* written with spaces between words. Thus a complete system for
dealing with multilingual text must support a notion of "writing system" (which
it may be possible to equate with locale). I don't know that I'd agree, though,
that this means the encoding must include grammatical information.
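
A hedged sketch of what such writing-system awareness looks like in practice
(this assumes the third-party PyICU package, offered only as an illustration):
word boundaries in Thai text must come from a locale-aware segmenter, since the
writing records none:

    from icu import BreakIterator, Locale

    # Thai is written without interword spaces, so word boundaries come
    # from a writing-system-aware break iterator, not from the encoding.
    text = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35\u0E04\u0E23\u0E31\u0E1A"
    bi = BreakIterator.createWordInstance(Locale("th"))
    bi.setText(text)
    boundaries = [0] + list(bi)           # iterating yields boundary offsets
    words = [text[i:j] for i, j in zip(boundaries, boundaries[1:])]
    print(words)                          # two written "words", no spaces needed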

I'm discussing all of this in general terms. I'll leave it to others to consider
how it applies to the current discussion of Arabic. I hope it helps.

Peter


