From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Jul 06 2004 - 05:09:41 CDT
On 03/07/2004 05:52, Donald Z. Osborn wrote:
>I've read selected messages in this thread (on Unicode list) and some messages
>bring to mind the thought of developing routines or standards to permit
>toggling back and forth between standard Latin and Arabic transcriptions for
>the same language, such as between the Boko and Ajami writing of Hausa. (Same
>applies to any two or three transcription systems used for particular
>languages.)
>
>
There are many languages in Asia which are written in Latin, Cyrillic or
Arabic script, or at least two of the three, and the same kinds of
problems apply to them. (There are also languages written in Arabic and
Indic scripts, but I don't know enough about these to be helpful.)
Toggling between Latin and Cyrillic scripts is relatively easy, although
in some languages it is complicated because a single Cyrillic letter can
correspond to two Latin ones, sometimes depending on context: e.g.
Cyrillic e becomes Latin ye word-initially but just e in other
positions. Most of these conversions can be programmed easily. There is
a small problem with the new Uzbek Latin alphabet, which is deliberately
based on ASCII only, with apostrophe serving as a diacritic: sh, ch and
gh are usually digraphs but in principle can be sequences of separate
letters (compare English cathode vs. cathouse).
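
A minimal sketch of the two points above, in Python. The letter table is
a small illustrative subset that I am assuming for the example, not a
complete or authoritative table for any real orthography, and the
function names are hypothetical. The first function applies the
word-initial rule for Cyrillic e; the second lists every way a Latin
string could be split when sh and ch might be either digraphs or
separate letters.

```python
# Illustrative subset only (an assumption for this sketch); a real
# converter needs the full, language-specific table.
CYR_TO_LAT = {
    "а": "a", "б": "b", "д": "d", "к": "k", "л": "l", "м": "m",
    "н": "n", "р": "r", "с": "s", "т": "t", "ш": "sh", "ч": "ch",
}

# Two of the digraphs mentioned above.
DIGRAPHS = ("sh", "ch")


def cyr_to_lat(word):
    """Convert one Cyrillic word, treating Cyrillic e by position."""
    out = []
    for i, ch in enumerate(word.lower()):
        if ch == "е":
            # Context rule: "ye" word-initially, plain "e" elsewhere.
            out.append("ye" if i == 0 else "e")
        else:
            # Letters missing from the toy table pass through unchanged.
            out.append(CYR_TO_LAT.get(ch, ch))
    return "".join(out)


def segmentations(word):
    """Return every split of a Latin word into letters and digraphs."""
    if not word:
        return [[]]
    # Reading 1: the first character stands alone.
    results = [[word[0]] + rest for rest in segmentations(word[1:])]
    # Reading 2: the first two characters form a digraph.
    if word[:2] in DIGRAPHS:
        results += [[word[:2]] + rest for rest in segmentations(word[2:])]
    return results
```

With this table, cyr_to_lat("ер") gives "yer" and cyr_to_lat("мен")
gives "men". segmentations("ash") returns two readings, a + s + h and
a + sh, which is exactly the cathode/cathouse problem: the reverse
conversion cannot choose between them without knowing the word.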
Changing in and out of Arabic script is much more complicated. The main
issue is that Arabic loan words (which are common in most of these
languages) usually have to be spelled exactly as in Arabic (oddly,
except for TEH MARBUTA which becomes either TEH or HEH) even though many
of the distinctions are lost in pronunciation and therefore in Latin and
Cyrillic script. It is therefore impossible to transliterate
automatically into Arabic script - even with an extensive dictionary
there are potential ambiguities. The reverse direction is problematic
mainly because Arabic script does not have a standardised way of marking
all of the vowel distinctions made in Latin and Cyrillic, and anyway
certain vowels are often not written at all.
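
To illustrate why the Latin-to-Arabic direction is ambiguous, here is a
rough Python sketch. It assumes, purely for illustration, that each of a
few Latin consonants could stand for several Arabic letters whose sounds
have merged; which letters actually merge depends on the language, so
the table and the function name are hypothetical. The function
enumerates every candidate spelling and optionally filters them against
a word list.

```python
from itertools import product

# Assumed groupings for illustration only; the real sets are
# language-specific.
CANDIDATES = {
    "s": ["\u0633", "\u0635", "\u062B"],            # SEEN, SAD, THEH
    "z": ["\u0632", "\u0630", "\u0636", "\u0638"],  # ZAIN, THAL, DAD, ZAH
    "t": ["\u062A", "\u0637"],                      # TEH, TAH
    "h": ["\u0647", "\u062D"],                      # HEH, HAH
}


def arabic_candidates(latin, lexicon=None):
    """List possible Arabic-script spellings of a Latin letter string.

    Letters missing from CANDIDATES are passed through unchanged as
    placeholders. If a lexicon (a set of known spellings) is given,
    only candidates found in it are kept.
    """
    options = [CANDIDATES.get(ch, [ch]) for ch in latin.lower()]
    spellings = ["".join(combo) for combo in product(*options)]
    if lexicon is not None:
        spellings = [s for s in spellings if s in lexicon]
    return spellings
```

Even the two-letter input "st" yields 3 x 2 = 6 candidates under these
assumptions. A lexicon narrows the list, but if more than one candidate
is a real word the ambiguity remains, and the sketch does nothing about
unwritten vowels.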
>...
>
>Because there is generally not a 1-to-1 character correspondence in spellings in
>different transcriptions, I wonder if you don't end up having to consider
>something that operates a bit like machine translation, analyzing the context
>of words in cases where transcription of a word in one system could be
>transliterated into something misspelled or taken as more than one word in the
>other system. Necessarily, I think, such routines would have to be
>language-specific.
>
>
From my experience, if Arabic script is involved, they would certainly
have to be language-specific, and to achieve a correct result they would
also need to be rather intelligent, or rely on human intervention. As a
very simple example, "Sudan" in Turkish or Azerbaijani can be the name
of a country or it can mean "from water", and the correct Arabic
spellings are likely to be very different, and can be disambiguated only
by complete parsing of the context.
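
The "Sudan" case can be sketched in Python as follows. This is only a
toy under stated assumptions: the word lists are hypothetical and hold
one entry each, only the single suffix form "dan" is handled, and no
real parsing of context is attempted. All it shows is how a converter
could detect that a token has more than one analysis and so needs
either context or a human to decide.

```python
# Hypothetical, one-entry word lists for illustration only.
PROPER_NOUNS = {"sudan"}
STEMS = {"su": "water"}
ABLATIVE = "dan"  # "from"; other forms of the suffix are ignored here


def analyses(token):
    """Return every analysis of a token that this toy lexicon allows."""
    t = token.lower()  # plain lowercasing; ignores Turkish dotted/dotless i
    found = []
    if t in PROPER_NOUNS:
        found.append(("proper noun", t))
    stem = t[:-len(ABLATIVE)]
    if t.endswith(ABLATIVE) and stem in STEMS:
        found.append(("stem + ablative", stem + "+" + ABLATIVE))
    return found


def needs_review(token):
    """True when a token is ambiguous and cannot be converted blindly."""
    return len(analyses(token)) > 1
```

Here analyses("Sudan") returns both the proper-noun reading and the
su + dan reading, so needs_review("Sudan") is true. Choosing between
them is the part that requires parsing the surrounding sentence.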
-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/