Transliteration in Asia, was Re: Hausa: Boko<->Ajami?

From: Peter Kirk
Date: Tue Jul 06 2004 - 05:09:41 CDT

    On 03/07/2004 05:52, Donald Z. Osborn wrote:

    >I've read selected messages in this thread (on Unicode list) and some messages
    >bring to mind the thought of developing routines or standards to permit
    >toggling back and forth between standard Latin and Arabic transcriptions for
    >the same language, such as between the Boko and Ajami writing of Hausa. (Same
    >applies to any two or three transcription systems used for particular

    There are many languages in Asia which are written in Latin, Cyrillic or
    Arabic script, or at least two of the three, and the same kinds of
    problems apply to them. (There are also languages written in Arabic and
    Indic scripts, but I don't know enough about these to be helpful.)

    Toggling between Latin and Cyrillic scripts is relatively easy, although
    in some languages it is complicated because a single Cyrillic letter can
    correspond to two Latin ones, sometimes depending on context: for
    example, Cyrillic e becomes Latin ye word-initially but just e in other
    positions. Most of these conversions can be programmed easily. There is
    a small problem with the new Uzbek Latin alphabet, which is deliberately
    based on ASCII alone, with the apostrophe serving as a diacritic: sh, ch
    and gh are usually digraphs, but in principle each can also be a
    sequence of two separate letters (compare English "cathode" vs.
    "cathouse").
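
    As a rough illustration of the two points above, here is a minimal
    Python sketch. The mapping tables are hypothetical and deliberately
    tiny; they are not the official alphabet of Uzbek or of any other
    language. The function names cyr_to_lat and lat_to_cyr are likewise
    invented for this example.

```python
# Hypothetical, deliberately tiny tables for illustration only; they are
# not the official alphabet of Uzbek or of any other language.
CYR_TO_LAT = {
    "а": "a", "д": "d", "е": "e", "л": "l", "м": "m", "н": "n",
    "о": "o", "р": "r", "с": "s", "т": "t", "ҳ": "h", "ш": "sh", "ч": "ch",
}
LAT_DIGRAPHS = {"sh": "ш", "ch": "ч"}
LAT_SINGLE = {lat: cyr for cyr, lat in CYR_TO_LAT.items() if len(lat) == 1}


def cyr_to_lat(text: str) -> str:
    """Cyrillic to Latin, writing the letter е as "ye" at the start of a word."""
    out = []
    prev_is_letter = False
    for ch in text:
        low = ch.lower()
        if low == "е" and not prev_is_letter:
            lat = "ye"  # context rule: word-initial position
        else:
            lat = CYR_TO_LAT.get(low, ch)  # unknown characters pass through
        if ch != low:
            lat = lat.capitalize()  # keep an initial capital
        out.append(lat)
        prev_is_letter = ch.isalpha()
    return "".join(out)


def lat_to_cyr(text: str) -> str:
    """Latin to Cyrillic (lowercase input), always preferring digraphs."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in LAT_DIGRAPHS:
            out.append(LAT_DIGRAPHS[pair])
            i += 2
        else:
            out.append(LAT_SINGLE.get(text[i], text[i]))
            i += 1
    return "".join(out)
```

    Because lat_to_cyr always prefers the digraph, it cannot tell a true
    sh from an s followed by an h; a real converter would need an
    exception list or some marker in the text. The sketch also does not
    reverse the word-initial ye rule.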

    Changing in and out of Arabic script is much more complicated. The main
    issue is that Arabic loan words (which are common in most of these
    languages) usually have to be spelled exactly as in Arabic (oddly,
    except for TEH MARBUTA which becomes either TEH or HEH) even though many
    of the distinctions are lost in pronunciation and therefore in Latin and
    Cyrillic script. It is therefore impossible to transliterate
    automatically into Arabic script - even with an extensive dictionary
    there are potential ambiguities. The reverse direction is problematic
    mainly because Arabic script does not have a standardised way of marking
    all of the vowel distinctions made in Latin and Cyrillic, and anyway
    certain vowels are often not written at all.
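
    To give a feel for the size of the problem going into Arabic script,
    the following Python sketch enumerates the possible spellings of a
    string of Latin consonants. The table is an assumption on my part: it
    lists letter mergers of the kind commonly described for Persian-style
    orthographies, and the right table for any given language would
    differ. The function name candidate_spellings is invented for this
    example.

```python
from itertools import product

# Assumed, illustrative table: Latin consonants whose sound may correspond
# to several Arabic letters in loan words. A real table is language-specific.
CANDIDATES = {
    "s": ["\u0633", "\u0635", "\u062B"],            # SEEN, SAD, THEH
    "z": ["\u0632", "\u0630", "\u0636", "\u0638"],  # ZAIN, THAL, DAD, ZAH
    "t": ["\u062A", "\u0637"],                      # TEH, TAH
    "h": ["\u0647", "\u062D"],                      # HEH, HAH
}


def candidate_spellings(consonants: str) -> list[str]:
    """Return every Arabic-letter sequence the Latin consonants could stand for.

    Characters missing from CANDIDATES are passed through unchanged.
    """
    options = [CANDIDATES.get(ch, [ch]) for ch in consonants]
    return ["".join(combo) for combo in product(*options)]


if __name__ == "__main__":
    # The number of candidates is the product of the options per letter.
    for word in ("st", "szth"):
        print(word, len(candidate_spellings(word)))
```

    The count multiplies with every ambiguous letter, so a dictionary
    lookup is needed to narrow the list, and even then more than one
    candidate may be a real word.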

    >Because there is generally not a 1-to-1 character correspondence in spellings in
    >different transcriptions, I wonder if you don't end up having to consider
    >something that operates a bit like machine translation, analyzing the context
    >of words in cases where transcription of a word in one system could be
    >transliterated into something misspelled or taken as more than one word in the
    >other system. Necessarily, I think, such routines would have to be

    In my experience, if Arabic script is involved, they would certainly
    have to be language-specific, and to achieve a correct result they would
    also need to be rather intelligent, or rely on human intervention. As a
    very simple example, "Sudan" in Turkish or Azerbaijani can be the name
    of a country or it can mean "from water", and the correct Arabic
    spellings are likely to be very different, and can be disambiguated only
    by complete parsing of the context.

    Peter Kirk
