Re: Pashto yeh characters

From: Roozbeh Pournader
Date: Fri Oct 01 2010 - 16:55:04 CDT

    This is a rather late reply, but I think this document should be useful:

    The first few pages discuss the various Yeh forms, recommend which ones
    to use, and advise against using some of them in certain forms.


    On Thu, 2010-07-22 at 12:17 -0500, wrote:
    > Hi,
    > This is a query I had originally sent to the Linguist List, modified
    > based on feedback I got there. I am hoping that someone in the Unicode
    > community can help resolve this.
    > I'm interested in knowing if there is a standard way to encode the
    > various Pashto yeh-characters in Unicode, and if so, what it is. This
    > question is a bit more complicated than it sounds, so here's the
    > background.
    > Pashto is written using a derivative of the Arabic script. The Arabic
    > language uses a single character for both /j/ and /i:/ sounds. Like
    > many Arabic characters, this one is composed of a base form (which
    > changes shape based on its position in a word) and dots (in this case,
    > two dots below the base form). In most of the Arabic-speaking world
    > the dots are present with both the medial and final form, though in
    > Egypt (and possibly other places) the convention is to have two dots
    > on the medial form but leave them off the final form. The standard
    > arrangement of the two dots is horizontal, but they can be placed
    > vertically or diagonally with no change in meaning.
    > Persian also uses a single character for /j/ and /i:/, with the
    > convention of two dots on the medial form, no dots on the final form
    > (same as in Egypt).
    > The two conventions for the /j/-/i:/ character were given distinct
    > code points in Unicode despite the fact that they do not contrast;
    > documentation is scarce, but presumably this was done in order to
    > allow writing both Arabic and Persian in the same document. Therefore,
    > Unicode has the following code points (I'm not giving the names, but
    > rather the typical visual representation of the glyphs and typical use).
    > U+064A two dots medially and finally (/j/-/i:/ Arabic convention)
    > U+06CC two dots medially, none finally (/j/-/i:/ Persian convention)
    > There are a few additional yeh-base code points defined, some of which
    > are relevant to Pashto (see below).
    > U+0649 no dots medially or finally (Arabic /a/ from etymological /j/)
    > U+0626 hamza above medially and finally (Arabic glottal stop in
    > certain contexts)
    > U+06D0 two dots medially and finally in vertical arrangement
    > U+06CD tail and no dots in final position
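For reference, the formal Unicode character names behind the six code points listed above can be looked up with Python's standard `unicodedata` module. This is only an illustrative sketch: the list name `YEH_CODE_POINTS` and the helper `describe` are made up here and do not come from the original message.

```python
import unicodedata

# The six yeh-base code points mentioned in the message, in the order listed.
YEH_CODE_POINTS = [0x064A, 0x06CC, 0x0649, 0x0626, 0x06D0, 0x06CD]


def describe(code_point):
    """Return 'U+XXXX NAME' for a code point, using the Unicode character database."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


if __name__ == "__main__":
    for cp in YEH_CODE_POINTS:
        print(describe(cp))
```

The names describe the characters' identity in the standard, not how they are used in any particular language, so they complement rather than replace the usage notes above.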
    > As it so happens, there is much confusion in how these characters are
    > used in actual electronic documents, which is not surprising given
    > that U+06CC looks like U+064A in medial position but like U+0649 in
    > final position. There is an excellent article by Jonathan Kew that
    > sorts out what this means for various languages that use derivatives
    > of the Arabic script.
    > Unfortunately, this article does not discuss Pashto. I have little
    > knowledge of the language, but here's what I managed to understand
    > from the inspection of a few documents and with the help of friendly
    > people on the Linguist List (and please correct me if I'm wrong).
    > Traditionally, Pashto used a single character with the same convention
    > as in Persian, of two dots in the medial form and none on the final
    > form, and with no significance attached to the visual arrangement of
    > the dots. The character was three-way ambiguous between the sounds /j/,
    > /i:/ and /e/. In recent decades (probably since the 1970s or 1980s)
    > there has been some differentiation, partly due to changes in the
    > typesetting process and partly due to a deliberate effort of the
    > Pashto Academy at the University of Peshawar, Pakistan.
    > One convention that has gained fairly wide acceptance is a distinction
    > between a horizontal arrangement of the dots, representing /j/ or /i:/
    > as in Arabic and Persian, and a vertical arrangement representing the
    > sound /e/. This distinction is the same as in Uighur, and the
    > character with vertical dots has been codified as U+06D0. Additional
    > conventions include a hamza (U+0626) or tail (U+06CD) to represent /j/
    > at the end of a word in certain grammatical markers. All of these are
    > quite standard by now and do not pose much of a problem.
    > However, a further convention appears to have arisen, which as far as
    > I can tell is unique to Pashto in that it distinguishes between /j/
    > and /i:/ (though only in word-final position):
    > /j/ is written with two dots medially, none finally
    > /i:/ is written with two dots both medially and finally
    > I have never seen this codified explicitly, but this is the impression
    > I get from examining a few recent Pashto documents. Which brings me to
    > my original question, of how to represent these characters in Unicode.
    > The linguist in me notices a correspondence between sounds and Unicode
    > code points (which, given the history I have just described, is most
    > certainly accidental):
    > /j/ corresponds to U+06CC
    > /i:/ corresponds to U+064A
    > The Wikipedia article on the Pashto alphabet gives a different
    > correspondence, based on visual appearance:
    > forms with dots: U+064A (/i:/ and /j/ medially, /i:/ finally)
    > forms without dots: U+0649 (only /j/ in word-final position)
    > And there is yet a third convention, which I encountered in an
    > electronic lexicon and also appears in the following document:
    > U+06CC: medial forms with dots (/i:/ and /j/) and dotless final form (/j/)
    > U+064A: final form with dots (/i:/)
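The three correspondences described above can be put side by side as a small lookup function. This is a sketch under stated assumptions, not an established standard: the convention labels `"phonemic"`, `"visual"` and `"lexicon"`, the function name `encode_yeh`, and the restriction to medial and final positions are all invented here for illustration, and only the /j/ and /i:/ sounds are covered.

```python
FARSI_YEH = "\u06CC"     # two dots medially, none finally
ARABIC_YEH = "\u064A"    # two dots medially and finally
ALEF_MAKSURA = "\u0649"  # no dots medially or finally


def encode_yeh(sound, position, convention):
    """Pick a code point for /j/ or /i:/ under one of the three conventions.

    sound:      "j" or "i:"
    position:   "medial" or "final"
    convention: "phonemic", "visual" or "lexicon" (hypothetical labels)
    """
    if sound not in ("j", "i:"):
        raise ValueError(f"unsupported sound: {sound!r}")
    if position not in ("medial", "final"):
        raise ValueError(f"unsupported position: {position!r}")

    is_final_j = sound == "j" and position == "final"
    is_final_i = sound == "i:" and position == "final"

    if convention == "phonemic":
        # One code point per sound, regardless of position.
        return FARSI_YEH if sound == "j" else ARABIC_YEH
    if convention == "visual":
        # Dotted forms are U+064A; the dotless final form is U+0649.
        return ALEF_MAKSURA if is_final_j else ARABIC_YEH
    if convention == "lexicon":
        # U+06CC everywhere except the dotted final form.
        return ARABIC_YEH if is_final_i else FARSI_YEH
    raise ValueError(f"unknown convention: {convention!r}")


if __name__ == "__main__":
    for sound in ("j", "i:"):
        for position in ("medial", "final"):
            codes = [
                f"U+{ord(encode_yeh(sound, position, c)):04X}"
                for c in ("phonemic", "visual", "lexicon")
            ]
            print(f"/{sound}/ {position}: {', '.join(codes)}")
```

Laid out this way, the only case on which all three conventions agree is word-final /i:/, which each of them assigns to U+064A. Every other combination differs under at least one convention, although the intended rendered shape is the same in each case.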
    > To wrap up, are my observations about the Pashto writing conventions
    > correct? And is there a standard for assigning the Pashto characters
    > representing /j/ and /i:/ to Unicode code points?
    > -Ron.
