Re: Pashto yeh characters

Date: Wed Jul 28 2010 - 11:12:46 CDT


    Quoting Andreas Prilop:

    Hi Andreas,

    Thanks for the references to the old 7-bit and 8-bit Arabic character sets.


    I think these clearly show that alef maksura was the intention behind
    the dotless code point immediately preceding yeh, which later got
    incorporated into Unicode as U+0649.

    In terms of practice, Arabic-language documents are fairly consistent
    about using U+064A for yeh and U+0649 for alef maksura -- except in
    Egypt, which has a tradition of not distinguishing between alef
    maksura and yeh in final position (both are written without dots).
    Here's an arbitrary page from today's Al-Ahram newspaper, where both
    yeh and alef maksura are encoded as U+064A (the same holds for other
    pages of the site).

    On my computer this looks particularly jarring, because two dots are
    displayed on alef maksura in words like 'ila "to" and `ala "on". My
    locale is set to en_US; I wonder whether an Egyptian locale setting
    would cause U+064A to display without dots.

    Going back to my original question about Pashto, unfortunately I
    cannot use the advice you gave in your initial reply, "Use whatever
    you want." I am not creating Pashto documents for print or electronic
    distribution, but rather working on automated language-processing
    tasks. It seems that the only workable solution would be to unify all
    U+064A and U+06CC characters found in Pashto documents into a single
    character for processing (and also U+0649 if we encounter it). It is
    unfortunate that a distinction between the characters cannot be used
    for disambiguating unvocalized Pashto text, but this appears to be the
    current state of affairs.
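
    For illustration, the unification described above might be sketched
    in Python roughly as follows. This is only a minimal sketch: the
    names YEH_VARIANTS, CANONICAL_YEH and fold_yeh are hypothetical, and
    the choice of U+06CC as the target is arbitrary, since any one of the
    three code points would serve equally well for processing.

```python
# Hypothetical sketch: fold the yeh-like code points discussed in this
# message into a single code point before further language processing.

# The three code points named in the message, with their Unicode names.
YEH_VARIANTS = {
    "\u0649": "ARABIC LETTER ALEF MAKSURA",
    "\u064A": "ARABIC LETTER YEH",
    "\u06CC": "ARABIC LETTER FARSI YEH",
}

# Assumption: the target is an arbitrary choice made for this sketch.
CANONICAL_YEH = "\u06CC"

# Translation table mapping every variant to the canonical code point.
_FOLD_TABLE = str.maketrans({ch: CANONICAL_YEH for ch in YEH_VARIANTS})


def fold_yeh(text: str) -> str:
    """Return text with U+0649, U+064A and U+06CC unified to one code point."""
    return text.translate(_FOLD_TABLE)
```

    Only the three code points listed are touched; every other character
    passes through unchanged. The folding is lossy, so it would be applied
    to a working copy for processing, not to the stored source text.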

