Re: kurdish sorani

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 30 2006 - 22:44:10 CDT


    ----- Original Message -----
    From: "John Hudson" <john@tiro.ca>
    Cc: <unicode@unicode.org>
    Sent: Wednesday, August 30, 2006 8:07 PM
    Subject: Re: kurdish sorani

    > Philippe Verdy wrote:
    >
    >> * U+06BE is rendered incorrectly with "Times New Roman" and "Arial" (3 shapes, effectively the final form should not be distinct from the medial form, although the isolated form correctly takes the shape of the initial form) and with "Arial Unicode MS" (1 shape: only the initial form); it is correct only for "Microsoft Sans Serif" and "Tahoma" (2 shapes: the initial form also used for the isolated case, and the medial form also used for the final case).
    >
    > The rendering with Arial Unicode MS (1 shape: only the initial form) is not incorrect. As
    > I've been saying: the shaping of the do chashmi he character varies according to the style
    > of writing or type used, and repetition of the single form is not only not incorrect it is
    > the norm for nasta'liq, i.e. the style in which the Urdu language is most often written.

    Not incorrect? Possibly for Arabic only, where the forms are considered equivalent; but as seen here, these forms are not equivalent in other languages. For "Arial Unicode MS", which is intended to be used with a very wide range of languages, restricting the Arabic script to the Arabic language looks like a bad choice. It would have been better to stay consistent with the other Windows core fonts, i.e. to have Arial Unicode MS behave like Arial, especially since it is expected to offer extended coverage compared with Arial alone. Here, in fact, it just reduces the coverage of languages written in the Arabic script.

    My opinion is that the defect in Arial Unicode MS for U+06BE is something that was forgotten in the design, due to the already large number of glyphs and contextual shaping rules embedded in that font. Comparing Arial Unicode MS (which was made for Office applications like Word, and so for correct typesetting of a larger number of languages) with Arial alone (made for the OS interface in a single localization) clearly indicates that this is a bug (actually, Arial is not correct for U+06BE either).

    Where is there a reference in OpenType that correctly describes the shaping behavior for the various languages using the Arabic script? It looks like U+06BE was not covered in these specs, which may explain why it was forgotten; in any case, there is certainly a lack of agreement and specification for any language other than Arabic.

    All this suggests that the OpenType working group should really work on this script and immediately start a survey of the actual coverage of all languages using it. It is clear from these discussions that only the Arabic language has been seriously considered, and I fear we will learn that other issues remain undetected, even for languages that we know are (or were) written with the Arabic script (and there are many...).

    Has anyone studied, for example, the shaping behavior for all Central Asian languages, or for the many Indo-Iranian languages other than Farsi? I'm sure you'll discover many language-specific adaptations of the script that account for their unique phonetic/semantic distinctions. And given that the Arabic script was spread through the Islamic religion, it covers a much wider area than just the Middle East, and you'll find other subtle things throughout Africa and up to China and Indonesia, or in some minorities of India (which has hundreds of languages)...

    How, then, can the shaping properties be made so normative without Unicode providing ways to override this default joining behavior? Wasn't Unicode expected to encode characters and not specific languages? Depending on font capabilities and specificities is certainly the worst way to solve the problem, because this is definitely not a font design issue but really a semantic issue that must be solved in the encoding itself.
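    To make "default joining behavior" concrete, here is a rough Python sketch of how a positional form can be derived from the joining types alone. The joining types are hand-copied from Unicode's ArabicShaping.txt for a handful of characters; the function names are mine, transparent characters (combining marks) are ignored, and real shaping engines do considerably more than this.

```python
# Joining types hand-copied from ArabicShaping.txt for a few characters only;
# a real implementation would load the whole data file.
JOINING_TYPE = {
    "\u0627": "R",  # ARABIC LETTER ALEF (right-joining)
    "\u0628": "D",  # ARABIC LETTER BEH (dual-joining)
    "\u0648": "R",  # ARABIC LETTER WAW (right-joining)
    "\u06BE": "D",  # ARABIC LETTER HEH DOACHASHMEE (dual-joining)
    "\u200C": "U",  # ZERO WIDTH NON-JOINER (non-joining)
    "\u200D": "C",  # ZERO WIDTH JOINER (join-causing)
}


def joining_type(ch):
    """Return the joining type, treating unknown characters as non-joining."""
    return JOINING_TYPE.get(ch, "U")


def default_forms(text):
    """Return the default positional form of each character, in logical order.

    Simplification (an assumption of this sketch): transparent characters
    are not skipped, so combining marks would break the joining here.
    """
    forms = []
    for i, ch in enumerate(text):
        t = joining_type(ch)
        prev_t = joining_type(text[i - 1]) if i > 0 else "U"
        next_t = joining_type(text[i + 1]) if i + 1 < len(text) else "U"
        # A letter joins to the preceding character if it can join on its
        # right side and the preceding character can join on its left side.
        joins_prev = t in ("D", "R") and prev_t in ("D", "L", "C")
        # Symmetrically for the following character.
        joins_next = t in ("D", "L") and next_t in ("D", "R", "C")
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# BEH + HEH DOACHASHMEE + ALEF
print(default_forms("\u0628\u06BE\u0627"))
```

    Nothing in this computation depends on the language of the text, which is the point being made above: the form is decided entirely by the neighbors' joining types, and which glyph is drawn for each form is then left to the font.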

    So all I can say is that Unicode should provide standard override mechanisms for the current (default) joining properties of Arabic characters (and possibly of other scripts that have similar shaping/joining behavior). For now, ZWNJ and ZWJ are clearly not enough, because they affect the joining behavior of the two characters on either side of them, assuming that both will either join together or both be disjoined.
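    For reference, this is roughly how ZWJ (U+200D) and ZWNJ (U+200C) are used today to ask for a given positional form of a single letter. It is a small Python sketch; the helper name is mine, and whether a particular font or renderer actually shows distinct shapes for U+06BE is not something the code can guarantee.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def request_form(letter, form):
    """Wrap a dual-joining letter so that a renderer is asked for one form.

    ZWJ placed before the letter (in logical order) asks it to join to what
    precedes it; ZWJ placed after asks it to join to what follows.
    """
    if form == "isolated":
        return ZWNJ + letter + ZWNJ
    if form == "initial":
        return ZWNJ + letter + ZWJ
    if form == "medial":
        return ZWJ + letter + ZWJ
    if form == "final":
        return ZWJ + letter + ZWNJ
    raise ValueError("unknown form: " + form)


HEH_DOACHASHMEE = "\u06BE"
for name in ("isolated", "initial", "medial", "final"):
    s = request_form(HEH_DOACHASHMEE, name)
    print(name, " ".join("U+%04X" % ord(c) for c in s))
```

    Note that a ZWJ or ZWNJ inserted between two real letters acts on both of them at once: it either joins the pair or disjoins the pair. That is why these two controls alone cannot express "this letter takes its joined form, but its neighbor does not".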

    This looks quite similar to the shaping mechanism of Indic scripts (notably the consonant clusters controlled with a virama, which also has semantic and phonetic importance). And given that the Arabic script covers a large region where Indic scripts are used, I think that you'll find adaptations of the concept of virama in languages converted to the Arabic script, as a natural evolution, which also occurred across various Indic scripts (when attempting to adapt Sanskrit and Hindi to various scripts, or when adding new foreign phonemes such as those from English, or mapping tones of East Asian languages).

    Consider the many adaptations of the Latin script to cover lots of languages; many were added to Unicode. Why wouldn't the same have happened with the Arabic script? So why not consider adding codes to resolve the relative shaping ambiguity/freedom of the Arabic language using that script?


