Arabic Presentation Forms-A

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Dec 17 2003 - 10:08:57 EST


    I was validating some internal string processing and found these
    intriguing decompositions for Arabic Presentation Forms-A. I was surprised
    to see that they are compatibility-decomposed (as isolated forms) into rows
    read from bottom to top, a different order from the normal Arabic reading
    order for rows, but of course with the same right-to-left reading order:

    #code;cc;nfd;nfkdFolded; # CHAR?; NFD?; NFKDFOLDED?;
    # RIAL SIGN
    fdfc;;;<isolated> 0631 06cc 0627 0644; # ??; ?; ?????;
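
    For readers who want to reproduce the row above, here is a minimal sketch
    using Python's standard unicodedata module. It assumes that plain NFKD is
    close enough to the "nfkdFolded" column of my listing (the folding step of
    my own tool is not reproduced), and the variable names are only
    illustrative.

```python
import unicodedata

RIAL = "\ufdfc"  # U+FDFC RIAL SIGN

# Raw decomposition field from the Unicode Character Database,
# including the <isolated> compatibility tag.
mapping = unicodedata.decomposition(RIAL)

# NFD applies canonical mappings only, so a character that has only a
# compatibility mapping is left unchanged.
nfd = unicodedata.normalize("NFD", RIAL)

# NFKD also applies compatibility mappings.
nfkd = unicodedata.normalize("NFKD", RIAL)
nfkd_hex = " ".join(f"{ord(ch):04x}" for ch in nfkd)

print(mapping)
print(nfd == RIAL)
print(nfkd_hex)
```

    The empty "nfd" column in the listing corresponds to NFD leaving the
    character as it is; only the compatibility form expands it into letters.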

    The "Arial Unicode MS" font does not have a glyph for the Rial currency sign,
    so I won't comment much on it, even though it is a special ligature of its
    component letters:
    - the medial form of U+06CC ARABIC LETTER FARSI YEH (ی) is shown on the
    charts only as two dots (and not with its "Arabic letter alef maksura" base
    form, as the comment in the Arabic chart suggests for Arabic letter yeh),
    located below and to the left of the medial form of U+0627 (ا),
    - and the initial form of U+0631 (ر) kerns below its next two
    characters (sometimes with an additional kashida below its next three
    characters).
    However, the general layout is still one row, so the decomposition seems
    quite reasonable; it is just regrettable that the glyph is not found in
    Arial Unicode MS (unless this Rial sign is traditional and no longer in
    actual use today).

    I'm not sure that the compatibility decomposition gives the accurate form
    for rendering the traditional glyph coded for the currency symbol...

    ------------------

    Now I have this one:

    #code;name;cc;
    # nfd;nfkdFolded;
    # #CHAR?; NFD?; NFKDFOLDED?;
    FDFA;ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM;0;
            FDFA;<isolated> 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639
    0644 064a 0647 0020 0648 0633 0644 0645;
            # ??; ??; ??? ???? ???? ?????;

    #code;name;cc;
    # nfd;nfkdFolded;
    # #CHAR?; NFD?; NFKDFOLDED?;
    FDFB;ARABIC LIGATURE JALLAJALALOUHOU;0;
            FDFB;<isolated> 062c 0644 0020 062c 0644 0627 0644 0647;
            # ??; ??; ?? ??????;

    I note that the Unicode charts show them in their complex and highly
    ligated form, which corresponds to the Arabic tradition in the Quran. This
    is apparently not implemented in Microsoft fonts, which render only the
    first two on just 2 bottom-to-top rows.

    The compatibility decomposition creates 4 space-separated words WORD1,
    WORD2, WORD3, WORD4 that would be rendered normally either in one row as:
            WORD4 WORD3 WORD2 WORD1
    i.e.
            ??? ???? ???? ?????
    or on multiple narrow rows as:
            WORD1      or      WORD2 WORD1
            WORD2              WORD4 WORD3
            WORD3
            WORD4
    i.e.
            ???        or      ??? ????
            ????               ???? ?????
            ????
            ?????
    using the top-to-bottom normal layout of plain-text rows in Arabic.
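
    The word split described above can be checked with a short sketch, again
    with Python's unicodedata module. The helper name decomposed_words is
    hypothetical, and splitting on U+0020 is simply my reading of the
    decomposition, not something the standard prescribes.

```python
import unicodedata

def decomposed_words(ligature):
    """Split the NFKD form of a ligature character on U+0020 SPACE."""
    return unicodedata.normalize("NFKD", ligature).split(" ")

# U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
words = decomposed_words("\ufdfa")
for number, word in enumerate(words, 1):
    # Print code points rather than glyphs, so that the logical
    # (memory) order is visible regardless of bidi display.
    print(f"WORD{number}:", " ".join(f"{ord(ch):04x}" for ch in word))

# Bidirectional class of the ligature itself: "AL" (Arabic letter),
# which is why the words lay out right to left on a single row.
bidi = unicodedata.bidirectional("\ufdfa")
print(bidi)
```

    The same helper applied to U+FDFB gives the two words discussed further
    down.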

    I can understand that it is difficult to make them fit more ideally, like
    this (with kashidas marked by underscores):
            WORD2
            _______WORD1
            W_______ORD3
            W___OR____D4
    i.e. actually this order:
            ????
                    ???
            ????
            ?????

    to better match the actual glyph in the charts, which also uses kashidas,
    given the height constraints in fonts and the difficulty of creating the
    traditional complex kerning between rows, but the current presentation of
    the alternate glyph chosen in Arial Unicode MS does not seem intuitive.
    Isn't there some requirement in Unicode not to change the common layout,
    which is part of the character identity and structural for the script? No
    such interpretation problem occurs in the presentation of U+FDFB (which
    also has two rows in the representative glyph of the Arabic Presentation
    Forms-A charts). Is there an error here?

    ---------------------------

    Now with this one:

    #code;name;cc;
    # nfd;nfkdFolded;
    # #CHAR?; NFD?; NFKDFOLDED?;
    FDFB;ARABIC LIGATURE JALLAJALALOUHOU;0;
            FDFB;<isolated> 062c 0644 0020 062c 0644 0627 0644 0647;
            # ??; ??; ?? ??????;

    The decomposition into WORD1 WORD2 follows the same principles but is less
    complex, and it uses this layout:
            WORD2 WORD1
    or:
            WORD1
            WORD2
    The second layout is used in Arial Unicode MS to render the ligature.

    ---------------------------

    Now I don't know why the last, very complex but marvelous ligature U+FDFD in
    Unicode does not have a compatibility decomposition. In fact I can't clearly
    decipher which Arabic letters the ligature corresponds to (this is not
    documented in Unicode, except through its English name, which is probably
    too far from the Arabic name to allow this identification).

    More generally, my question is about modifying the layout of ligature
    glyphs in fonts: are such modifications allowed, and how could the
    ligatures acceptably be represented when the plain-text character is not
    compatibility-decomposed but rendered with a single glyph?
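
    The missing mapping is easy to confirm. The sketch below lists the
    characters discussed in this message with their decomposition field; the
    range U+FDFA..U+FDFD is chosen only because those are the code points
    mentioned here, and has_mapping is just an illustrative name.

```python
import unicodedata

# Code points discussed in this message: U+FDFA through U+FDFD.
has_mapping = {}
for cp in range(0xFDFA, 0xFDFE):
    ch = chr(cp)
    mapping = unicodedata.decomposition(ch)  # empty string if there is none
    has_mapping[cp] = bool(mapping)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {mapping or '(none)'}")
```

    Only U+FDFD comes back with an empty field, so NFKD leaves it as a single
    character.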





