RE: Directionality Standard

From: Kent Karlsson (kent.karlsson14@comhem.se)
Date: Tue Dec 18 2007 - 16:27:45 CST


    Jony and Mark are right. However, I think there is some confusion here.
     
    The bidi algorithm specifies how to derive a *default* *top level* paragraph direction.
    The directions within the paragraph may of course still vary, e.g. for a Latin
    word inside an otherwise RTL paragraph.
     
    This is just a *default* top level paragraph direction, for plain text. And one may
    use a leading RLM or LRM invisible character to force the direction in pure plain
    text contexts.
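
    As a rough illustration of the default described above, here is a minimal
    sketch of a "first strong character" heuristic in Python. It is a
    simplification and not a full implementation of the bidi algorithm: it
    ignores isolates and embeddings, and the function name
    default_paragraph_direction is made up for this example. Only the standard
    unicodedata module is used.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, invisible, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, invisible, bidi class R


def default_paragraph_direction(text):
    """Guess 'ltr' or 'rtl' from the first strong character (simplified)."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "B":
            break  # paragraph separator: stop looking
    return "ltr"  # no strong character found: fall back to LTR
```

    Under this sketch, a paragraph that starts with a Latin word comes out as
    "ltr" even if the rest is Hebrew or Persian, while the same paragraph with
    a leading RLM comes out as "rtl", which is the plain-text override
    mentioned above.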
     
    However, when it is not just plain text, but some kind of "rich text", the top level
    paragraph direction is often set by other means (and the *default* direction is not
    computed). E.g. in HTML it is set by a dir="rtl" or dir="ltr" attribute. But some
    systems use a language tag (including a script subtag if present) to derive a top
    level paragraph direction from that (including RSS apparently, but I know of
    another system that does that too).
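
    To make the language-tag approach concrete, here is a hedged sketch of how
    such a system might map a tag to a top level direction. The tables
    RTL_SCRIPTS and DEFAULT_SCRIPT and the function direction_from_language_tag
    are hypothetical and deliberately tiny; a real system would rely on a
    complete script and language registry, and this sketch ignores
    grandfathered and private-use tags.

```python
# Illustrative, not exhaustive: a few scripts that are written right to left.
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa", "Nkoo"}

# Hypothetical fallback table: the script assumed when the tag has no script subtag.
DEFAULT_SCRIPT = {"ar": "Arab", "fa": "Arab", "ur": "Arab", "he": "Hebr", "en": "Latn"}


def direction_from_language_tag(tag):
    """Derive 'rtl' or 'ltr' from a language tag such as 'fa' or 'az-Arab-IR'."""
    subtags = tag.replace("_", "-").split("-")
    language = subtags[0].lower()
    script = None
    for sub in subtags[1:]:
        if len(sub) == 1:
            break  # singleton: extension or private-use section starts here
        if len(sub) == 4 and sub.isalpha():
            script = sub.title()  # script subtags are four letters, e.g. 'Arab'
            break
    if script is None:
        script = DEFAULT_SCRIPT.get(language)
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

    With these assumed tables, "fa" and "az-Arab-IR" give "rtl", while "fa-Latn"
    and "en-US" give "ltr". An explicit script subtag takes precedence over the
    language's assumed default script.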
     
        /kent k
     

      _____

    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Jony Rosenne
    Sent: Tuesday, December 18, 2007 9:58 PM
    To: unicode@unicode.org
    Subject: RE: Directionality Standard

    Since you are apparently writing RTL phrases within an LTR context, the default behavior suits your usage pattern. I commonly write RTL documents, and when a paragraph starts with a Latin letter the default behavior messes things up. So my colleagues and I need to override the default behavior.

     

    Jony

     

    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of pemuro
    Sent: Tuesday, December 18, 2007 7:44 PM
    To: Mark Davis
    Cc: Behnam; Stephane Bortzmeyer; (unicode@unicode.org) List; Otto Stolz; Magda Danish (Unicode)
    Subject: Re: Directionality Standard

     

    Interesting, Mark, that the bidi algorithm defines the default direction of a paragraph. I took bidi to mean text which uses both directionalities (within some text passage, NOT necessarily changing only from paragraph to paragraph).
    I like to collect etymological information on one word in one paragraph, often switching several times per line between R2L and L2R, and had no problems with correct line breaking (e.g. with a Semitic phrase embedded in an English text). I was not missing the RTL paragraph icon in OpenOffice (it exists now). In Word, switching such a multi-language paragraph's directionality made it unrecognizable. I thought these icons were a relic of pre-bidi/pre-Unicode times when some fonts did not contain automatic directionality.
    Only one irritant exists: entering the directionally neutral space after an RTL word makes the cursor jump to the right (when the paragraph is LTR) before I enter the next RTL letter, at which point it jumps back to the left. I consider this jumping premature! It would be more logical to decide only after the next symbol with directionality is entered.
    That is of course not regulated by Unicode, but I would much prefer it to be the default behavior.

    For my mnemonic multi-Latin++ (IPA, Sanskrit, ) keyboard layout and text examples, look at the directory pemuro.funpic.de

    pemuro

    Mark Davis wrote:

    There may be some misunderstanding. Unicode does define the default direction of a paragraph for use with the bidi algorithm (which determines the ordering of characters in text containing bidirectional scripts like Arabic or Hebrew).

    See http://unicode.org/reports/tr9/

    Mark

    On Dec 17, 2007 4:23 PM, Behnam <behnam.rassi@gmail.com> wrote:

    Thank you.

    So the answer is no. Unicode does not define the directionality of a paragraph. Then I guess my next question should be why?

    I think I have some explaining to do.

    Unicode defines a very complex bidi behaviour of characters, and it defines the beginning and ending of a paragraph (I assume). Yet, it doesn't define what directionality this paragraph should take to arrange these characters within the paragraph.

    Defining the directionality of a paragraph is more important than defining the language of a text. Yes, a language tag can help language-aware devices and applications behave accordingly. But a directionality definition is not about the 'user friendly' behaviour of a text; it is about reproducing the raw text as intended by its Unicode encoding.

    Understanding this issue, I suppose, may be very easy or very difficult, depending on the extent to which you have been exposed to the rtl experience. In the next paragraph, I write a Persian line with a couple of English words thrown in, set in left-to-right directionality, to give you an idea of what right-to-left users experience on an everyday basis.

    پرسش من از Unicode این است که چرا برای پاراگراف directionality تبیین نکرده است.

    In order to read the above phrase correctly in Persian, the order of words should be as I numbered below (from right to left):

    پرسش1 من2 از3 Unicode4 این5 است6 که7 چرا8 برای9 پاراگراف10 directionality11 تبیین12 نکرده13 است14.

     

    Of course I can set this paragraph in my application to "rtl" and, thanks to the wonders of the bidi behaviour of characters, everything will be put in place:

     

    پرسش من از Unicode این است که چرا برای پاراگراف directionality تبیین نکرده است.

     

    But I have absolutely no guarantee that my rtl text in an email, in a text message, in an online forum posting... will be received in an rtl setting. This perfectly Unicode-encoded text is at the mercy of applications, devices, mediums and platforms. And more likely than not, my rtl paragraph will be received in ltr and in the order that I numbered above! Even in more controlled situations such as word processors, as a friend of mine has experienced, this Persian phrase written in the rtl setting of Nisus on a Mac, exported in .doc format, and opened on a Windows platform will produce an rtl, but 'Arabic', document! Not only an Arabic-script document, which it is, but an Arabic-language document!

     

    You can experiment with this dilemma yourself. Set your application to rtl (which can be done in many applications) and write something in English or any Roman-script language. As long as the whole phrase is Roman, you only get a misplaced final period at the far left. But if you throw a couple of Hebrew words into the phrase, then you'll see what a wrong directionality setting can do to your English. Of course you are not exposed to this dilemma, because the default directionality of all computerized devices and applications is left to right. But it gives you an idea of what rtl users go through on an everyday basis.

     

    Again, this is not about requesting a convenience. It is about requesting Unicode to do what it set out to do. Unicode encodes the bidi behaviour of characters, the beginning of a paragraph, and the end of a paragraph. It must encode its directionality too.

     

    Behnam

     

     

    On 17-Dec-07, at 4:20 AM, Stephane Bortzmeyer wrote:

    On Sat, Dec 15, 2007 at 11:08:40AM -0500,
     Behnam <behnam.rassi@gmail.com> wrote
     a message of 78 lines which said:

    Is there any Unicode standard to identify a text? i.e. primary
    script>directionality>language?

    Not a Unicode standard but, yes, there is a standard to tag texts to
    indicate language, script, etc. It's RFC 4646. See
    http://www.langtag.net/ for a start.

     

    -- 
    Mark 
     
    


    This archive was generated by hypermail 2.1.5 : Tue Dec 18 2007 - 16:30:16 CST