A simpler definition of the Bidi Algorithm

From: Michael D. Adams (mdmkolbe@gmail.com)
Date: Fri Sep 10 2010 - 18:07:18 CDT

    Rules W2 through W7 for the Bidi Algorithm
    [http://www.unicode.org/reports/tr9/] are rather confusing to read.
    They are not confusing as to what to do, but as to why it is done
    and how to implement it efficiently. After many hours puzzling over
    them, I think I've found a simpler way to define them. Is the
    following definition equivalent to the specification's rules? If so,
    why isn't the Bidi Algorithm defined using this simpler specification?

    My simpler specification is as follows:

    Assume standard regular expression syntax where infix "|" is
    alternation, suffix "+" is one or more repetitions, and suffix "*" is
    zero or more repetitions. Let "X sep-by Y" be shorthand for one or
    more "X" separated by "Y" (i.e. "X (Y X)*"). Let "X bracket-by Y" be
    shorthand for one or more "X" separated and surrounded by "Y" (i.e. "Y
    (X sep-by Y) Y" or "Y (X Y)+"). Upper case names represent the
    bidi_class of a character.

    Define a SequenceOfEuropeanNumbers to be a maximally long contiguous
    sequence of characters that match "((EN NSM)+ sep-by (ES|CS))
    bracket-by ET*".

    Define an ArabicNumber to be a maximally long contiguous sequence of
    characters that match "AN+ sep-by CS".

    Define a EuroArabicNumber to be a maximally long contiguous sequence
    of characters that match "(AN|EN)+ sep-by CS".
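
    The shorthands and the three definitions above can be sketched as
    ordinary Python regular expressions. This is only an illustration:
    the helper names (tok, sep_by, bracket_by, plus, star, alt, encode)
    and the encoding of each bidi class as its name followed by a space
    are assumptions made for the sketch, not part of the proposal. The
    SequenceOfEuropeanNumbers pattern is read as "((EN NSM)+ sep-by
    (ES|CS)) bracket-by ET*", and "(EN NSM)+" is taken literally, so
    every EN must be followed by exactly one NSM.

```python
import re

def tok(name):
    # Hypothetical encoding: one bidi class is its name plus a space, e.g. "EN ".
    return f"(?:{name} )"

def plus(x):
    return f"(?:{x})+"

def star(x):
    return f"(?:{x})*"

def alt(*xs):
    return "(?:" + "|".join(xs) + ")"

def sep_by(x, y):
    # "X sep-by Y" is X (Y X)*
    return f"(?:{x}(?:{y}{x})*)"

def bracket_by(x, y):
    # "X bracket-by Y" is Y (X Y)+
    return f"(?:{y}(?:{x}{y})+)"

EN, AN, ES, CS, ET, NSM = (tok(n) for n in ("EN", "AN", "ES", "CS", "ET", "NSM"))

# ((EN NSM)+ sep-by (ES|CS)) bracket-by ET*, with "(EN NSM)+" taken literally.
SEQUENCE_OF_EUROPEAN_NUMBERS = re.compile(
    bracket_by(sep_by(plus(EN + NSM), alt(ES, CS)), star(ET)))
# AN+ sep-by CS
ARABIC_NUMBER = re.compile(sep_by(plus(AN), CS))
# (AN|EN)+ sep-by CS
EURO_ARABIC_NUMBER = re.compile(sep_by(plus(alt(AN, EN)), CS))

def encode(classes):
    # Turn a list of bidi class names into the string form the patterns expect.
    return "".join(c + " " for c in classes)
```

    For example, ARABIC_NUMBER.fullmatch(encode(["AN", "CS", "AN"]))
    succeeds, while a trailing separator as in ["AN", "CS"] does not
    match as a whole. If "EN NSM*" was intended instead of "(EN NSM)+",
    only the plus(EN + NSM) term would need to change.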

    Between each strong character (AL,L,R,sor) and the next strong
    character (AL,L,R,eor):
      If the leading strong character is L then:
        (1) change the class of all characters in a SequenceOfEuropeanNumbers to L
        (2) change the class of all characters in an ArabicNumber to AN
        (3) change all other characters to ON.
      If the leading strong character is R then:
        (1) change the class of all characters in a SequenceOfEuropeanNumbers to EN
        (2) change the class of all characters in an ArabicNumber to AN
        (3) change all other characters to ON.
      If the leading strong character is AL then:
        (1) change the class of all characters in a EuroArabicNumber to AN
        (2) change all other characters to ON.
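
    The per-run rules above could be implemented roughly as follows.
    This is a minimal sketch under several assumptions: the names
    (CODE, RULES, resolve_weak) and the one-letter encoding of bidi
    classes are invented for the example; Python's greedy, leftmost
    regex matching stands in for "maximally long"; the
    SequenceOfEuropeanNumbers pattern is read as "((EN NSM)+ sep-by
    (ES|CS)) bracket-by ET*" with "(EN NSM)+" taken literally; and
    "all other characters" is taken to mean every non-strong character
    in the run, whatever its class.

```python
import re

STRONG = ("L", "R", "AL")

# Hypothetical one-letter encoding; any other non-strong class becomes "o".
CODE = {"EN": "e", "AN": "a", "ES": "s", "CS": "c", "ET": "t", "NSM": "n"}

# The three definitions, written over the one-letter encoding.
SEQ_EN = r"t*(?:(?:en)+(?:[sc](?:en)+)*t*)+"   # SequenceOfEuropeanNumbers
ARABIC = r"a+(?:ca+)*"                          # ArabicNumber
EURO_ARABIC = r"[ae]+(?:c[ae]+)*"               # EuroArabicNumber

# For each leading strong class: (pattern, class assigned to matched characters).
RULES = {
    "L": [(SEQ_EN, "L"), (ARABIC, "AN")],
    "R": [(SEQ_EN, "EN"), (ARABIC, "AN")],
    "AL": [(EURO_ARABIC, "AN")],
}

def resolve_weak(classes, sor="L"):
    """Reclassify the characters between strong characters.

    classes: list of bidi class names for one level run.
    sor: the start-of-run direction, "L" or "R".
    """
    out = list(classes)
    strong, i, n = sor, 0, len(classes)
    while i < n:
        if classes[i] in STRONG:
            strong = classes[i]          # new leading strong character
            i += 1
            continue
        j = i
        while j < n and classes[j] not in STRONG:
            j += 1                       # [i, j) is the run up to the next strong or eor
        segment = "".join(CODE.get(c, "o") for c in classes[i:j])
        out[i:j] = ["ON"] * (j - i)      # default: everything else becomes ON
        for pattern, new_class in RULES[strong]:
            for m in re.finditer(pattern, segment):
                out[i + m.start():i + m.end()] = [new_class] * (m.end() - m.start())
        i = j
    return out
```

    As a usage example, resolve_weak(["AL", "EN", "CS", "AN", "WS"])
    returns ["AL", "AN", "AN", "AN", "ON"]: the EuroArabicNumber
    "EN CS AN" becomes AN and the trailing WS becomes ON. Because one
    letter stands for one character, match offsets map directly back to
    list indices.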

    At this point all AL characters can be changed to R, and processing
    can resume with the normal N1 and N2 rules.

    I believe specifying things this way is more intuitive than the
    existing way and will make the rules easier for implementers to
    implement properly and efficiently. Am I wrong? Is there a good
    reason W2 through W7 are the way they are? If not, can they be
    changed to this simpler specification?

    Michael D. Adams


