Re: A simpler definition of the Bidi Algorithm

From: Michael D. Adams (mdmkolbe@gmail.com)
Date: Sun Oct 17 2010 - 09:01:00 CDT

    My apologies for taking so long to respond. I've been busy with conferences.

    If you don't like the regular expression syntax, the definitions can
    just as easily be expressed as English prose:

    * A EuropeanNumber is a sequence of one or more groups of one or more
    class EN characters. The groups are separated by a single class ES or
    CS character.

    * A SequenceOfEuropeanNumbers is a sequence of one or more
    EuropeanNumbers that are separated, preceded and followed by zero or
    more class ET characters.

    * An ArabicNumber is a sequence of one or more groups of one or more
    class AN characters. The groups are separated by a single class CS
    character.

    * A EuroArabicNumber is a sequence of one or more groups of one or
    more class EN or AN characters. The groups are separated by a single
    class CS character.
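
    To make the four definitions above concrete, here is a rough Python
    sketch of them as regular expressions. The one-letter codes (E, A, S,
    C, T) and the helper names encode, bidi_classes and is_a are my own
    assumptions for illustration; nothing here comes from the report or
    the reference implementations. Python's re module is used only for
    convenience, and the patterns stick to the plain regular operators.

```python
import re
import unicodedata

# Hypothetical one-letter codes for the bidi classes used in the definitions.
CODE = {"EN": "E", "AN": "A", "ES": "S", "CS": "C", "ET": "T"}


def encode(classes):
    """Map a list of bidi class names to a string; any other class becomes '?'."""
    return "".join(CODE.get(c, "?") for c in classes)


def bidi_classes(text):
    """Look up the bidi class of each character in the Unicode database."""
    return [unicodedata.bidirectional(ch) for ch in text]


# One or more groups of EN, separated by a single ES or CS.
EUROPEAN_NUMBER = r"E+(?:[SC]E+)*"
# One or more EuropeanNumbers separated, preceded and followed by ET*.
SEQUENCE_OF_EUROPEAN_NUMBERS = rf"T*{EUROPEAN_NUMBER}(?:T*{EUROPEAN_NUMBER})*T*"
# One or more groups of AN, separated by a single CS.
ARABIC_NUMBER = r"A+(?:CA+)*"
# One or more groups of EN or AN, separated by a single CS.
EURO_ARABIC_NUMBER = r"[EA]+(?:C[EA]+)*"


def is_a(pattern, classes):
    """True if the whole sequence of classes matches the given definition."""
    return re.fullmatch(pattern, encode(classes)) is not None
```

    For example, is_a(SEQUENCE_OF_EUROPEAN_NUMBERS, bidi_classes("$1,234.56"))
    holds, since the classes are ET EN CS EN EN EN CS EN EN.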

    Since the report claims that rules W2-7 are there so that the "text is
    next parsed for numbers," it only makes sense to give a grammar for
    what those numbers are, as defining them this way does. The existing
    definitions are not such a clear grammar.

    (Note: my previous e-mail had a typo. I should have said "(EN+ sep-by
    (ES|CS)) bracket-by ET*" not "((EN NSM)+ sep-by ((ES|CS)) bracket-by
    ET*". The stray NSM was an abortive attempt at including W1 with
    W2-7. That is possible, but I think it clutters up the core
    definition.)

    As to why using regular expressions is better, note that these regular
    expressions are not the perversions that Perl calls regular
    expressions, but rather the very well behaved regular expressions from
    theoretical computer science, and thus they lend themselves to very
    efficient, constant-space, single-pass implementations.
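
    As a rough illustration of the constant-space, single-pass point, here
    is a hand-written finite automaton for the EuropeanNumber definition
    only. The state names and the function is_european_number are
    hypothetical; this is a sketch of the general technique, not how any
    reference implementation is written.

```python
# States of a small deterministic automaton for EuropeanNumber.
START, IN_GROUP, AFTER_SEP, DEAD = range(4)

# Transition table; any (state, symbol) pair not listed leads to DEAD.
TRANSITIONS = {
    (START, "EN"): IN_GROUP,
    (IN_GROUP, "EN"): IN_GROUP,
    (IN_GROUP, "SEP"): AFTER_SEP,
    (AFTER_SEP, "EN"): IN_GROUP,
}


def is_european_number(classes):
    """Recognize EN+ sep-by (ES|CS) in one pass over any iterable of classes.

    Only the current state is kept, so the memory used does not depend on
    the length of the input.
    """
    state = START
    for c in classes:
        symbol = "SEP" if c in ("ES", "CS") else c
        state = TRANSITIONS.get((state, symbol), DEAD)
    # Only a run that ends inside a group of EN characters is accepted.
    return state == IN_GROUP
```

    Because the input is consumed once, left to right, it can just as
    well be a generator, e.g. is_european_number(iter(["EN", "CS", "EN"])).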

    In fact, I would posit that when phrased this way, it becomes easy to
    combine all the X, W, N and I rules into a single-pass algorithm that
    degenerates into the "test for right-to-left characters" optimization
    (mentioned in section 5.1) when there are no right-to-left characters.
    This is something that not even the C++ and Java reference
    implementations do (though it appears that the C++ implementation of
    the W rules was originally derived from a regular expression, as it
    uses state tables; if so, it is undocumented). (By the way, those
    implementations have not been proven to be equivalent; they have
    merely been tested. Proof is a much more complicated formalism.)

    On Fri, Sep 10, 2010 at 8:50 PM, Khaled Hosny <khaledhosny@eglug.org> wrote:
    > On Fri, Sep 10, 2010 at 05:00:21PM -0700, Asmus Freytag wrote:
    >> PS: Personally, I don't find the presentation in terms of the
    >> regular expressions any more intuitive than the original.
    >
    > Some people, when confronted with a problem, think "I know,
    > I'll use regular expressions." Now they have two problems.
    > --Jamie Zawinski
    >
    > --
    > Khaled Hosny
    > Arabic localiser and member of Arabeyes.org team
    > Free font developer
    >


