From: Michael D. Adams (email@example.com)
Date: Sun Oct 17 2010 - 09:01:00 CDT
My apologies for taking so long to respond. I've been busy with conferences.
If you don't like the regular expression syntax, then the definitions
can just as easily be expressed as English prose:
* A EuropeanNumber is a sequence of one or more groups of one or more
class EN characters. The groups are separated by a single class ES or
CS character.
* A SequenceOfEuropeanNumbers is a sequence of one or more
EuropeanNumbers that are separated, preceded and followed by zero or
more class ET characters.
* An ArabicNumber is a sequence of one or more groups of one or more
class AN characters. The groups are separated by a single class CS
character.
* A EuroArabicNumber is a sequence of one or more groups of one or
more class EN or AN characters. The groups are separated by a single
class CS character.
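
For anyone who would rather see the four definitions above written out
mechanically, here is a minimal sketch in Python. It is only an
illustration, not anything taken from the report or the reference
implementations: the helper names (cls, sep_by, encode, matches) and the
"XX;" token encoding are made up for this example, and the separators for
EuropeanNumber are assumed to be ES or CS, as in the "(EN+ sep-by (ES|CS))"
form. Python's re module is a backtracking engine, so it is used here only
to check membership, not to show efficiency.

```python
import re

def cls(*names):
    # Match exactly one token whose bidi class is one of `names`.
    # Tokens are encoded as the class name followed by ";" (hypothetical encoding).
    return "(?:(?:%s);)" % "|".join(names)

def sep_by(item, sep):
    # One or more `item`, with a single `sep` between neighbours.
    return "(?:%s(?:%s%s)*)" % (item, sep, item)

ET_RUN = cls("ET") + "*"

# EuropeanNumber: groups of EN+ separated by a single ES or CS.
EUROPEAN_NUMBER = sep_by(cls("EN") + "+", cls("ES", "CS"))

# SequenceOfEuropeanNumbers: EuropeanNumbers separated, preceded and
# followed by zero or more ET.
SEQUENCE_OF_EUROPEAN_NUMBERS = "(?:%s%s(?:%s%s)*%s)" % (
    ET_RUN, EUROPEAN_NUMBER, ET_RUN, EUROPEAN_NUMBER, ET_RUN)

# ArabicNumber: groups of AN+ separated by a single CS.
ARABIC_NUMBER = sep_by(cls("AN") + "+", cls("CS"))

# EuroArabicNumber: groups of (EN|AN)+ separated by a single CS.
EURO_ARABIC_NUMBER = sep_by(cls("EN", "AN") + "+", cls("CS"))

def encode(classes):
    # Turn a list such as ["EN", "CS", "EN"] into "EN;CS;EN;".
    return "".join(c + ";" for c in classes)

def matches(pattern, classes):
    # True if the whole sequence of classes is in the language of `pattern`.
    return re.fullmatch(pattern, encode(classes)) is not None
```

For example, matches(EUROPEAN_NUMBER, ["EN", "ES", "EN"]) accepts, while
two separators in a row, as in ["EN", "ES", "ES", "EN"], do not.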
Since the report claims that rules W2-7 are there so the "text is next
parsed for numbers," it only makes sense to give a grammar for what
those numbers are, as defining them this way does. The existing
definitions are not such a clear grammar.
(Note, my previous e-mail had a typo. I should have said "(EN+ sep-by
(ES|CS)) bracket-by ET*" not "((EN NSM)+ sep-by ((ES|CS)) bracket-by
ET*". The stray NSM was an abortive attempt at including W1 with
W2-7. It is possible, but I think it clutters up the core
As to why using regular expressions is better, note that these regular
expressions are not the perversions that Perl calls regular
expressions, but rather the very well behaved regular expressions from
theoretical computer science and thus lend themselves to very
efficient, constant-space, single-pass implementations.
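
To make the constant-space, single-pass point concrete, here is a minimal
sketch of a hand-written finite automaton for just the EuropeanNumber
definition (groups of EN separated by a single ES or CS). The state names
and the is_european_number function are hypothetical, chosen for this
example; this is not how the reference implementations are written.

```python
# States of a small deterministic finite automaton (names are illustrative).
START, IN_GROUP, AFTER_SEP, DEAD = range(4)

# Transition table: (state, bidi class) -> next state.
# Any pair not listed falls through to DEAD.
TRANSITIONS = {
    (START, "EN"): IN_GROUP,
    (IN_GROUP, "EN"): IN_GROUP,
    (IN_GROUP, "ES"): AFTER_SEP,
    (IN_GROUP, "CS"): AFTER_SEP,
    (AFTER_SEP, "EN"): IN_GROUP,
}

def is_european_number(classes):
    # Reads each class exactly once and keeps only one integer of state,
    # so it works on any iterable, including a one-shot iterator.
    state = START
    for c in classes:
        state = TRANSITIONS.get((state, c), DEAD)
        if state == DEAD:
            return False
    # Accept only if the input ended inside a group of EN characters.
    return state == IN_GROUP
```

The only memory used besides the table is the current state, and the input
is never revisited, which is the property that a backtracking regular
expression engine does not guarantee.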
In fact, I would posit that when phrased this way, it makes it easy to
combine all the X, W, N and I rules into a single pass algorithm that
degenerates into the "test for right-to-left characters" optimization
(mentioned in section 5.1) when there are no right-to-left characters.
This is something that not even the C++ and Java reference
implementations do (though it appears that the C++ implementation of
the W rules was originally derived from a regular expression as it
uses state tables, but if so it is undocumented). (Those two
implementations, by the way, have not been proven to be equivalent; they
have merely been tested. Proof is a much more complicated formalism.)
On Fri, Sep 10, 2010 at 8:50 PM, Khaled Hosny <firstname.lastname@example.org> wrote:
> On Fri, Sep 10, 2010 at 05:00:21PM -0700, Asmus Freytag wrote:
>> PS: Personally, I don't find the presentation in terms of the
>> regular expressions any more intuitive than the original.
> Some people, when confronted with a problem, think "I know,
> I'll use regular expressions." Now they have two problems.
> --Jamie Zawinski
> Khaled Hosny
> Arabic localiser and member of Arabeyes.org team
> Free font developer