Re: Arabic renderer in four lines of Perl

From: Roman Czyborra (czyborra@cs.tu-berlin.de)
Date: Wed Jun 24 1998 - 16:24:58 EDT


Dear Mark:

> Adapting a program or OS to work with Arabic is not a simple matter
> of drawing Arabic presentation forms

No, of course, there's more to it. I only said: it makes you wonder.

But in situations where you're confronted with Arabic text and don't
know how to print it, my little script does help. I also think that
this cheap method could be used to add Arabic rendering capabilities
to Mozilla, which currently still prints ??????? ??????? instead of
Arabic.

It also makes me wish that the missing South and East Asian and
Extended Arabic presentation forms would be added to the next plane or
the one after it, so that Unicode could also serve as a complete glyph
code. It already provides a useful glyph code for TrueType and BDF
fonts, whereas the ISO 10036 glyph registry does not seem to have
reached a state where complete and freely available 10646->10036
mapping tables and 10036-implementing fonts exist.

> It generally means both editing and line-wrapping Arabic text,

A simple pipe filter is indeed no interactive editor. There is a good
description, "Arabization of User Interfaces", in
\cite{delGaldo:1996:IUI}; see http://www.langbox.com/lastnews.html

I am a bit puzzled by your mention of line-wrapping. Pierre MacKay
wrote in his "Typesetting Problem Scripts" (Byte 1986-02 pp. 201-218)
that you can simply apply the traditional line-breaking algorithm to
Arabic text as well and then do the bidi reordering on the broken
lines. Is that no longer valid? The description of bidi rendering in
the Unicode standard does not mention line breaks, and all of its
examples are one-liners, although the global writing direction is
determined by block boundaries rather than line boundaries.
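
To illustrate the order of operations MacKay describes, here is a
minimal Python sketch. It is not taken from arabjoin or from the
Unicode standard. It assumes purely right-to-left text, uses uppercase
ASCII as a stand-in for Arabic letters, and lets a plain string
reversal stand in for real bidi reordering; both function names are
made up for this example.

```python
import textwrap


def break_then_reorder(logical, width):
    """Break logical-order text into lines, then reverse each line.

    Assumption: the text is purely right-to-left, so reversing a line
    is a crude substitute for bidi reordering.
    """
    lines = textwrap.wrap(logical, width=width)
    return [line[::-1] for line in lines]


def reorder_then_break(logical, width):
    """The wrong order, for comparison: reverse everything, then break."""
    return textwrap.wrap(logical[::-1], width=width)


text = "ALPHA BETA GAMMA DELTA"  # uppercase stands in for RTL letters
print(break_then_reorder(text, 11))  # ['ATEB AHPLA', 'ATLED AMMAG']
print(reorder_then_break(text, 11))  # ['ATLED AMMAG', 'ATEB AHPLA']
```

Breaking first keeps the beginning of the text on the first visual
line. Reordering the whole paragraph before breaking puts the end of
the text there instead.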

> Doing forms alone can be done in a few pages of code and data, as
> you illustrate. For examples of what needs to be done for editing,
> see http://www.ibm.com/java/education/international-text/index.html

Thanks for that pointer!

> The Bidirectional Algorithm is fully specified in the Unicode
> Standard, Version 2.0 (plus online errata). You can order it online
> if you want, through amazon.com or unicode.org.

I know. I proudly own a copy and I have read it more than once. My
lame excuse for not implementing its bidi is that I see no point in
racking my brain to get all of those data structures right if the
Unicode Consortium has programmed the algorithm for me already.

A more frank excuse would be that I find this particular chapter of
the standard hard to grasp. It is not machine-readable, nor do I find
any rationale for why this algorithm is better than others: Couldn't
you have defined it in a less complicated way? What does the current
system gain for numbers? Why aren't all Arabic digits simply stored
as written, in right-to-left order? Why aren't all decimal numbers
simply stored in little-endian order so that you immediately know the
value of the first digit you encounter? Doesn't the algorithm get the
global direction wrong if my English sentence starts with an Arabic
word? Wouldn't it be better to have no heuristics than unreliable
heuristics? Shall I blindly implement the bidi as specified, or am I
supposed to understand it and test it with some common sense? How am
I supposed to break lines, and can't I accept the bare \n line feed as
a block separator as well? At what point must I strip which control
codes? Couldn't you have used quotation marks to jump up two
embedding levels instead of invisible embedding control codes? Why
does HTML/RFC 2070 have entities for U+200[EF] (&lrm;, &rlm;) but not
for U+202[A-E] (LRE, RLE, PDF, LRO, RLO)? Does <BDO DIR="RTL"> really
equal RLO? How does the Unicode bidi interact with the bidi control
codes of its source standard ISO 6429?
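
The question about an English sentence that starts with an Arabic word
can be made concrete with a small Python sketch of a
first-strong-character heuristic, which is how the default rule for
the global direction is commonly described. The function name and the
LTR fallback are assumptions for illustration, and Python's
unicodedata module reports bidi classes from a newer Unicode version
than 2.0 (it distinguishes R from AL).

```python
import unicodedata


def first_strong_direction(paragraph, default="LTR"):
    """Guess the paragraph direction from its first strong character.

    Hypothetical helper: 'L' counts as left-to-right, 'R' and 'AL' as
    right-to-left; digits, spaces and punctuation are skipped.
    """
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


# An English sentence whose first word is Arabic (SEEN LAM ALEF MEEM).
sentence = "\u0633\u0644\u0627\u0645 is the Arabic word for peace."
print(first_strong_direction(sentence))                  # RTL
print(first_strong_direction("Peace is the word."))      # LTR
print(first_strong_direction("1998-06-24"))              # LTR (fallback)
```

Under this heuristic the mostly English sentence is classified as
right-to-left, which is exactly the case the question worries about.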

Well, the more I look for open questions to rant about, the more
answers I find to my own questions, but let me post them anyway.
I suspect that some of these uncertainties may also be the reason why
nobody else has presented an example bidi implementation yet.

> If you are doing a simple reversal simply for illustration, that's
> not a problem, but I want to make sure that your readers understand
> that that is not a compliant use of Unicode rendering for Arabic and
> Hebrew.

That's what I meant by my "oversimplify" warning.

Cheers
Roman

PS:

Mark Davis wrote:
> Roman Czyborra wrote:
> > #!/usr/local/bin/perl
> >
> > # arabjoin - a simple filter to render Arabic text
> > # © 1998-06-18 roman@czyborra.com
> > # Freeware license at http://czyborra.com/
> > # Latest version at http://czyborra.com/unicode/
> > # PostScript printout at http://czyborra.com/unicode/arabjoin.ps.gz
> >
> > # echo "Ø£Ù?ÙÑاÙ? باÙÑØ1اÙÑÙÖ!" | arabjoin
> > # prints !ﻢï»üïº*ï»åï»üïº*ïºë Ù?ï»*ﻫïºÉ

Your mailer seems to convert UTF-8 to ISO-8859-1 by
misinterpreting the bytes in the C1 range as MacRoman.
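
The suspected mix-up can be reproduced with a short Python sketch. It
assumes that every UTF-8 byte is displayed as ISO-8859-1 except for
bytes 0x80-0x9F, which are read as MacRoman. The function name garble
is made up, and the sketch does not account for the ? characters in
the quoted text.

```python
def garble(text):
    """Mimic the suspected conversion of UTF-8 text.

    Assumption: bytes in the C1 range (0x80-0x9F) are decoded as
    MacRoman, all other bytes as ISO-8859-1.
    """
    out = []
    for byte in text.encode("utf-8"):
        codec = "mac_roman" if 0x80 <= byte <= 0x9F else "latin-1"
        out.append(bytes([byte]).decode(codec))
    return "".join(out)


# ARABIC LETTER LAM (UTF-8: D9 84) and MEEM (UTF-8: D9 85)
print(garble("\u0644\u0645"))  # ÙÑÙÖ
```

The result matches the ÙÑÙÖ that precedes the exclamation mark in the
quoted echo line above.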

Content-Type: seems to be one of the few headers, next to From: and
Subject:, that the unicode.org mailing list server preserves rather
than distorts. I did not send the following duplicate of my original
message:

> Message-Id: <9806201530.AA19238@unicode.org>
> X-Uml-Sequence: 5372 (1998-06-20 15:23:53 GMT)
> From: Roman Czyborra <czyborra@cs.tu-berlin.de>
> To: Unicode List <unicode@unicode.org>
> Date: Sat, 20 Jun 1998 08:23:51 -0700 (PDT)
> Subject: Arabic renderer in four lines of Perl

The repost might be connected with my crosspost to the moderated
newsgroup comp.software.arabic, but the mailing list exploder will
never notice such things if it ignores my Message-ID.


