Re: Arabic renderer in four lines of Perl

From: Mark Davis (marked@best.com)
Date: Thu Jun 25 1998 - 19:26:11 EDT


Roman Czyborra wrote:

> Dear Mark:
>
> > Adapting a program or OS to work with Arabic is not a simple matter
> > of drawing Arabic presentation forms
>
> No, of course, there's more to it. I only said: it makes you wonder.
>
> But in situations where you're confronted with Arabic text and don't
> know how to print it, my little script does help. I also think that
> this cheap method could be used to add Arabic rendering capabilities
> to Mozilla which now still prints ??????? ??????? instead of Arabic.
>
> It also makes me wish that the missing South and East Asian and
> Extended Arabic presentation forms will get added to the next or

There are, practically speaking, an indefinitely large number of possible
glyph codes. Looking at TrueType, for example, the number and allocation of
glyph codes is completely arbitrary--each font designer can add any glyph,
with any mapping from characters to glyphs (including contextual mappings).
Adding more presentation forms is not the best way to deal with the
differences between character space and glyph space. There is an ISO WG2
document (I can't remember the reference right now) that goes into a great
deal of detail about this issue.
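
To make the character/glyph distinction concrete, here is a minimal sketch
in Python. The glyph IDs and the one-entry ligature table are invented for
illustration; a real TrueType font carries its own arbitrary numbering and
far richer contextual tables.

```python
# Toy font: glyph IDs are arbitrary numbers picked by a hypothetical designer.
CMAP = {
    "\u0644": 17,  # ARABIC LETTER LAM  -> nominal glyph
    "\u0627": 4,   # ARABIC LETTER ALEF -> nominal glyph
    "\u0628": 52,  # ARABIC LETTER BEH  -> nominal glyph
}

# Contextual mapping: a pair of nominal glyphs may collapse into one glyph.
LIGATURES = {
    (17, 4): 903,  # lam + alef -> a single lam-alef ligature glyph
}


def to_glyphs(text):
    """Map characters to nominal glyph IDs, then apply pairwise ligatures."""
    nominal = [CMAP[ch] for ch in text]
    out = []
    i = 0
    while i < len(nominal):
        pair = tuple(nominal[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])  # two characters, one glyph
            i += 2
        else:
            out.append(nominal[i])
            i += 1
    return out
```

Three characters (beh, lam, alef) come out as two glyphs here, and none of
the glyph numbers has any relation to a character code, which is why a
character standard cannot double as a complete glyph registry.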

> second next plane so that Unicode can also serve as a complete glyph
> code because it already gives a useful glyph code for TrueType and
> BDF fonts and the ISO 10036 glyph registry does not seem to be in a
> state where there are complete and freely available 10646->10036
> mapping tables and 10036-implementing fonts.
>
> > It generally means both editing and line-wrapping Arabic text,
>
> A simple pipe filter is indeed no interactive editor. There is a good
> description "Arabization of User Interfaces" in
> \cite{delGaldo:1996:IUI}, see http://www.langbox.com/lastnews.html
>
> I am a bit puzzled by your mention of line-wrapping. Pierre MacKay
> wrote in his "Typesetting Problem Scripts" (Byte 1986-02 pp. 201-218)
> that you can simply apply the traditional line break algorithm for
> Arabic text also and do bidi reordering on the broken lines. Is that
> no longer valid? The description of bidi rendering in the Unicode
> standard does not mention line breaks and all of its examples are
> one-liners although the global writing direction is determined by
> block boundaries instead of line boundaries.

Since the characters change width depending on their context, you can't
simply figure out the line breaks and then apply shaping--you will get lines
that are too long or too short. You need to do shaping first, then determine
line breaks based on the logical order of the shaped characters, and then
visually order the individual lines. This assumes that your line break does
not itself cause a change in the shaping; with fancier fonts it may.
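
The order of operations can be sketched roughly as follows, in Python. This
is a toy under loud assumptions: the widths are made up, every non-space
character is treated as joining on both sides, the whole line is assumed to
be right-to-left (so visual ordering is a plain reversal), and breaking a
line is assumed not to change the shaping.

```python
# Hypothetical advance widths per contextual form; real fonts differ.
WIDTHS = {"isolated": 10, "initial": 6, "medial": 5, "final": 8, "space": 4}

FORMS = {(False, False): "isolated", (False, True): "initial",
         (True, True): "medial", (True, False): "final"}


def shape(text):
    """Step 1: give each character a contextual form (toy joining rule)."""
    shaped = []
    for i, ch in enumerate(text):
        if ch == " ":
            shaped.append((ch, "space"))
            continue
        joins_prev = i > 0 and text[i - 1] != " "
        joins_next = i + 1 < len(text) and text[i + 1] != " "
        shaped.append((ch, FORMS[(joins_prev, joins_next)]))
    return shaped


def width(items):
    """Total advance width of a list of shaped characters."""
    return sum(WIDTHS[form] for _, form in items)


def break_lines(shaped, max_width):
    """Step 2: greedy breaks at spaces, in logical order, on shaped widths."""
    lines, line, word = [], [], []
    for item in shaped + [(" ", "space")]:  # sentinel space flushes last word
        if item[1] != "space":
            word.append(item)
            continue
        if not word:
            continue  # skip repeated spaces
        candidate = line + ([(" ", "space")] if line else []) + word
        if line and width(candidate) > max_width:
            lines.append(line)
            line = word
        else:
            line = candidate
        word = []
    if line:
        lines.append(line)
    return lines


def visual_order(line):
    """Step 3: reorder one line for display (pure right-to-left assumed)."""
    return line[::-1]
```

With these widths, "abc de" measures 37 units once shaped but would measure
54 if every letter were counted in its isolated form, which is exactly the
too-long/too-short error you get from breaking before shaping.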

When you justify, you also need to be prepared to do kashida justification
("stretching glyphs") instead of adding to the widths of spaces. Some
characters also change shape significantly in a justified context (such as
the "snake" kaf).
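
As a rough illustration of the idea, the Python sketch below pads a line by
inserting U+0640 ARABIC TATWEEL between joined letters instead of widening
spaces. It is only a stand-in: the function names are hypothetical, the list
of letters that do not connect forward is partial, and a real implementation
would stretch glyphs and rank the insertion points rather than spread
tatweels evenly.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the kashida character

# Letters that do not connect to the following letter (partial list only):
# hamza, alef forms, waw with hamza, teh marbuta, dal, thal, reh, zain, waw.
NON_FORWARD_JOINING = set(
    "\u0621\u0622\u0623\u0624\u0625\u0627\u0629\u062f\u0630\u0631\u0632\u0648"
)


def is_arabic_letter(ch):
    """Rough test for a basic Arabic letter (excludes tatweel itself)."""
    return "\u0621" <= ch <= "\u064a" and ch != TATWEEL


def justify_with_kashida(line, extra):
    """Insert `extra` tatweels at join points, distributed round-robin."""
    chars = list(line)
    # A join point is the gap after a letter that connects to the next letter.
    points = [i for i in range(len(chars) - 1)
              if is_arabic_letter(chars[i])
              and is_arabic_letter(chars[i + 1])
              and chars[i] not in NON_FORWARD_JOINING]
    if not points:
        return line  # nothing to stretch; caller must fall back to spaces
    counts = {p: 0 for p in points}
    for n in range(extra):
        counts[points[n % len(points)]] += 1
    out = []
    for i, ch in enumerate(chars):
        out.append(ch)
        out.append(TATWEEL * counts.get(i, 0))
    return "".join(out)
```

Note that the text works on logical order, so the function can run before
the line is reordered for display.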

>
>
> > Doing forms alone can be done in a few pages of code and data, as
> > you illustrate. For examples of what needs to be done for editing,
> > see http://www.ibm.com/java/education/international-text/index.html
>
> Thanks for that pointer!
>
> > The Bidirectional Algorithm is fully specified in the Unicode
> > Standard, Version 2.0 (plus online errata). You can order it online
> > if you want, through amazon.com or unicode.org.
>
> I know. I proudly own a copy and I have read it more than once. The
> lame excuse why I did not implement its bidi is that I see no point in
> wrenching my brain to get all of those data structures right if the
> Unicode Consortium has programmed the algorithm for me already.
>
> A more frank excuse would be that I find this particular chapter of
> the standard hard to grasp. It is neither machine-readable nor do I
> find motivations why this algorithm is better than others: Couldn't
> you have defined it a bit less complicated? What does the current
> system gain for numbers? Why aren't all Arabic digits simply stored
> as written in right-to-left order? Why aren't all decimal numbers
> simply stored in little-endian order so that you immediately know the
> value of the first digit you encounter? Doesn't the algorithm get the
> global direction wrong if my English sentence starts with an Arabic
> word? Wouldn't it be better to have no heuristics instead of insecure
> heuristics? Shall I blindly implement the bidi as specified or am I
> supposed to understand it and test it with some common sense? How am
> I supposed to break lines and can't I accept the bare \n line feed as
> a block separator also? At what point must I strip which control
> codes? Couldn't you have used quotation marks to jump up two
> embedding levels instead of invisible embedding control codes? Why
> does HTML/RFC2070 have entities for U+200[EF] (&lrm;&rlm;) but not for
> U+202[A-E] (LRE,RLE,PDF,LRO,RLO)? Does BDO="RTL" really equal RLO?
> How does the Unicode bidi interact with the bidi control codes of its
> source standard ISO6429?

Those are all good questions for a FAQ. Make sure that you read the errata
in Unicode 2.1 (http://www.unicode.org/unicode/reports/tr8.html) also.

While I don't have the time to answer your questions now, I can say that the
algorithm is designed to do a good job with normal text, and also allow for
"tweaking" for cases where it doesn't work properly. This implicit approach
has worked quite well for interactive typing on a number of systems. The
precise features of the algorithm are due to the collaboration of Arabic and
Hebrew experts working for Unicode corporate members, with additional
feedback from other experts.

The algorithm itself is not trivial, but it only takes a few pages to code
up properly if you are not worried about speed. If you are worried about
speed, then a state-table version--while more complicated--is worth the
extra effort. Such versions are probably viewed as IP by their companies,
and thus not made freely available.
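
To give a feel for the simple, unoptimized style, here is a sketch in Python
of just one piece: the final reordering step, as it is usually stated. It
assumes the embedding level of every character on the line has already been
resolved; resolving those levels is the bulk of the algorithm and is not
shown. The function name is made up for this example.

```python
def reorder_visual(levels):
    """Return logical indices in visual order for one line.

    `levels` holds one resolved embedding level per character. From the
    highest level down to the lowest odd level, every contiguous run of
    characters at that level or higher is reversed.
    """
    order = list(range(len(levels)))
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return order  # purely left-to-right: nothing to reverse
    highest = max(levels)
    lowest_odd = min(odd_levels)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = order[i:j][::-1]  # reverse this run in place
                i = j
            else:
                i += 1
    return order


# Usage: lowercase stands for left-to-right text, uppercase for right-to-left.
text = "ab CDE f"
levels = [0, 0, 0, 1, 1, 1, 0, 0]
display = "".join(text[i] for i in reorder_visual(levels))
```

This rescans the line once per level, which is fine when speed does not
matter; a state-table version would avoid the repeated passes.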

However, the consortium is in the process of making a reference version of
the algorithm available, written in Java.

>
>
> Well, the more I look for open questions to rant about the more
> answers I find to my own questions but let me post them anyway.
> I suspect that some of these insecurities may also be the reason why
> nobody else has presented an example bidi implementation yet.
>
> > If you are doing a simple reversal simply for illustration, that's
> > not a problem, but I want to make sure that your readers understand
> > that that is not a compliant use of Unicode rendering for Arabic and
> > Hebrew.
>
> That's what I meant with my "oversimplify" warning.
>
> Cheers
> Roman
>
> PS:
>
> Mark Davis wrote:
> > Roman Czyborra wrote:
> > > #!/usr/local/bin/perl
> > >
> > > # arabjoin - a simple filter to render Arabic text
> > > # 1998-06-18 roman@czyborra.com
> > > # Freeware license at http://czyborra.com/
> > > # Latest version at http://czyborra.com/unicode/
> > > # PostScript printout at http://czyborra.com/unicode/arabjoin.ps.gz
> > >
> > > # echo "?? 1!" | arabjoin
> > > # prints !** ?*
>
> Your mailer seems to convert UTF-8 to ISO-8859-1 by
> misinterpreting the bytes in the C1 range as MacRoman.
>
> Content-Type: seems to be one of the few headers next to From: and
> Subject: that the unicode.org mailing list server does not distort but
> preserve. I did not send the following duplicate of my original message:
>
> > Message-Id: <9806201530.AA19238@unicode.org>
> > X-Uml-Sequence: 5372 (1998-06-20 15:23:53 GMT)
> > From: Roman Czyborra <czyborra@cs.tu-berlin.de>
> > To: Unicode List <unicode@unicode.org>
> > Date: Sat, 20 Jun 1998 08:23:51 -0700 (PDT)
> > Subject: Arabic renderer in four lines of Perl
>
> The repost might be connected with my crosspost to the moderated
> newsgroup comp.software.arabic but the mailing list exploder will
> never notice such things if it ignores my Message-ID.


