Re: Arabic renderer in four lines of Perl

From: John Cowan (cowan@locke.ccil.org)
Date: Thu Jun 25 1998 - 13:45:12 EDT


Roman Czyborra wrote:

> A more frank excuse would be that I find this particular chapter of
> the standard hard to grasp.

So do I, so do I.

> It is neither machine-readable nor do I
> find motivations why this algorithm is better than others: Couldn't
> you have defined it a bit less complicated? What does the current
> system gain for numbers? Why aren't all Arabic digits simply stored
> as written in right-to-left order? Why aren't all decimal numbers
> simply stored in little-endian order so that you immediately know the
> value of the first digit you encounter?

Storing European digits in LE order violates too strongly the
premises of plain text, and mandates bidi algorithms for LTR
scripts as well. Instead, storing all numbers in BE order and
working a little bit harder for Arabic digits at presentation
time seems to make more sense. Algorithms that want to determine
the value of digit strings don't have to care about bidi; they
can just do the classic "multiply current value by 10, add value
of next digit".
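
A minimal sketch of that classic loop, in Python for illustration. The function name `digit_string_value` is made up for this example, and it assumes the digits arrive in logical (most significant first) order, which is how both European and Arabic-Indic digits are stored:

```python
import unicodedata

def digit_string_value(digits):
    """Value of a digit string stored most-significant digit first."""
    value = 0
    for ch in digits:
        # unicodedata.digit() raises ValueError for characters
        # that have no digit value.
        value = value * 10 + unicodedata.digit(ch)
    return value

# European digits and Arabic-Indic digits, same storage order.
print(digit_string_value("1998"))
print(digit_string_value("\u0661\u0669\u0669\u0668"))
```

The loop never consults directionality; display order is left to the renderer.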

> Doesn't the algorithm get the
> global direction wrong if my English sentence starts with an Arabic
> word? Wouldn't it be better to have no heuristics instead of insecure
> heuristics?

"Secure heuristic" is a *contradictio in adjecto*; if the heuristic
were 100% reliable, it would be an algorithm, not a heuristic.
If you don't want to use heuristics, don't use them, but then you
have to give your users an out-of-band method of specifying the
global direction per block *and* insist that they use it.
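
To make the trade-off concrete, here is a rough Python sketch. It assumes the heuristic in question is the "first strong character" rule, and the `override` parameter stands in for whatever out-of-band method you offer. The function name and the "LTR"/"RTL" return values are invented for this example; it ignores embeddings and overrides entirely:

```python
import unicodedata

def base_direction(block, override=None):
    """Guess the global direction of one block of text."""
    # An explicit out-of-band setting always wins over the guess.
    if override is not None:
        return override
    for ch in block:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    # No strong character found: fall back to left-to-right.
    return "LTR"

# An English sentence that happens to start with an Arabic word.
sentence = "\u0633\u0644\u0627\u0645 is an Arabic greeting."
print(base_direction(sentence))
print(base_direction(sentence, override="LTR"))
```

Without the override, the sentence is guessed to be right-to-left, which is exactly the failure case raised above; with it, the guess is never consulted.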

> Shall I blindly implement the bidi as specified or am I
> supposed to understand it and test it with some common sense?

Well, bugs have been found and corrected, so there may be more
bugs.

> How am
> I supposed to break lines and can't I accept the bare \n line feed as
> a block separator also?

As the clarifications make clear, several things besides LS and PS
can be treated as a block boundary, notably whatever your system
accepts as a line terminator (LF, CR, or CR+LF), and even things
like "<BR>" and "<P>" if the text is known to be HTML. All of
these count as "higher-level protocols".
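
For plain text, the splitting could look roughly like the Python sketch below. The names `BLOCK_BOUNDARY` and `split_blocks` are invented for this example, and it assumes the system accepts all three line terminators; the HTML cases would need a markup-aware layer and are left out:

```python
import re

# CR+LF is listed first so it counts as one boundary, not two.
# U+2028 is LS (line separator), U+2029 is PS (paragraph separator).
BLOCK_BOUNDARY = re.compile(r"\r\n|\r|\n|\u2028|\u2029")

def split_blocks(text):
    """Split text into blocks to be laid out independently."""
    return BLOCK_BOUNDARY.split(text)

print(split_blocks("first\r\nsecond\nthird\u2029fourth"))
```

Each resulting block would then get its own global direction and its own run of the bidi algorithm.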

> At what point must I strip which control
> codes?

Clarified by the corrections.

> Couldn't you have used quotation marks to jump up two
> embedding levels instead of invisible embedding control codes?

When you need it to be invisible, you need it. You can't demand
that plain text be expressed using quotation marks when there are
no such marks in the original.

I can't comment on the HTML specifics, except to say that the
point of them is to allow the bidi overrides to be given in
high-level form so that both the HTML source and the result
of rendering it look reasonable. If you impose an RLO, you
do not want embedded markup to appear RTL when looking at the
HTML source.

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)


