Re: Arabic renderer in four lines of Perl

From: Mark Davis (marked@best.com)
Date: Fri Jun 19 1998 - 11:27:31 EDT


Roman,

> # This little script also demonstrates that Arabic rendering is not
> # that complicated after all (it makes you wonder why some software
> # companies are still asking hundreds of dollars from poor students
> # who just want to print their Arabic texts) and that even Perl 4 can
> # handle Unicode text in UTF-8 without any nifty new add-ons.
>
Adapting a program or OS to work with Arabic is not simply a matter of drawing
Arabic presentation forms and applying the Unicode BIDI algorithm. It
generally means supporting both editing and line-wrapping of Arabic text, and
doing so everywhere with the same quality with which English text is
supported. The forms alone can be handled in a few pages of code and data, as
you illustrate. For examples of what needs to be done for editing, see:

http://www.ibm.com/java/education/international-text/index.html

> # Until the Unicode Consortium publishes its Unicode Technical
> # Report #9 (Bidirectional Algorithm Reference Implementation)
> # at http://www.unicode.org/unicode/reports/techreports.html
> # let us oversimplify things a bit and reverse everything:
>
The Bidirectional Algorithm is fully specified in the Unicode Standard,
Version 2.0 (plus online errata). You can order it online if you want, through
amazon.com or unicode.org.

If you are doing a simple reversal purely for illustration, that's not a
problem, but I want to make sure that your readers understand that it is not
a compliant way to render Arabic and Hebrew text in Unicode.
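
To make the point concrete, here is a minimal Perl sketch of one case that
whole-line reversal gets wrong: a run of European digits inside right-to-left
text has to keep its left-to-right order. This is not the Bidirectional
Algorithm and not part of the quoted script; it covers a single rule out of
many. The sub names naive_visual and digits_kept_visual are made up for this
illustration, and uppercase Latin letters stand in for Arabic letters so the
example stays plain ASCII.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Whole-line reversal, as done for illustration in the quoted script.
sub naive_visual {
    my ($logical) = @_;
    return scalar reverse $logical;
}

# Reverse the line, then restore the internal order of each number.
# Assumption: a "number" is a run of digits, optionally joined by '.' or ','.
sub digits_kept_visual {
    my ($logical) = @_;
    my $visual = scalar reverse $logical;
    $visual =~ s/(\d+(?:[.,]\d+)*)/scalar reverse("$1")/ge;
    return $visual;
}

# Uppercase letters play the role of right-to-left (Arabic) letters here.
my $line = "ABC 123 DEF";
print naive_visual($line), "\n";        # the digits come out mirrored
print digits_kept_visual($line), "\n";  # the digits keep their order
```

Even the second sub is far from compliant: embedded Latin words, neutral
characters, and explicit directional controls all need the rules in the
Standard.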


> # Bitstream's Cyberbit font offers 57 of the other 466 optional
> # ligatures in the U+FB50 Arabic Presentation Forms-A block:
>
 ...

> # The following table lists the presentation variants of each
> # character. Each value from the U+0600 block means that the
> # necessary glyph variant has not been assigned a code in Unicode's
> # U+FA00 compatibility zone. You may want to insert your private
> # glyphs or approximation glyphs for them:
>
This is one of the items that makes full Arabic support more interesting. Each
font generally will have a different set of ligatures, and the font is also
not limited to the presentation forms that are encoded in Unicode. (The
particular set of ligatures that happens to be in Unicode is there because it
was in an earlier version of ISO 10646.) For examples of font technology that
supports arbitrary mappings from characters to glyphs, look at:

http://www.adobe.com/supportservice/devrelations/opentype/main.htm
http://fonts.apple.com/WhitePapers/GXvsOTLayout.html
http://www.truetype.demon.co.uk/ttgx.htm

Fonts fall into roughly two categories:
- simple fonts: those restricted to the set of ligatures encoded as
presentation forms in Unicode.
- complex fonts: those which have ligatures that are not associated with a
Unicode character, and for each ligature, contain a mapping between a sequence
of Unicode characters (possibly within a context) and the resultant glyph code
for the ligature in question.

This means that in order to do a comprehensive job, one would need to handle
both simple and complex fonts. For simple fonts, each font needs to be
queried to see exactly which ligatures it does have. For complex fonts, the
mappings have to be loaded and used in order to determine the correct
ligatures.
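
The two cases could be sketched in Perl roughly as follows. This is only an
illustration under stated assumptions: the font descriptions %simple_font and
%complex_font, their keys has_glyph and ligatures, and the sub ligate are
hypothetical and do not correspond to any real font API. The glyph code
U+E000 is made up, and a real complex font's mappings may also depend on
context, which this sketch ignores.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two ligatures that are encoded as Unicode presentation forms:
# logical character sequence => presentation-form code point.
my %unicode_ligatures = (
    "\x{0644}\x{0627}" => "\x{FEFB}",   # LAM + ALEF, isolated form
    "\x{0644}\x{0622}" => "\x{FEF5}",   # LAM + ALEF WITH MADDA, isolated form
);

# Hypothetical simple font: limited to ligatures encoded in Unicode, and it
# must be queried to find out which of those it actually contains.
my %simple_font = (
    has_glyph => { "\x{FEFB}" => 1 },   # has LAM+ALEF, lacks the MADDA one
);

# Hypothetical complex font: carries its own sequence-to-glyph mappings.
my %complex_font = (
    ligatures => { "\x{0644}\x{0644}\x{0647}" => "\x{E000}" },  # made-up glyph
);

sub ligate {
    my ($text, $font) = @_;
    my %map;
    if ($font->{ligatures}) {
        # Complex font: load the mappings the font itself provides.
        %map = %{ $font->{ligatures} };
    } else {
        # Simple font: keep only the Unicode ligatures the font contains.
        for my $seq (keys %unicode_ligatures) {
            my $glyph = $unicode_ligatures{$seq};
            $map{$seq} = $glyph if $font->{has_glyph}{$glyph};
        }
    }
    # Apply longer sequences first so they win over shorter ones.
    for my $seq (sort { length($b) <=> length($a) || $a cmp $b } keys %map) {
        $text =~ s/\Q$seq\E/$map{$seq}/g;
    }
    return $text;
}

# Show the result as code points to avoid depending on terminal encoding.
my $out = ligate("\x{0644}\x{0627}", \%simple_font);
printf "U+%04X\n", ord($_) for split //, $out;
```

The sketch works on logical character sequences and ignores joining context
(initial, medial, final), which a real renderer would have to take into
account before choosing a ligature glyph.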


Roman Czyborra wrote:

> #!/usr/local/bin/perl
>
> # arabjoin - a simple filter to render Arabic text
> # © 1998-06-18 roman@czyborra.com
> # Freeware license at http://czyborra.com/
> # Latest version at http://czyborra.com/unicode/
> # PostScript printout at http://czyborra.com/unicode/arabjoin.ps.gz
>
> # This filter takes Arabic text (encoded in UTF-8 using the Unicode
> # characters from the U+0600 Arabic block in logical order) as input
> # and performs Arabic glyph joining on it and outputs a UTF-8 octet
> # stream that is no longer logically arranged but in a visual order
> # which gives readable results when formatted with a simple Unicode
> # renderer like Yudit that does not handle Arabic differently yet
> # but simply outputs all glyphs in left-to-right order.
>
> # This little script also demonstrates that Arabic rendering is not
> # that complicated after all (it makes you wonder why some software
> # companies are still asking hundreds of dollars from poor students
> # who just want to print their Arabic texts) and that even Perl 4 can
> # handle Unicode text in UTF-8 without any nifty new add-ons.
>
> # Usage examples:
>
> # echo "أ?ا? با1ا!" | arabjoin
> # prints !ﻢ** ?*ﻫ
> # which is the Arabic version of "Hello world!"
>
> # | recode ISO-8859-6..UTF-8 | arabjoin | uniprint -f cyberbit.ttf
> # prints an Arabic mail of charset=iso-8859-6-i on your printer
>
> # | arabjoin | xviewer yudit
> # delegates an Arabic UTF-8 message to a better viewer
>
> # ftp://sunsite.unc.edu/pub/Linux/apps/editors/X/ has uniprint in yudit-1.0
> # ftp://ftp.iro.umontreal.ca/pub/contrib/pinard/pretest/ has recode-3.4g
> # http://czyborra.com/unicode/ has arabjoin
> # http://czyborra.com/unix/ has xviewer
> # http://www.bitstream.com/cyberbit.htm or
> # ftp://ccic.ifcss.org/pub/software/fonts/unicode/ms-win/ or
> # ftp://ftp.irdu.nus.sg/pub/language/bitstream/ has cyberbit.ttf
>
> # This is how we do it: First we learn the presentation forms of each
> # Arabic letter from the end of this script:
>
> while(<DATA>)
> {
>     ($char, $_) = /^(\S+)\s+(\S+)/;
>     ($isolated{$char},$final{$char},$medial{$char},$initial{$char}) =
>         /([\xC0-\xFF][\x80-\xBF]+)/g;
> }
>
> # Then learn the (incomplete set of) transparent characters:
>
> foreach $char (split (" ", "
> ? * * * * ٰ
> s ? ۡ ۢ ۣ ۤ ۧ ۨ ۪ ۫ ۬ ?"))
> {
>     $transparent{$char}=1;
> }
>
> # Finally we can process our text:
>
> while (<>)
> {
> s/\n$//; # chop off the end of the line so it won't jump upfront
>
> @uchar = # UTF-8 character chunks
> /([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)/g;
>
> # We walk through the line of text and do contextual analysis:
>
> for ($i = $[; $i <= $#uchar; $i = $j)
> {
>     for ($b=$uchar[$j=$i]; $transparent{$c=$uchar[++$j]};){};
>
>     # The following assignment is the heart of the algorithm.
>     # It reduces the Arabic joining algorithm described on
>     # pages 6-24 to 6-26 of the Arabic character block description
>     # in the Unicode 2.0 Standard to four lines of Perl:
>
>     $uchar[$i] = $a && $final{$c} && $medial{$b}
>         || $final{$c} && $initial{$b}
>         || $a && $final{$b}
>         || $isolated{$b}
>         || $b;
>
>     $a = $initial{$b} && $final{$c};
> }
>
> # Until the Unicode Consortium publishes its Unicode Technical
> # Report #9 (Bidirectional Algorithm Reference Implementation)
> # at http://www.unicode.org/unicode/reports/techreports.html
> # let us oversimplify things a bit and reverse everything:
>
> $_= join ('', reverse @uchar);
>
> # The following 8 obligatory LAM+ALEF ligatures are encoded in the
> # U+FE70 Arabic Presentation Forms-B block in Unicode's
> # compatibility zone:
>
> s/ﺂﻟ/ﻵ/g;
> s/ﺂﻠ/ﻶ/g;
> s/ﺄﻟ/ﻷ/g;
> s/ﺄﻠ/ﻸ/g;
> s/ﺈﻟ/ﻹ/g;
> s/ﺈﻠ/ﻺ/g;
> s/ﺎﻟ/ﻻ/g;
> s/ﺎﻠ/ﻼ/g;
>
> # Bitstream's Cyberbit font offers 57 of the other 466 optional
> # ligatures in the U+FB50 Arabic Presentation Forms-A block:
>
> s/ﻢ/*/g;
> s/2/2/g;
> s/*/ﰿ/g;
> s/ﺢ/*/g;
> s/|/*/g;
> s/ﻢ//g;
> s/ﻰ//g;
> s/2//g;
> s/ﻢﻧ/*/g;
> s//*/g;
> s/*//g;
> s/*//g;
> s/*/ﱡ/g;
> s/*/ﱢ/g;
> s/ﺮ/ﱪ/g;
> s/|/?/g;
> s/2/ﱯ/g;
> s/ﺮ/ﱰ/g;
> s/|/3/g;
> s/2/ﱵ/g;
> s/2ﻨ/2*/g;
> s/ﺮﻴ/2/g;
> s/|ﻴ/2/g;
> s//2/g;
> s/ﺤ/2*/g;
> s/ﺨ/2*/g;
> s/ﻤ/2/g;
> s//2/g;
> s/ﺤ/2/g;
> s/ﺨ/2/g;
> s/ﻤ/2/g;
> s/ﻤ?/2|/g;
> s/ﻤ/2/g;
> s/ﻤﺣ/2/g;
> s/ﻤﺧ/2/g;
> s/ﻤ3/2/g;
> s//3/g;
> s/ﺤ/3S/g;
> s/ﺨ/3?/g;
> s/ﻤ/3/g;
> s/ﻬ/3*/g;
> s/ﻣ/3*/g;
> s/ﺤﻣ/3*/g;
> s/ﺨﻣ/3*/g;
> s/ﻤﻣ/3/g;
> s/ﻧ/3/g;
> s/ﺤﻧ/3/g;
> s/ﺨﻧ/3/g;
> s/ﻤﻧ/3/g;
> s/3/3s/g;
> s/ﺤ3/3?/g;
> s/ﺨ3/3/g;
> s/ﻤ3/3*/g;
> s/ﺤﻤ//g;
> s/ﻪ*/2/g;
> s/ﻢ3?/ﻪﻴ?/g;
> s/ﻪ*/*/g;
>
> print "$_\n";
> }
>
> # The following table lists the presentation variants of each
> # character. Each value from the U+0600 block means that the
> # necessary glyph variant has not been assigned a code in Unicode's
> # U+FA00 compatibility zone. You may want to insert your private
> # glyphs or approximation glyphs for them:
>
> __END__
> ء *
> آ *
> أ
> ؤ ?
> إ ?
> | S?
> ا **
> ب **
> ة
> ت ?
> ث s?
> ج **
> ? ﺡﺢﺤﺣ
> خ ﺥ|ﺨﺧ
> د ﺩﺪ
> ذ ﺫﺬ
> ر ?ﺮ
> 2 ﺯﺰ
> 3 ﺱ2ﺴ3
> ش ﺵﺶﺸﺷ
> ص 1ﺺ*ﺻ
> ض ***ﺿ
> ط *
> ظ ??
> 1 S?
> غ ****
> * ****
> *
> ?
> s?
> **
> ﻡﻢﻤﻣ
> ? ﻥ|ﻨﻧ
> ? ﻩﻪﻬﻫ
> ?ﻮ
> ﻯﻰ // ﯩﯨ
> S ﻱ2ﻴ3
> ٱ ?* // ?
> 2 22
> 3 33
> ٴ ٴ
> ٵ ٵٵ
> ٶ ٶٶ
> ٷ *ٷ
> ٸ ٸٸٸٸ
> 1 ?|???
> ٺ ?*???
> ٻ ????
> * ****
> * ****
> * ?????
> ٿ ????
> * ?s???*?
> * ****
>
> ???1?
> ?2?3??
>
> ? ???*?*
> ? ?*?**
>
>
> S SS
> ? ??
>
> *
> * ??
> * **
> * **
> *
>
>
>
>
> ? ?
>
> S?
>
> s ssss
> ? ????
>
> * ****
> * ****
>
>
> ڡ ڡڡڡڡ
> ڢ ڢڢڢڢ
> ڣ ڣڣڣڣ
> ڤ ?????
> ڥ ڥڥڥڥ
> | ????
> ڧ ڧڧڧڧ
> ڨ ڨڨڨڨ
> ک ***
> ڪ ڪڪڪڪ
> ګ ګګګګ
> ڬ ڬڬڬڬ
> ? ?
> ڮ ڮڮڮڮ
> گ
> ڰ ڰڰڰڰ
> ڱ s?*
> 2 2222
> 3 ?
> ڴ ڴڴڴڴ
> ڵ ڵڵڵڵ
> ڶ ڶڶڶڶ
> ڷ ڷڷڷڷ
> ں *ںں
> ڻ ﮡﮣﮢ
> * ****
> * ****
> * ﮪﮫ?ﮬ
> * ﮤﮥ
> * |ﮧﮩﮨ
>
>
>
> ﯡ
> ? s
> ?
> ?
> ﯢﯣ
> S SS
> ? *
> **ﯿ*
> * **
> * ****
> * ﯤﯥﯧ|
> * ****
>
>
>
>
> ? ??
> ? ??
>
>
> S SS
> ? ??
>
> * **
> * ****
> * ****
>
> ﮮﮯ
> ﮰﮱ
>
> ** ********




