Persian izafet (was: Re: Internet Explorer 5, Unicode Fonts, and Fontographer)

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Nov 01 1999 - 22:11:47 EST


Roozbeh wrote about U+06C0 ARABIC LETTER HEH WITH YEH ABOVE

>
> About this character, which I name YEH WITH SMALL YEH ABOVE, there's a bit
> of history in writing Persian. I am sorry I don't know anything about
> other languages using it.
>
> Of course, this is not a single character, but two. Persians consider it
> a HEH plus a SMALL YEH ABOVE (a not-yet-in-Unicode non-spacing mark).
>

[much good historical information deleted]

>
> BTW, I think a compatibility decomposition of HEH+ZWNJ+YEH should be added
> for this character. And a decomposition of HEH+SMALL YEH ABOVE after that
> character got in.

Unicode 2.0 (and all the updates to 2.1.9) provided no decomposition at
all for U+06C0, since there was no combining hamza (or yeh) encoded.

With the addition of U+0654 COMBINING HAMZA ABOVE to Unicode 3.0, a
decomposition of U+06C0 to U+06D5 + U+0654 has been added. U+06D5 is
the base form because its cursive joining behavior differs from that of
the regular Arabic letter HEH (U+0647). A combining hamza is used
because that is the visual form shown in most fonts, regardless of the
historical phonological status of the letter(s) in question. See the
published and approved UnicodeData.txt file for the decomposition.
The entry is:

06C0;ARABIC LETTER HEH WITH YEH ABOVE;Lo;0;AL;06D5 0654;;;;N;ARABIC LETTER HAMZAH ON HA;;;;
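
For implementers who want to check this entry mechanically, here is a
minimal sketch in Python. It splits the line on semicolons, reads the
decomposition field (the sixth field), and compares the result with what
the standard unicodedata module reports. The helper name
parse_decomposition is made up for illustration, and the sketch assumes
a Python build whose Unicode tables carry the same decomposition.

```python
import unicodedata

# The UnicodeData.txt entry quoted above, copied verbatim.
ENTRY = ("06C0;ARABIC LETTER HEH WITH YEH ABOVE;Lo;0;AL;06D5 0654;;;;N;"
         "ARABIC LETTER HAMZAH ON HA;;;;")


def parse_decomposition(entry):
    """Hypothetical helper: return (code point, is_compat, parts).

    Field 0 is the code point and field 5 is the decomposition mapping.
    A mapping that starts with a <tag> is a compatibility decomposition;
    one without a tag is canonical.
    """
    fields = entry.split(";")
    code_point = int(fields[0], 16)
    mapping = fields[5]
    is_compat = mapping.startswith("<")
    parts = [int(p, 16) for p in mapping.split() if not p.startswith("<")]
    return code_point, is_compat, parts


code_point, is_compat, parts = parse_decomposition(ENTRY)
decomposed = "".join(chr(p) for p in parts)

# Cross-check against the Unicode tables shipped with Python.
nfd = unicodedata.normalize("NFD", chr(code_point))
print(f"U+{code_point:04X} canonical={not is_compat} "
      f"parts={[f'U+{p:04X}' for p in parts]} matches_nfd={nfd == decomposed}")
```

Because the field holds exactly one mapping, there is nowhere to record
a second, alternative decomposition for the same character.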

The problem with Roozbeh's suggestion is that UnicodeData.txt cannot
accommodate multiple, conflicting decompositions of the same character.
Implementations that depend on the decompositions would break.
The Unicode 3.0 decompositions are already final and are being
implemented widely. Furthermore, the use of ZWNJ in a decomposition
is unprecedented, and would lead to other problems.

However, implementers of extended Arabic for Persian and other
related languages should be aware that for *some* purposes, U+06C0
should be treated as a ligature of HEH + YEH, while for other purposes,
it should be treated as an AE (HEH without normal joining) with a HAMZA
above it.
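
One way an implementation might carry both views is to keep the
standard decomposition for normalization and add its own, separate
mapping for the HEH + YEH reading. The following Python sketch is only
an illustration: the function name expand, the purpose labels, and the
choice of U+0647 and U+06CC as the HEH and YEH code points are all
assumptions of mine, not anything defined by the standard.

```python
import unicodedata

HEH_WITH_YEH_ABOVE = "\u06C0"

# Assumption for illustration only: which HEH and which YEH an
# implementation substitutes is its own decision. U+0647 ARABIC LETTER
# HEH and U+06CC ARABIC LETTER FARSI YEH are used here as placeholders.
HEH_YEH = "\u0647\u06CC"


def expand(text, purpose):
    """Hypothetical helper returning a purpose-specific view of text.

    "canonical": the standard view, AE (U+06D5) + COMBINING HAMZA ABOVE.
    "heh_yeh":   an implementation-private view, HEH followed by YEH.
    """
    if purpose == "canonical":
        return unicodedata.normalize("NFD", text)
    if purpose == "heh_yeh":
        # Compose first so that U+06D5 + U+0654 and U+06C0 are treated alike.
        composed = unicodedata.normalize("NFC", text)
        return composed.replace(HEH_WITH_YEH_ABOVE, HEH_YEH)
    raise ValueError(f"unknown purpose: {purpose!r}")


for purpose in ("canonical", "heh_yeh"):
    view = expand(HEH_WITH_YEH_ABOVE, purpose)
    print(purpose, [f"U+{ord(c):04X}" for c in view])
```

The second mapping lives entirely in the implementation, so it does not
conflict with the single decomposition recorded in UnicodeData.txt.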

This example should be taken as another case in point: not all
implementation issues can be resolved by "fixing" the character
encoding standard. There are many difficult edge cases, where the
character encoding standard simply has to place a stake in the ground,
and then implementations of textual behavior will need to add
more information and fine-tuning to get the kind of end results needed
for particular communities, languages, or scholastic endeavors.

--Ken


