Re: logical order (and Tamil)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jul 26 2002 - 19:32:11 EDT


Ummm. Logical order, visual order, aural order, phonemic order,
linear order... We are in danger of losing track of the ground we
stand on.

Logical order versus visual order, in the Unicode Standard,
refers to the relationship between backing store order and
display order. The main issue is for bidirectional text.

Display:

"abcdefg ZYXW hijkl."
>>>>>>> <<<< >>>>>
 1111111 2222 33333

Logical Order Backing Store:

[abcdefg WXYZ hijkl.]

Visual Order Backing Store:

[abcdefg ZYXW hijkl.]

The two different orders for the backing store are not
mutually compatible. You cannot allow both of them as possibilities
for the overall standard without hopelessly mixing up the content
of text. (This is a little like the possibility of allowing combining
marks to come both after and *before* their base forms.)
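To make the incompatibility concrete, here is a minimal sketch in Python. It assumes, as in the example above, that uppercase letters stand in for right-to-left characters. The function name logical_to_visual is made up for illustration, and this is emphatically not the Unicode Bidirectional Algorithm, just a toy run-reversal.

```python
import re

def logical_to_visual(text: str) -> str:
    """Toy conversion from logical order to display order.

    Assumption: uppercase ASCII letters stand for right-to-left characters,
    so each uppercase run is reversed for display. Not the real bidi algorithm.
    """
    return re.sub(r"[A-Z]+", lambda m: m.group(0)[::-1], text)

logical_store = "abcdefg WXYZ hijkl."
display = logical_to_visual(logical_store)

# Feeding a *visual order* store to the same renderer scrambles the text:
visual_store = "abcdefg ZYXW hijkl."
wrong_display = logical_to_visual(visual_store)
```

A renderer has to know which of the two conventions the store follows; the same string gives a different display under each, which is why a standard cannot permit both.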

There is a separate issue which has to do with alternative
models of Brahmi-derived scripts.

The "Indic model" is largely based on an abstraction of
the phonology of the language the script is used to write.
In principle, each consonant gets a character and each
vowel gets a character, and then the characters are placed
in the backing store in phonological order (i.e., in "aural
order", basically). However, because of complications in the
history of the scripts, some vowels in particular may be written
with two or even three graphic pieces, which may be over, under,
to the left of, or even on both sides of a consonantal letter.
As a result, "reordrant" or "surroundrant" characters are
included in the encoding as needed.

To make up a Devanagari-like artificial example with a
reordrant vowel:

Phonemic content     Backing store (chars)  Display (glyphs)
                                                    e
/ka ki ku ke ko/ --> [k(a) ki ku ke ko] ==> {k ik k k ko}
                                                  u

The "Thai model" is a typewriter-derived variant of the
Indic model that rules out reordrant or surroundrant characters,
because of the limitations of typewriter technology. For Thai
itself, the issue is limited to reordrant vowels (5 of them),
which are specified to be entered into the backing store *before*
the consonant they are associated with in a syllable, rather than
*after* that consonant. The Thai model still allows combining
marks for vowels placed above or below, but chooses a different
backing store order solution for those vowels whose glyphs
appear to the left of consonants.

Again, an artificial Thai-like example:

Phonemic content     Backing store (chars)  Display (glyphs)
                                                i
/ka ki ku ke ko/ --> [ka ki ku ek ok]   ==> {ka k k ek ok}
                                                  u

Note that bidirectionality is not at issue here. Both models
are unidirectional. There are simply local complications in
the mapping from phonology to backing store and backing store
to display. The Thai model is also referred to as "visual
order" encoding because the mapping from backing store to display
is straightforward: the order of the characters in the backing
store "follows" the visual order of the display glyphs -- at
least insofar as we are concerned with left-to-right placement
of glyphs.

Note, however, that *both* of these models inherently imply non-linear
mappings at some level. In the Indic model, the mapping from
phonology to backing store is straightforward, but the mapping
from backing store to display (i.e., the "rendering") will
have local direction reversals and/or one-to-two character-to-glyph
mappings, in the case of reordrant or surroundrant vowels.
The Thai model displaces the mapping complexity to the
mapping from phonology to backing store, while simplifying the
rendering.
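The tradeoff can be sketched with a toy in Python. Everything here is an assumption for illustration: syllables are (consonant, vowel) pairs, "e" and "o" are taken to be the vowels whose glyphs sit to the left of the consonant (as in the Thai-like example), and above/below marks are ignored. The function names are hypothetical.

```python
# Assumption: vowels whose glyph is displayed to the left of the consonant.
LEFT_VOWELS = {"e", "o"}

def store_indic(syllables):
    """Indic model: backing store follows phonological order (C then V)."""
    return [c + v for c, v in syllables]

def store_thai(syllables):
    """Thai model: left-placed vowels are stored before the consonant."""
    return [v + c if v in LEFT_VOWELS else c + v for c, v in syllables]

def render_indic(store):
    """Indic rendering must reorder left-placed vowels for display."""
    return [u[1] + u[0] if u[1] in LEFT_VOWELS else u for u in store]

def render_thai(store):
    """Thai rendering is the identity, as far as left-to-right order goes."""
    return list(store)

syllables = [("k", "a"), ("k", "i"), ("k", "u"), ("k", "e"), ("k", "o")]
indic = store_indic(syllables)   # phonological order in the store
thai = store_thai(syllables)     # visual order in the store
```

Both pipelines end at the same display; the reordering step simply sits in render_indic for one model and in store_thai for the other.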

Given this picture, it should now be easier to see why Thai
rendering is easier than Devanagari, but Thai sorting
(which runs afoul of the mismatch between phonology and
backing store order) is more problematical. It is simply
a tradeoff of which level of processing gets the complexity.
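A rough sketch of the sorting side of the tradeoff, in Python. It uses the five Thai vowels that are stored before their consonant (U+0E40 through U+0E44) and builds a comparison key by swapping each one with the character that follows it. The function name phonological_key is hypothetical, and real Thai collation involves considerably more than this; the sketch only shows where the rearrangement has to happen.

```python
# The five Thai vowels stored before the consonant (U+0E40..U+0E44).
PREVOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")

def phonological_key(word: str) -> str:
    """Toy sort key: swap each leading vowel with the following character."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREVOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

kaa = "\u0e01\u0e32"    # KO KAI + SARA AA
kae = "\u0e41\u0e01"    # SARA AE + KO KAI (vowel stored first)
khaa = "\u0e02\u0e32"   # KHO KHAI + SARA AA

naive = sorted([kae, kaa, khaa])                            # raw code point order
by_consonant = sorted([kae, kaa, khaa], key=phonological_key)
```

In raw code point order the word beginning with SARA AE lands after every consonant-initial word; with the swapped key it sorts under KO KAI, ahead of the KHO KHAI word.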

Now Tamil is currently encoded in the Unicode Standard following
the generic Indic model. It has both reordrant and surroundrant
vowels.

Thus, again using an artificial example, we currently have:

Phonemic content     Backing store (chars)  Display (glyphs)

/ka ki ku ke ko/ --> [k(a) ki ku ke ko] ==> {k ki k ek ekA}
                                                  u

(where "A" here is standing for the glyph associated with the
long vowel /aa/, and the /o/ vowel is rendered with a surroundrant
vowel glyph: {e-A}).
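The {e-A} structure can be seen in the character data itself. A small check with Python's standard unicodedata module, assuming the Unicode data bundled with the interpreter carries the usual canonical decomposition for TAMIL VOWEL SIGN O (U+0BCA):

```python
import unicodedata

KA = "\u0b95"       # TAMIL LETTER KA
SIGN_O = "\u0bca"   # TAMIL VOWEL SIGN O

# Canonical decomposition, given as space-separated hex code points.
parts = unicodedata.decomposition(SIGN_O)

# NFD of /ko/: consonant, then the e sign, then the aa sign.
decomposed = unicodedata.normalize("NFD", KA + SIGN_O)
```

Note that even in decomposed form both vowel pieces follow the consonant in the backing store; placing the first piece to the left of the consonant is left to the renderer.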

What Sinnathurai Srivas' suggestions for Tamil seem to amount
to is a reform of the writing system to get rid of reordrant,
surroundrant and ligating vowels. So a "pin-U" glyph is
introduced, which follows a consonant in display (without ligating).
And new e and o glyphs are introduced, which also follow a consonant
in display. If this is, indeed, the essence of the suggestion,
then the picture which would result is:

Phonemic content     Backing store (chars)  Display (glyphs)

/ka ki ku ke ko/ --> [k(a) ki ku ke ko] ==> {k ki kU kE kO}

(where "U", "E", and "O" are the 3 new introduced glyphs in
question).

If this is, indeed, the essence of "Linear Tamil", then
Sinnathurai's claim:

> The other point is, in the reform, there is no call for change in how Tamil
> Unicode is defined now. It is just a matter of how the complex rendering can
> be avoided and still get better, yes better redability of Tamil text.

would seem to be basically correct. Standard Tamil and Linear Tamil
would simply be two alternate renderings of the same backing
store. And if that is the case, then the Unicode Standard neither
promotes nor stands in the way, here. The encoding would be valid
in either case.
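As a sketch of what "two alternate renderings of the same backing store" means, here is a toy in Python. The glyph tables are hypothetical and simply transcribe the artificial examples above; "C" marks the slot for the consonant glyph, and "C+u" stands for a consonant with the u sign attached.

```python
# Hypothetical glyph templates transcribed from the artificial examples.
STANDARD = {
    "a": ["C"],
    "i": ["C", "i"],
    "u": ["C+u"],          # u sign attached to the consonant
    "e": ["e", "C"],       # reordrant: glyph precedes the consonant
    "o": ["e", "C", "A"],  # surroundrant: glyphs on both sides
}
LINEAR = {
    "a": ["C"],
    "i": ["C", "i"],
    "u": ["C", "U"],       # new "pin-U" glyph follows the consonant
    "e": ["C", "E"],       # new e glyph follows the consonant
    "o": ["C", "O"],       # new o glyph follows the consonant
}

def render(store, scheme):
    """Map (consonant, vowel) pairs to glyph lists under a given scheme."""
    return [[g.replace("C", c) for g in scheme[v]] for c, v in store]

# One backing store, unchanged between the two schemes.
store = [("k", "a"), ("k", "i"), ("k", "u"), ("k", "e"), ("k", "o")]
standard_display = render(store, STANDARD)
linear_display = render(store, LINEAR)
```

Only the glyph table differs between the two calls; nothing about the stored characters changes.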

So I would like to get a clarification of MichKa's claim that:

> Sinnathurai Srivas is a member of INFITT's WG02 (Working Group 02, Unicode
> Tamil) who has been long advocating changes to Unicode Tamil that would be
> done in a "linear" manner that would remove the requirement of complex
> rendering. It would of course require many changes to rendering rules and
> character properties.

I can see, yes, that scheme B would have different rendering rules
than the (standard) scheme A, but where do we end up with different
character properties? All of the Tamil vowels (u, uu, e, ee, o, oo)
in question are combining marks of class 0, and would stay so in either
scheme. Or I guess another way of saying this is that while I see
obvious differences in *glyph* properties here, I don't see any call
for differences in *character* properties.
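The combining classes are easy to check against whatever version of the Unicode data ships with Python, using the standard unicodedata module. The dictionary name below is made up for illustration.

```python
import unicodedata

# The Tamil vowel signs in question.
TAMIL_VOWEL_SIGNS = {
    "u": "\u0bc1",
    "uu": "\u0bc2",
    "e": "\u0bc6",
    "ee": "\u0bc7",
    "o": "\u0bca",
    "oo": "\u0bcb",
}

# Canonical combining class and general category for each sign.
classes = {name: unicodedata.combining(ch) for name, ch in TAMIL_VOWEL_SIGNS.items()}
categories = {name: unicodedata.category(ch) for name, ch in TAMIL_VOWEL_SIGNS.items()}
```

A font that draws these signs after the consonant instead of before or around it would not touch either of these values.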

As to whether any significant community of users would find Sinnathurai's
suggested rendering and fonts acceptable or legible, and whether such
a scheme would catch on appreciably, I have no idea. As a mobile
messaging scheme it certainly looks closer to the original than the
kind of communication that can routinely be found in Docomo phones!

--Ken


