RE: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Dec 08 2003 - 19:17:34 EST

Next message: Elaine Keown: "Re: New symbols (was Qumran Greek)"

Previous message: Philippe Verdy: "RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)"
Maybe in reply to: Peter Jacobi: "Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))"
Next in thread: Peter Kirk: "Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))"
Reply: Peter Kirk: "Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))"

Peter Jacobi said:

> Unicode doesn't prevent styling, of course. But having 'logical' order
> instead of 'visual' makes it a hard task for the application and the
> renderer.
> This is witnessed by the thin-spread support for this.

Yes, but having visual order instead of logical order makes
*other* tasks difficult for the application. There is a
tradeoff here.

The Brahmi-derived script which got a grandfathered visual order
into Unicode is Thai (and Lao), because of TIS 620-2533. That
definitely makes some aspects of coexistence with legacy data
easier for Thai in Unicode. But it also meant pulling in all
the complications for searching and sorting, among other things.
The Unicode Collation Algorithm (and every other implementation
of sorting) has to be special-cased for Thai, as a result, in
order to get expected ordering results. And we live with *that*
complication forever.
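
To make the Thai special-casing concrete, here is a minimal sketch in Python. It is not the Unicode Collation Algorithm itself, which works with collation weights; it only illustrates the rearrangement idea, assuming a plain code point comparison and a hypothetical helper named logical_key that swaps each prevowel (U+0E40..U+0E44) with the consonant that follows it before comparing.

```python
# Thai prevowels SARA E..SARA AI MAIMALAI are stored before the consonant
# they belong to (visual order), so a sort has to undo that first.
PREVOWELS = {chr(cp) for cp in range(0x0E40, 0x0E45)}
CONSONANTS = {chr(cp) for cp in range(0x0E01, 0x0E2F)}  # KO KAI..HO NOKHUK


def logical_key(word: str) -> str:
    """Hypothetical sort key: move each prevowel after its consonant."""
    chars = list(word)
    i = 0
    while i + 1 < len(chars):
        if chars[i] in PREVOWELS and chars[i + 1] in CONSONANTS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


# SARA AE + KO KAI, and KHO KHAI + SARA AA
words = ["\u0e41\u0e01", "\u0e02\u0e32"]

naive = sorted(words)                       # compares the stored prevowel first
reordered = sorted(words, key=logical_key)  # compares the consonants first
```

With the naive sort the word that starts with KHO KHAI comes first, because the stored prevowel has a higher code point than any consonant. With the key applied, the word whose consonant is KO KAI comes first, which is the consonant-first order a reader would expect.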

> 'Logical order' makes a lot sense for heavily conjunct forming,
> 2-D compositing scripts. It is not such a perfect match for Tamil,
> which is essentially 1-D and has a well-defined visual order of
> characters.

It doesn't stack as much as some other Brahmi-derived scripts,
but it still has some substantial ligating behavior (cf. -u, -uu),
and this is going to cause significant problems for attempting
to do systematic markup of particular syllabic pieces (consonants
as opposed to vowels, for example), *regardless* of the logical
order issue.

> But excuse my lamenting, I'm not
> into utopian and ill-advised projects of re-doing all from scratch.

Noted.

I concur with the Peters here. Markup of text is essentially
outside the scope of Unicode. The problem of how to do
various kinds of orthographic and/or linguistic markup and
highlighting in complex scripts while not arbitrarily breaking
the rendering itself is an issue for negotiation between
the markup protocols and the text rendering engines and fonts.

In an earlier post:

> So, to promote Unicode usage, in a community, which partly sees
> ISCII unification as a conspiracy against the Dravidian languages,
> it would be very helpful to demonstrate, that everything that can
> be done with the legacy encodings, can also be done using Unicode.

This is a bit off down the garden path, though. As the discussion
in this thread has made clear, we are talking about behavior
above the level of the representation of the plain text content.

First of all, it is quite evident that the same plain text
content as represented in TSCII can also be represented in
Unicode. That is the sufficiency test that has to be applied
to the Unicode Standard.

Against that, one balances the following pros and cons:

A. Con. It is more difficult to get browsers using Unicode to
take HTML span markup (color or whatever) of Tamil consonants
to render as expected when dealing with left-side (reordrant)
Tamil vowels or the two-part vowels. Because TSCII uses
visual order, such behavior is much more straightforward in
these particular cases.

B. Pro. It is much easier to get collators to behave correctly
for Tamil data when dealing with left-side or two-part vowels,
because they are stored in logical order and do not add
complications on top of the already difficult issues of
syllable weighting for Tamil or other languages using Indic
scripts.
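
Points A and B can be illustrated with a small Python sketch. It assumes a much simplified model of visual order: canonically decompose the two-part vowel signs, then move each left-side vowel sign (U+0BC6..U+0BC8) in front of the single preceding character. The function name visual_order is hypothetical, and the sketch ignores consonant clusters and is not a TSCII converter; it only shows where the pieces end up in each ordering.

```python
import unicodedata

# Tamil vowel signs E, EE, AI are displayed to the left of the consonant.
LEFT_SIDE_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}


def visual_order(text: str) -> str:
    """Simplified logical-to-visual reordering (assumption: one consonant
    per syllable, no clusters)."""
    out = []
    # NFD splits two-part signs, e.g. U+0BCA becomes U+0BC6 + U+0BBE.
    for ch in unicodedata.normalize("NFD", text):
        if ch in LEFT_SIDE_SIGNS and out:
            out.insert(len(out) - 1, ch)  # place it before the consonant
        else:
            out.append(ch)
    return "".join(out)


ko = "\u0b95\u0bca"   # KA + VOWEL SIGN O (two-part)
kai = "\u0b95\u0bc8"  # KA + VOWEL SIGN AI (left-side)
caa = "\u0b9a\u0bbe"  # CA + VOWEL SIGN AA (right-side)

logical_sorted = sorted([caa, kai])
visual_sorted = sorted([visual_order(caa), visual_order(kai)])
```

In the logical strings the consonant is always first, so even a plain code point sort puts the KA syllable before the CA syllable. In the visual strings the KA of the first two syllables is preceded by a vowel piece (and for the two-part vowel is surrounded by both pieces), which is convenient for wrapping the consonant in markup but makes the same sort order the syllables by their vowel sign instead.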

> Having an 'invisible consonant' to call for rendering of the vowel sign
> in isolation (and without the dotted circle), would also help the limited
> number of cases where the styled single character is needed - but in
> a rather hackish way.

That is what the SPACE as base character is for. If some renderers
insist on rendering such combinations with a dotted circle glyph,
that is an issue in the renderer -- it is not a defect in the
encoding standard for not having a way to represent the vowel
sign in isolation.
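
As a minimal sketch of the SPACE-as-base convention, assuming Python and a hypothetical helper named isolated_sign: the vowel sign is simply appended to a base character, with SPACE as the default. The base parameter is an assumption added here so that a different base, such as U+00A0 NO-BREAK SPACE, can be tried if a renderer handles it better. Whether a dotted circle appears is still up to the renderer and font, and nothing in this code can control that.

```python
import unicodedata


def isolated_sign(sign: str, base: str = "\u0020") -> str:
    """Return base + sign so the sign can be shown without a consonant.

    Hypothetical helper: it only builds the character sequence and
    rejects input that is not a combining mark (general category M*).
    """
    if len(sign) != 1 or not unicodedata.category(sign).startswith("M"):
        raise ValueError("expected a single combining mark")
    return base + sign


# TAMIL VOWEL SIGN I on a SPACE base
sample = isolated_sign("\u0bbf")
```

The result is an ordinary two-character string, so it can be wrapped in span markup like any other text.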

--Ken


This archive was generated by hypermail 2.1.5 : Mon Dec 08 2003 - 19:58:56 EST