RE: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Dec 08 2003 - 19:17:34 EST


    Peter Jacobi said:

    > Unicode doesn't prevent styling, of course. But having 'logical' order
    > instead of 'visual' makes it a hard task for the application and the
    > renderer.
    > This is evidenced by the sparse support for it.

    Yes, but having visual order instead of logical order makes
    *other* tasks difficult for the application. There is a
    tradeoff here.

    The Brahmi-derived script that got a grandfathered visual order
    in Unicode is Thai (along with Lao), because of TIS 620-2533. That
    definitely makes some aspects of coexistence with legacy data
    easier for Thai in Unicode. But it also meant pulling in all
    the complications for searching and sorting, among other things.
    The Unicode Collation Algorithm (and every other implementation
    of sorting) has to be special-cased for Thai, as a result, in
    order to get expected ordering results. And we live with *that*
    complication forever.
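
    To make the storage difference concrete, here is a minimal sketch
    using Python's standard unicodedata module. The syllable "ke" is
    only an illustrative choice, and describe() is a hypothetical
    helper name. Both syllables display with the vowel to the left of
    the consonant, but only Thai stores it that way.

```python
import unicodedata

def describe(text):
    """Return the Unicode character names of text, in stored order."""
    return [unicodedata.name(ch) for ch in text]

# Thai "ke": the vowel is stored first, matching what is displayed (visual order).
thai_ke = "\u0E40\u0E01"
# Tamil "ke": the consonant is stored first, although the vowel sign
# is displayed to its left (logical order).
tamil_ke = "\u0B95\u0BC6"

print(describe(thai_ke))
print(describe(tamil_ke))

# The Thai vowel is an ordinary letter; the Tamil one is a combining mark.
print(unicodedata.category("\u0E40"), unicodedata.category("\u0BC6"))
```

    A collator that compares the first stored character of each Thai
    syllable would be comparing vowels rather than consonants, which
    is why Thai needs the special-casing described above.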

    >
    > 'Logical order' makes a lot of sense for heavily conjunct-forming,
    > 2-D compositing scripts. It is not such a perfect match for Tamil,
    > which is essentially 1-D and has a well-defined visual order of
    > characters.

    It doesn't stack as much as some other Brahmi-derived scripts,
    but it still has some substantial ligating behavior (cf. -u, -uu),
    and this is going to cause significant problems for attempting
    to do systematic markup of particular syllabic pieces (consonants
    as opposed to vowels, for example), *regardless* of the logical
    order issue.
     
    > But excuse my lamenting, I'm not
    > into utopian and ill-advised projects of re-doing all from scratch.

    Noted.

    I concur with the Peters here. Markup of text is essentially
    outside the scope of Unicode. The problem of how to do
    various kinds of orthographic and/or linguistic markup and
    highlighting in complex scripts while not arbitrarily breaking
    the rendering itself is an issue for negotiation between
    the markup protocols and the text rendering engines and fonts.

    In an earlier post:

    > So, to promote Unicode usage, in a community, which partly sees
    > ISCII unification as a conspiracy against the Dravidian languages,
    > it would be very helpful to demonstrate, that everything that can
    > be done with the legacy encodings, can also be done using Unicode.

    This is a bit off down the garden path, though. As the discussion
    in this thread has made clear, we are talking about behavior
    above the level of the representation of the plain text content.

    First of all, it is quite evident that the same plain text
    content as represented in TSCII can also be represented in
    Unicode. That is the sufficiency test that has to be applied
    to the Unicode Standard.

    Against that, one balances the following pros and cons:

    A. Con. It is more difficult to get browsers using Unicode to
    take HTML span markup (color or whatever) of Tamil consonants
    to render as expected when dealing with left-side (reordrant)
    Tamil vowels or the two-part vowels. Because TSCII uses
    visual order, such behavior is much more straightforward in
    these particular cases.

    B. Pro. It is much easier to get collators to behave correctly
    for Tamil data when dealing with left-side or two-part vowels,
    because they are stored in logical order and do not add
    complications on top of the already difficult issues of
    syllable weighting for Tamil or other languages using Indic
    scripts.
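
    Point B can be illustrated with a rough sketch. It uses a plain
    code point sort, which is *not* real Tamil collation, and it
    simulates visual order with a hypothetical to_visual_order()
    helper operating on Unicode code points; no actual TSCII byte
    values are involved. The three syllables are arbitrary examples.

```python
# Vowel signs that are displayed to the left of their consonant.
LEFT_SIDE_VOWEL_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}  # TAMIL VOWEL SIGN E, EE, AI

def to_visual_order(syllable):
    """Hypothetical helper: move a left-side vowel sign in front of its consonant.

    Only handles a bare consonant + vowel sign pair, which is all this sketch needs.
    """
    if len(syllable) == 2 and syllable[1] in LEFT_SIDE_VOWEL_SIGNS:
        return syllable[1] + syllable[0]
    return syllable

# Logical order, as stored in Unicode: paa, ke, kaa.
logical = ["\u0BAA\u0BBE", "\u0B95\u0BC6", "\u0B95\u0BBE"]
visual = [to_visual_order(s) for s in logical]

# Naive code point comparison in both storage orders.
sorted_logical = sorted(logical)
sorted_visual = sorted(visual)

print(sorted_logical)  # kaa, ke, paa: both KA syllables stay together
print(sorted_visual)   # kaa, paa, ke: "ke" is split off by its leading vowel sign
```

    In logical order even a naive comparison keeps the syllables that
    start with KA adjacent. In the simulated visual order, the
    leading vowel sign decides the position of "ke", so a collator
    would first have to rearrange the text before weighting it.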

    > Having an 'invisible consonant' to call for rendering of the vowel sign
    > in isolation (and without the dotted circle), would also help the limited
    > number of cases where the styled single character is needed - but in
    > a rather hackish way.

    That is what the SPACE as base character is for. If some renderers
    insist on rendering such combinations with a dotted circle glyph,
    that is an issue in the renderer -- it is not a defect in the
    encoding standard for not having a way to represent the vowel
    sign in isolation.
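
    As a minimal sketch of that convention, the sequence is simply
    SPACE followed by the vowel sign. The isolated_sign() helper and
    its base parameter are hypothetical names for illustration; how
    the result is drawn still depends entirely on the renderer and
    font.

```python
import unicodedata

SPACE = "\u0020"
TAMIL_VOWEL_SIGN_E = "\u0BC6"

def isolated_sign(sign, base=SPACE):
    """Hypothetical helper: attach a combining mark to a visible-width base.

    Raises ValueError if `sign` is not a combining mark (general category M*).
    """
    if not unicodedata.category(sign).startswith("M"):
        raise ValueError("expected a combining mark")
    return base + sign

isolated = isolated_sign(TAMIL_VOWEL_SIGN_E)

# Show the stored code points of the resulting two-character sequence.
print([f"U+{ord(ch):04X}" for ch in isolated])
```

    The base is a parameter so that a caller can substitute another
    base character, such as NO-BREAK SPACE, if a line break before
    the sign would be a problem.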

    --Ken


