RE: Transcoding Tamil in the presence of markup

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 06 2003 - 18:53:57 EST

    Christopher John Fynn writes:
    > In Unicode U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs.
    > IE is probably treating a base character and any dependent vowels as
    > a single unit. Since in some fonts a base character + combining vowel
    > mark might be displayed by a single ligature glyph, it makes sense to
    > apply the formatting of a base character to any dependent combining
    > characters as well.
    >
    > In Mozilla you may be completely breaking the font lookups by separately
    > formatting the different parts of a conjunct.
    >
    > In legacy glyph-based Tamil encodings there was a simple one-to-one
    > correspondence between characters and glyphs, so it is straightforward
    > to apply different formatting to different characters.

    Still, this is an interesting problem: some texts, notably linguistic
    texts about grammar or orthography, want to show a diacritic in a
    different color from the base letter it is attached to.

    So for example you might want to show the difference between the two
    French words "désert" and "dessert" by coloring the accent of the first
    word or the second s of the second; or, more to the point, between
    "bailler" (to grant a lease) and "bâiller" (to yawn, to open wide),
    where the presence or absence of the circumflex on the letter 'a' is
    what marks the difference in both meaning and pronunciation.

    However, this is not a problem of Unicode itself, but of the rich-text
    format used to add style to a given text. In Unicode (and even in HTML
    and SGML), a letter 'a' followed by a combining circumflex is
    canonically equivalent to the precomposed letter 'a' with circumflex.
    But if you insert tags between a base letter and its diacritics, you
    split the text into separate strings, and the second string, which
    starts with the circumflex, is then a defective combining sequence.
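
    To make this concrete, here is a minimal Python sketch using the
    standard unicodedata module. The helper name starts_defective is
    only an illustration, and testing the general category for "M" is
    a simplification of the formal definition of a defective combining
    sequence. It checks the canonical equivalence mentioned above, then
    splits the decomposed string where a tag would be inserted:

```python
import unicodedata

decomposed = "a\u0302"   # 'a' + COMBINING CIRCUMFLEX ACCENT
precomposed = "\u00e2"   # LATIN SMALL LETTER A WITH CIRCUMFLEX

# Canonical equivalence: both spellings normalize to the same string.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed


def starts_defective(text):
    """Return True if text begins with a combining mark (no base before it).

    Hypothetical helper: it treats any character whose general category
    starts with "M" (Mn, Mc, Me) as a combining mark.
    """
    return bool(text) and unicodedata.category(text[0]).startswith("M")


# Split between base and mark, as a tag boundary would do.
first, second = decomposed[:1], decomposed[1:]

print(same_nfc, same_nfd)
print(starts_defective(first), starts_defective(second))
```

    The same test also flags the Tamil dependent vowel signs quoted
    above, since their general category is Mc.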

    As far as Unicode is concerned, this circumflex will logically try to
    form a combining sequence with whatever precedes it, here the closing
    '>' of the HTML, SGML or XML tag. This will break many parsers that
    apply the Unicode rules when handling files encoded with a Unicode
    encoding scheme such as UTF-8.
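
    A rough Python sketch of how a tool could detect this situation.
    The function name marks_after_tags and the sample strings are
    assumptions made for illustration, and the check is deliberately
    naive (it looks at every '>' rather than really parsing the
    markup). The last lines show one case where normalization
    actually swallows the tag delimiter: '>' followed by U+0338
    composes to U+226F under NFC.

```python
import unicodedata


def marks_after_tags(markup):
    """Return the offsets of combining marks that directly follow a '>'.

    Hypothetical helper: no real tag parsing is done, so a literal '>'
    in text content is treated the same way as the end of a tag.
    """
    return [
        i
        for i in range(1, len(markup))
        if markup[i - 1] == ">"
        and unicodedata.category(markup[i]).startswith("M")
    ]


# The mark is the first character inside the span: defective sequence.
risky = 'a<span class="diac-red">\u0302</span>'
# The mark has a NO-BREAK SPACE as its base: nothing is flagged.
safe = 'a<span class="diac-red">\u00a0\u0302</span>'

print(marks_after_tags(risky))
print(marks_after_tags(safe))

# NFC merges '>' + COMBINING LONG SOLIDUS OVERLAY into NOT GREATER-THAN,
# so the opening tag loses its closing delimiter.
broken = unicodedata.normalize("NFC", "<b>\u0338</b>")
print(broken)
```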

    Creating a text that uses this HTML "feature" is very hazardous, as
    the interpretation and rendering of defective combining sequences is
    implementation-specific: an application may render the diacritics on
    a dotted-circle base glyph, or display them on an empty base glyph,
    or attach the defective combining sequence to the previous combining
    sequence, or it may simply be unable to render the sequence, because
    the previous combining sequence may not be accessible in the current
    rendering context.

    If you really want to add style to the diacritics only, it is not in
    Unicode that you must look for a solution, but in the styling or
    tagging language itself (although defining such a style rule would be
    extremely tricky, and doing it with intermediate tags does not
    conform to the W3C recommendation of separating text from styles).
    So that's an interesting question to submit to the W3C for its CSS
    specification... I think that Unicode will not allow you to define
    anything else.

    For now you can use a conforming solution consisting of HTML code
    like this (here to render the circumflex above the a in red):

            a<span style="position: relative; left: -6pt; color: red;">&nbsp;&#x302;</span>

    or better with a style sheet:

            <style><!--
            .diac-red {position: relative; left: -6pt; color: red;}
            --></style>
            ...
            a<span class="diac-red">&nbsp;&#x302;</span>

    This code does not contain any defective sequence, and treats the
    diacritic as a separate graphic unit (which it really is, if you
    need a style to detach it from the regular text).
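
    If such markup has to be produced for many letters, it could be
    generated by a small helper. The following Python sketch is only an
    illustration: the function name style_diacritic and its parameters
    are hypothetical, and the class name diac-red is taken from the
    style sheet example above. It gives the combining mark a no-break
    space as its base, so the span never starts with a defective
    sequence.

```python
import html
import unicodedata


def style_diacritic(base, mark, css_class="diac-red"):
    """Build markup where `mark` sits on a no-break space in its own span.

    Hypothetical helper: `mark` must be a single combining character.
    """
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("mark must be a combining character")
    # The base text is escaped; the mark is written as a numeric
    # character reference so that it stays visible in the source.
    return (
        f'{html.escape(base)}<span class="{css_class}">'
        f"&nbsp;&#x{ord(mark):X};</span>"
    )


snippet = style_diacritic("a", "\u0302")
print(snippet)
```

    The horizontal offset needed to bring the mark back over the base
    letter depends on the font, so it is left to the style sheet.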


    This archive was generated by hypermail 2.1.5 : Sat Dec 06 2003 - 20:11:12 EST