RE: [indic] Unicode Processing Requirements for Tamil (was: 28th IUC paper - Tamil Unicode New)

From: Peter Constable
Date: Thu Sep 01 2005 - 19:13:39 CDT

    > From: [] On Behalf Of
    > Richard Wordingham

    > What should one do to get superscript (and ideally also subscript) digits
    > supported in Tamil text? Section 9.6 Paragraph 2 of the Unicode Standard
    > (from 4.0) says...

    How obvious! Latin digits are part of the Tamil script. How could we have missed that?

    > However, combinations such as பெ⁴ௗ /bhau/ U+0BAA U+0BC6 U+2074 U+0BD7 and
    > பெ₄ௗ /bhau/ U+0BAA U+0BC6 U+2084 U+0BD7 do not render properly on Windows
    > XP - the dotted circle appears before the final element of the compound
    > vowel.

    Sure, because (unless you happen to notice this bit of text buried in the standard), the Latin superscript digits are treated as *not* being part of the same script run, and so the cluster is broken, etc.
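
    To make the script-run point concrete, here is a minimal sketch in Python. It is not how any shipping shaping engine is written; `script_of` is a hypothetical stand-in for a real Unicode Script property lookup and simply treats the Tamil block as "Tamil" and everything else as "Common". `naive_script_runs` is likewise an illustrative name.

```python
import unicodedata

def script_of(ch):
    # Assumption: toy lookup, not real Script property data.
    # Tamil block (U+0B80..U+0BFF) -> "Tamil", anything else -> "Common".
    return "Tamil" if 0x0B80 <= ord(ch) <= 0x0BFF else "Common"

def naive_script_runs(text):
    """Split text into runs, breaking wherever the toy script value changes."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and runs[-1][0] == script:
            # Same script as the previous character: extend the current run.
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            # Script changed: start a new run.
            runs.append((script, ch))
    return runs

# The sequence from the quoted message: PA, VOWEL SIGN E, SUPERSCRIPT FOUR, AU LENGTH MARK
seq = "\u0BAA\u0BC6\u2074\u0BD7"
for script, run in naive_script_runs(seq):
    print(script, [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in run])
```

    Under these assumptions the superscript digit lands in a run of its own, so U+0BD7 starts a new Tamil run with no base consonant in front of it, which is the situation in which a dotted circle would be shown.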

    > How would you recommend the Unicode Standard be strengthened so that
    > Microsoft feels obliged to support the superscripts and subscripts in
    > combination with non-conjoined following vowels?

    I don't think making the Standard stronger is an issue here. It's more a matter of users identifying a need, and the intended behaviour being clear to implementers.

    On the first point, you have now brought this to our attention. However, users have been working with our Tamil implementation ever since the Windows 2000 beta (six? years), and nobody mentioned this until it was brought up now by (IIUC) a casual user of Tamil, so it's not obvious to me that supporting this should be a particular priority. I'd want to know that regular users of Tamil are significantly impacted.

    On the second point, I'd want to see samples of this shown in running text so that I can see how it's really used. And then there's the matter of encoded representation, which the Standard really doesn't clarify. You suggested sequences of the form

    < 0BAA, 0BC6, 2074, 0BD7 >

    i.e.

    < cons, pre-matra, sup_digit, post-matra >

    But it seems to me that should really be

    < cons, sup_digit, matras... >
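
    As a rough illustration of the difference between the two orders, the sketch below rewrites a string so that a superscript or subscript digit sits directly after the consonant, ahead of any Tamil vowel signs. This is only an assumption about what such a convention could look like, since the Standard does not specify it; `SUP_SUB_DIGITS`, `is_tamil_mark` and `digit_after_consonant` are hypothetical names, and the digit set is limited to 2, 3 and 4.

```python
import unicodedata

# Assumption: only superscript/subscript 2, 3 and 4 are considered.
SUP_SUB_DIGITS = set("\u00B2\u00B3\u2074\u2082\u2083\u2084")

def is_tamil_mark(ch):
    """True for combining marks (general category M*) in the Tamil block."""
    return 0x0B80 <= ord(ch) <= 0x0BFF and unicodedata.category(ch).startswith("M")

def digit_after_consonant(text):
    """Move each super/subscript digit back over any Tamil marks that precede it."""
    out = []
    for ch in text:
        if ch in SUP_SUB_DIGITS:
            # Find the position just after the base: skip back over Tamil marks.
            j = len(out)
            while j > 0 and is_tamil_mark(out[j - 1]):
                j -= 1
            out.insert(j, ch)
        else:
            out.append(ch)
    return "".join(out)

suggested = "\u0BAA\u0BC6\u2074\u0BD7"          # < cons, pre-matra, sup_digit, post-matra >
reordered = digit_after_consonant(suggested)    # < cons, sup_digit, matras... >
print(" ".join(f"{ord(c):04X}" for c in reordered))
```

    With the digit placed first, the two parts of the vowel sign stay adjacent in the encoded text instead of being split by a character from outside the Tamil block.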

    There's also the question of how many of the digits are needed, but I gather it's just 2, 3 and 4 (to fill out the four-way contrast for a given point of articulation).

    > I think mention of subscripts should be added,

    Stop right there. If this is your invention, I'm not interested. Provide evidence of a user community before you ask for subscripts.

    So, the long and short as far as MS is concerned is (i) we're aware of a potential need, (ii) we've nothing to indicate that there's much user demand and that this needs to be a priority, and (iii) clarification of the encoding spec would be needed before we could consider any change.

    Peter Constable
