L2/18-054

Title: The two ways to represent Tamil Shri
Author: Roozbeh Pournader
Date: 2018-01-24

Proposal
========

1. Document in the Core Spec that two different sequences are used to represent the Tamil word Shri, and that conforming applications need to render them identically and treat them as practically equivalent.

2. Consider the implications of this for identifier- and security-related specifications, such as UAX #31, UTR #36, UTS #39, and UTS #46. Recommend solutions and/or document this issue in those specifications.

3. Update UTS #10 and/or CLDR collation data, so that the two sequences sort next to each other.

Background
==========

There are two ways to represent the Tamil word Shri in Unicode. On the latest versions of platforms such as Android and Windows, both result in exactly the same output:

  U+0BB6, U+0BCD, U+0BB0, U+0BC0
  SHA, VIRAMA, RA, II
  ஶ ◌் ர ◌ீ = ஶ்ரீ

  U+0BB8, U+0BCD, U+0BB0, U+0BC0
  SA, VIRAMA, RA, II
  ஸ ◌் ர ◌ீ = ஸ்ரீ

The first one is the sequence sanctioned by Unicode, while the second one is an older representation still very commonly used on the internet. The Tamil FAQ says:

  Q: What is the mapping for TSCII grantha ligature 0x82 SRI?

  A: Prior to Unicode 4.1, the best mapping is to the sequence . Unicode 4.1 in 2005 added the character U+0BB6 TAMIL LETTER SHA and as a consequence, the mapping should be updated to .

A quick analysis based on a large sample of Google's web corpus shows that the non-sanctioned sequence occurs on around 11.5% of publicly accessible Tamil HTML pages, while the sanctioned sequence occurs on around 0.5% of them, a ratio of about twenty to one. The ratio holds for a sampling of HTML pages in any language containing either sequence.

This means that the Unicode-sanctioned sequence has not gained enough traction, even though it has been the recommended way to encode the word since 2005.

Unicode should recognize that both sequences are common representations. Although one of them is recommended, implementations should be prepared to encounter the older representation and be able to treat the two as equivalent.
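
As an illustration only, the following Python sketch shows one way an implementation might treat the two sequences as equivalent when comparing strings: it folds the older SA-based sequence to the sanctioned SHA-based one before comparison. The names `fold_shri` and `shri_equivalent` are hypothetical and are not taken from any Unicode specification or library. The sketch assumes that only the full four-code-point sequence is folded, so that TAMIL LETTER SA remains distinct everywhere else.

```python
# Minimal sketch, not a normative algorithm: fold the older Tamil Shri
# sequence to the Unicode-sanctioned one before comparing strings.

SANCTIONED_SHRI = "\u0BB6\u0BCD\u0BB0\u0BC0"  # SHA, VIRAMA, RA, II
LEGACY_SHRI = "\u0BB8\u0BCD\u0BB0\u0BC0"      # SA, VIRAMA, RA, II


def fold_shri(text: str) -> str:
    """Replace every occurrence of the older sequence with the sanctioned one.

    Only the complete four-code-point sequence is replaced; a lone
    U+0BB8 TAMIL LETTER SA is left untouched (an assumption of this sketch).
    """
    return text.replace(LEGACY_SHRI, SANCTIONED_SHRI)


def shri_equivalent(a: str, b: str) -> bool:
    """Compare two strings, treating both Shri sequences as the same."""
    return fold_shri(a) == fold_shri(b)
```

A folding step of this kind could run before identifier comparison or search matching. Whether folding toward the sanctioned sequence is the right direction for a given specification is a design decision that this sketch does not settle.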