L2/18-054

Title: The two ways to represent Tamil Shri
Author: Roozbeh Pournader
Date: 2018-01-24

Proposal
========
1. Document in the Core Spec that two different sequences are used to represent
the Tamil word Shri, and conforming applications need to render them
identically and treat them as practically equivalent.

2. Consider the implications of this for identifier- and security-related
specifications, such as UAX #31, UTR #36, UTS #39, and UTS #46.  Recommend
solutions and/or document this issue in those specifications.

3. Update UTS #10 and/or CLDR collation data, so that the sequences would be
sorted next to each other.

Background
==========
There are two ways to represent the Tamil word Shri in Unicode.  On the latest
versions of platforms such as Android and Windows, both result in exactly the
same output:

U+0BB6, U+0BCD, U+0BB0, U+0BC0
SHA,    VIRAMA, RA,     II
ஶ ◌் ர ◌ீ = ஶ்ரீ

U+0BB8, U+0BCD, U+0BB0, U+0BC0
SA,     VIRAMA, RA,     II
ஸ ◌் ர ◌ீ = ஸ்ரீ

The first one is the sequence sanctioned by Unicode, while the second one is an
older representation still very commonly used on the internet.  The Tamil FAQ
at <http://www.unicode.org/faq/tamil.html#12> says:

  Q: What is the mapping for TSCII grantha ligature 0x82 SRI?

  A: Prior to Unicode 4.1, the best mapping is to the sequence <U+0BB8, U+0BCD,
  U+0BB0, U+0BC0>. Unicode 4.1 in 2005 added the character U+0BB6 TAMIL LETTER
  SHA and as a consequence, the mapping should be updated to <U+0BB6, U+0BCD,
  U+0BB0, U+0BC0>.
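
The two sequences differ only in their first code point, and none of the
characters involved has a decomposition mapping, so no Unicode normalization
form unifies them.  The following Python sketch illustrates this using only the
standard unicodedata module.  The names SHA_SHRI, SA_SHRI, describe, and
equal_under are illustrative and appear in no specification.

```python
import unicodedata

# The two representations of Tamil Shri discussed above.
SHA_SHRI = "\u0BB6\u0BCD\u0BB0\u0BC0"  # SHA, VIRAMA, RA, II (sanctioned)
SA_SHRI = "\u0BB8\u0BCD\u0BB0\u0BC0"   # SA,  VIRAMA, RA, II (older)


def describe(seq):
    """List each code point of seq with its Unicode character name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in seq]


def equal_under(form):
    """Return True if the two sequences compare equal after normalizing."""
    return (unicodedata.normalize(form, SHA_SHRI)
            == unicodedata.normalize(form, SA_SHRI))


# Compare the sequences under every normalization form.
results = {form: equal_under(form) for form in ("NFC", "NFD", "NFKC", "NFKD")}

for line in describe(SHA_SHRI) + describe(SA_SHRI):
    print(line)
print(results)
```

Every entry in results is False, so a plain string comparison, even after
normalization, treats the two spellings as different words.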

A quick analysis based on a large sample of Google's web corpus shows that the
non-sanctioned sequence occurs on around 11.5% of publicly accessible Tamil
HTML pages, while the sanctioned sequence occurs on around 0.5% of them, a
ratio of about twenty to one. The same ratio holds for a sampling of HTML pages
in any language containing either sequence.

This means that the Unicode-sanctioned sequence has not gained enough traction,
even though it has been the recommended way to encode the word since 2005.
Unicode should recognize that both sequences are common representations,
and although one would be recommended, implementations should be prepared to
encounter the older representation and be able to treat them as equivalent.
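
One way an implementation might treat the two sequences as equivalent for
matching purposes is to fold the older sequence to the sanctioned one before
comparing strings.  The Python sketch below is a hypothetical illustration, not
a folding defined by Unicode: the function names are invented here, and the
direction of the fold (older to sanctioned) is an assumption that follows the
FAQ's recommendation.  It handles only the exact four-code-point sequences
shown in this document.

```python
# Hypothetical folding for comparison purposes; not defined by any Unicode
# specification.
SHA_SHRI = "\u0BB6\u0BCD\u0BB0\u0BC0"  # SHA, VIRAMA, RA, II (sanctioned)
SA_SHRI = "\u0BB8\u0BCD\u0BB0\u0BC0"   # SA,  VIRAMA, RA, II (older)


def fold_tamil_shri(text):
    """Replace every occurrence of the older sequence with the sanctioned one."""
    return text.replace(SA_SHRI, SHA_SHRI)


def shri_equivalent(a, b):
    """Compare two strings, ignoring which Shri representation each uses."""
    return fold_tamil_shri(a) == fold_tamil_shri(b)
```

For example, shri_equivalent(SA_SHRI, SHA_SHRI) returns True, while strings
that differ anywhere outside the Shri sequence still compare as different.  An
SA that is not followed by VIRAMA, RA, II is left unchanged.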