Restrictions on base characters of variation sequences (L2/20‑244)

L2/20-247
Date/Time: Mon Sep 28 14:35:19 CDT 2020
Name: Charlotte Buff
Subject: Restrictions on base characters of variation sequences (L2/20‑244)
Document L2/20‑244 (Lindenberg, “Variation sequences for combining marks”)
proposes that the definition of variation sequences be changed so that all
characters that ⓐ have a canonical combining class of 0 and ⓑ do not
canonically decompose may serve as base characters to which variation
selectors can be applied. However, this new definition – and in fact also
the present definition used by the Unicode Standard – is too loose: even
within the specified restrictions, there remain a number of characters
that would cause normalisation‐related problems if they supported variation
selectors, namely those that are the trailing codepoints in other codepoints’
(reversible) canonical decompositions.

Consider for example U+09BE ◌া BENGALI VOWEL SIGN AA, which is a spacing
mark with ccc=0 that does not decompose and would therefore theoretically
allow for standardised variants under both the old and the new rules.
However, if U+09BE occurred directly after U+09C7 ◌ে BENGALI VOWEL SIGN
E, normalisation forms C and KC would combine them into U+09CB ◌ো BENGALI
VOWEL SIGN O. If U+09BE had had an accompanying variation selector, that
selector would now follow U+09CB instead, forming an invalid sequence.
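
The effect can be reproduced with a short Python sketch using the standard
unicodedata module. U+FE00 VARIATION SELECTOR-1 is chosen here purely for
illustration; no such variation sequence is actually defined, and the
variable names are likewise only illustrative.

```python
import unicodedata

# Hypothetical sequence: VOWEL SIGN E, VOWEL SIGN AA, then a variation
# selector intended for VOWEL SIGN AA. VS1 is an arbitrary choice; no such
# variation sequence exists in the standard.
VS1 = "\uFE00"
original = "\u09C7" + "\u09BE" + VS1

# NFC composes U+09C7 + U+09BE into U+09CB and leaves the selector in place.
composed = unicodedata.normalize("NFC", original)

before = [f"U+{ord(c):04X}" for c in original]
after = [f"U+{ord(c):04X}" for c in composed]

print(before)  # ['U+09C7', 'U+09BE', 'U+FE00']
print(after)   # ['U+09CB', 'U+FE00']
```

After normalisation the selector directly follows the composed character, so
the sequence no longer contains the base character it was defined for.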

As of Unicode 13.0.0, the following codepoints are affected by this issue:

U+09BE BENGALI VOWEL SIGN AA
U+09D7 BENGALI AU LENGTH MARK
U+0B3E ORIYA VOWEL SIGN AA
U+0B56 ORIYA AI LENGTH MARK
U+0B57 ORIYA AU LENGTH MARK
U+0BBE TAMIL VOWEL SIGN AA
U+0BD7 TAMIL AU LENGTH MARK
U+0CC2 KANNADA VOWEL SIGN UU
U+0CD5 KANNADA LENGTH MARK
U+0CD6 KANNADA AI LENGTH MARK
U+0D3E MALAYALAM VOWEL SIGN AA
U+0D57 MALAYALAM AU LENGTH MARK
U+0DCF SINHALA VOWEL SIGN AELA-PILLA
U+0DDF SINHALA VOWEL SIGN GAYANUKITTA
U+102E MYANMAR VOWEL SIGN II
U+1161..U+1175 HANGUL JUNGSEONG A..HANGUL JUNGSEONG I
U+11A8..U+11C2 HANGUL JONGSEONG KIYEOK..HANGUL JONGSEONG HIEUH
U+1B35 BALINESE VOWEL SIGN TEDUNG
U+11127 CHAKMA VOWEL SIGN A
U+1133E GRANTHA VOWEL SIGN AA
U+11357 GRANTHA AU LENGTH MARK
U+114B0 TIRHUTA VOWEL SIGN AA
U+114BA TIRHUTA VOWEL SIGN SHORT E
U+114BD TIRHUTA VOWEL SIGN SHORT O
U+115AF SIDDHAM VOWEL SIGN AA
U+11930 DIVES AKURU VOWEL SIGN AA
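
A list of this kind can be approximately recomputed from the character
database. The following is a minimal sketch, assuming Python's unicodedata
module; the function name trailing_composables is hypothetical. The result
depends on the Unicode version bundled with the interpreter
(unicodedata.unidata_version), so it may contain fewer or more codepoints
than the Unicode 13.0.0 list above. The Hangul ranges are added by hand
because Hangul syllables decompose algorithmically rather than through
per-character decomposition mappings.

```python
import sys
import unicodedata

def trailing_composables():
    """Return codepoints that have ccc=0, do not canonically decompose, and
    are the second codepoint of a reversible canonical decomposition."""
    affected = set()
    for cp in range(sys.maxunicode + 1):
        fields = unicodedata.decomposition(chr(cp)).split()
        # Skip characters without a decomposition, compatibility mappings
        # (tagged "<...>"), and singleton decompositions.
        if len(fields) != 2 or fields[0].startswith("<"):
            continue
        first, second = (chr(int(f, 16)) for f in fields)
        # Keep only reversible pairs, i.e. those that NFC recomposes
        # (this drops composition-excluded characters).
        if unicodedata.normalize("NFC", first + second) != chr(cp):
            continue
        if unicodedata.combining(second) == 0 and not unicodedata.decomposition(second):
            affected.add(ord(second))
    # Hangul syllables are composed algorithmically: medial vowels
    # U+1161..U+1175 and final consonants U+11A8..U+11C2 are trailing.
    affected.update(range(0x1161, 0x1175 + 1))
    affected.update(range(0x11A8, 0x11C2 + 1))
    return affected

if __name__ == "__main__":
    for cp in sorted(trailing_composables()):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Marks such as U+0C56 TELUGU AI LENGTH MARK or the various nukta signs do
not appear in the output because their canonical combining class is not 0.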

This list may expand in the future as new canonically decomposable
characters are encoded. However, existing characters cannot become affected
in a later version of the standard because a new character decomposing into
already assigned codepoints would automatically be composition-excluded.

Regardless of whether the rules for variation sequences are changed, the
aforementioned characters must be barred from receiving standardised
variants, either implicitly (by simply never defining variants for them) or
explicitly (by changing the wording of section 23.4 of the core standard to
specifically exclude them and any future characters like them).