L2/08-220

Date: Tue, 13 May 2008
Source: Mark Davis
Subject: Deprecated character proposal


=====

Peter and I had the following action:
 

B.14.2  Deprecated characters [Davis, L2/08-018]

[114-A106] Action Item for Mark Davis, Peter Edberg: Prepare a proposal on deprecated characters for the next UTC meeting. See L2/08-018.

The proposal is to post the following (with any amendments from the meeting) plus Table 1 as a PRI (that is, excluding Tables 2 and 3).

====

The characters listed in Table 1 below are discouraged or strongly discouraged in the Unicode Standard, either in the text or in the charts. The mechanism for making that status known to implementers is the Deprecated property, so the proposal is to add these characters to the set that has that property. If, after discussion and review, any of these should not be given the Deprecated property, then the wording about their being discouraged should be removed from the text of the Standard and from the charts.

As part of this proposal, we would add text that makes the following points more clearly.

Table 1. Characters to be given the Deprecated property

The list of proposed characters is broken down according to their status vis-a-vis NFC and NFKC.

Allowed by both NFC and NFKC:

0953 ( ॓ ) DEVANAGARI GRAVE ACCENT
0954 ( ॔ ) DEVANAGARI ACUTE ACCENT
0F07 ( ༇ ) TIBETAN MARK YIG MGO TSHEG SHAD MA
17A4 ( ឤ ) KHMER INDEPENDENT VOWEL QAA
17D8 ( ៘ ) KHMER SIGN BEYYAL
20A4 ( ₤ ) LIRA SIGN

Allowed by NFC but not NFKC:

0149 ( ʼn ) LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
0F77 TIBETAN VOWEL SIGN VOCALIC RR
0F79 TIBETAN VOWEL SIGN VOCALIC LL

Allowed by neither NFC nor NFKC:

0344 ( ̈́ ) COMBINING GREEK DIALYTIKA TONOS
037E ( ; ) GREEK QUESTION MARK
0387 ( · ) GREEK ANO TELEIA
0F73 TIBETAN VOWEL SIGN II
0F75 TIBETAN VOWEL SIGN UU
0F81 TIBETAN VOWEL SIGN REVERSED II
2126 ( Ω ) OHM SIGN
212A ( K ) KELVIN SIGN
212B ( Å ) ANGSTROM SIGN
2329 ( 〈 ) LEFT-POINTING ANGLE BRACKET
232A ( 〉 ) RIGHT-POINTING ANGLE BRACKET
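
The three groupings above can be checked mechanically. The following is a minimal Python sketch, assuming the standard unicodedata module; the helper name nf_status is hypothetical, and "allowed" is approximated here as "the single character is unchanged by the normalization form". The results depend on the Unicode data bundled with the running Python.

```python
import unicodedata

# Code points proposed in Table 1, copied from the lists above.
TABLE_1 = [
    0x0953, 0x0954, 0x0F07, 0x17A4, 0x17D8, 0x20A4,
    0x0149, 0x0F77, 0x0F79,
    0x0344, 0x037E, 0x0387, 0x0F73, 0x0F75, 0x0F81,
    0x2126, 0x212A, 0x212B, 0x2329, 0x232A,
]

def nf_status(cp):
    """Report which normalization forms leave the single character unchanged."""
    ch = chr(cp)
    in_nfc = unicodedata.normalize("NFC", ch) == ch
    in_nfkc = unicodedata.normalize("NFKC", ch) == ch
    if in_nfc and in_nfkc:
        return "both NFC and NFKC"
    if in_nfc:
        return "NFC but not NFKC"
    if in_nfkc:
        return "NFKC but not NFC"
    return "neither NFC nor NFKC"

for cp in TABLE_1:
    print(f"{cp:04X}  {nf_status(cp)}")
```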

For comparison, the Deprecated characters as of U5.1 are:

U+0340 ( ̀ ) COMBINING GRAVE TONE MARK
U+0341 ( ́ ) COMBINING ACUTE TONE MARK
U+17A3 ( ឣ ) KHMER INDEPENDENT VOWEL QAQ
U+17D3 ( ៓ ) KHMER SIGN BATHAMASAT
U+206A (  ) INHIBIT SYMMETRIC SWAPPING
U+206B (  ) ACTIVATE SYMMETRIC SWAPPING
U+206C (  ) INHIBIT ARABIC FORM SHAPING
U+206D (  ) ACTIVATE ARABIC FORM SHAPING
U+206E (  ) NATIONAL DIGIT SHAPES
U+206F (  ) NOMINAL DIGIT SHAPES
U+E0001 (  ) LANGUAGE TAG
U+E0020 (  ) TAG SPACE
...
U+E007F (  ) CANCEL TAG
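
For implementers, the property is consumed from the character database rather than from prose. The sketch below assumes the "code ; Property # comment" line layout of the UCD file PropList.txt; the sample lines mirror the list above and are not a verbatim copy of any released file, and the function name parse_property is hypothetical.

```python
# Illustrative lines in the "code ; Property # comment" layout.
# They mirror the U5.1 list above; they are NOT copied from a data file.
SAMPLE = """\
0340..0341    ; Deprecated # COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
17A3          ; Deprecated # KHMER INDEPENDENT VOWEL QAQ
17D3          ; Deprecated # KHMER SIGN BATHAMASAT
206A..206F    ; Deprecated # INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
E0001         ; Deprecated # LANGUAGE TAG
E0020..E007F  ; Deprecated # TAG SPACE..CANCEL TAG
"""

def parse_property(text, prop):
    """Return the set of code points that carry the named property."""
    result = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        span, name = (field.strip() for field in data.split(";"))
        if name != prop:
            continue
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo    # single code point or range
        result.update(range(lo, hi + 1))
    return result

deprecated = parse_property(SAMPLE, "Deprecated")
print(len(deprecated))
```

Adding the Table 1 characters would then amount to adding further lines of the same shape.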

===================================================

Table 2. Current text on 'deprecated'

We currently have the following text on deprecated:

===================================================

Table 3. Remarks from Lofting

Appended is some feedback from Peter Lofting for consideration regarding the Tibetan characters.
 
[0] History of deprecation
Discouragement notices have been in place in the Tibetan Unicode charts since version 3.0:
(a) for di-graphs the notice: "use of this character is discouraged"
(b) for tri-graphs the notice: "use of this character is strongly discouraged"
 
[1] head letter
0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
I would like to know the basis for deprecating this head mark. 
It exists in documents and a canonical decomposition is not possible. Why single it out?
 
[2] Sanskrit accents
0F73 TIBETAN VOWEL SIGN II -> canon decomp 0F71 0F72
0F75 TIBETAN VOWEL SIGN UU -> canon decomp 0F71 0F74
0F77 TIBETAN VOWEL SIGN VOCALIC RR -> compat decomp 0FB2 0F81
0F79 TIBETAN VOWEL SIGN VOCALIC LL -> compat decomp 0FB3 0F81
0F81 TIBETAN VOWEL SIGN REVERSED II -> canon decomp 0F71 0F80
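
The mappings listed above can be read back from the character database. This is a minimal sketch, assuming Python's unicodedata module; the helper name decomp is hypothetical, and the output reflects the Unicode data bundled with the running Python.

```python
import unicodedata

def decomp(cp):
    """Return (kind, [hex code points]) for a character's decomposition mapping."""
    parts = unicodedata.decomposition(chr(cp)).split()
    # A leading tag such as <compat> marks a compatibility mapping.
    kind = "compat" if parts and parts[0].startswith("<") else "canon"
    codes = [p for p in parts if not p.startswith("<")]
    return kind, codes

for cp in (0x0F73, 0x0F75, 0x0F77, 0x0F79, 0x0F81):
    kind, codes = decomp(cp)
    print(f"{cp:04X} {unicodedata.name(chr(cp))} -> {kind} decomp {' '.join(codes)}")
```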
 
Canonical decomposition is not the only relationship that these characters have. The Tibetan double vowel marks in the list are used to represent Sanskrit transliterated into Tibetan, and they allow such text to be distinguished from Tibetan contraction sequences for both shaping and semantic processing. This is an important function, and these code points are therefore not redundant.

They also map 1:1 to Sanskrit vowels in the Indic code pages, e.g. 0F73 TIBETAN VOWEL SIGN II --> 0908 DEVANAGARI LETTER II

=====

I would also add that the selection of these candidates is consistent only with their awkwardness for shaping machinery, rather than with a consistent application of Occam's razor. There are other less useful, and even plain wrong, code points in the block that could much more reasonably be decomposed or deprecated, but they appear to have escaped attention because they are "well behaved" complete precomposed stacks. For example:

0F00 TIBETAN SYLLABLE OM --> decomp 0F68  0F7C  0F7E
This one is in as a 1:1 mapping to Devanagari OM at 0950, but can be decomposed without loss of representation. It is just a precomposed display form.
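
Note that the decomposition shown for 0F00 is a suggestion, not a mapping in the character database. A quick check, assuming Python's unicodedata module reflects the published data (the variable names are illustrative only):

```python
import unicodedata

om = "\u0F00"                   # TIBETAN SYLLABLE OM
spelled = "\u0F68\u0F7C\u0F7E"  # the sequence suggested above

# An empty string means the character has no decomposition mapping.
has_mapping = unicodedata.decomposition(om) != ""

# With no mapping, the two spellings are not canonically equivalent.
equivalent = (unicodedata.normalize("NFD", om)
              == unicodedata.normalize("NFD", spelled))

print(has_mapping, equivalent)
```

As things stand, text using 0F00 and text using the spelled-out sequence compare as different strings even after normalization.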

0F02 TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA  --> decomp 0F60 0F74 0F7F  0F82 
This and 0F03 are only two instances of an open-ended class of many Terma head marks. They only make sense in the encoding as generic placeholders for a whole set of marks that could then be represented with variation selector sequences using these two base characters.

If a terma mark were encoded then 0F03 could also be decomposed.
Depending on scholarly input, Terma mark might be represented as a display variant of 0F82, in which case...
0F03 TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA --> decomp 0F60 0F74 0F7F  <0F82 terma mark>

In the plain wrong department, the 'Digits minus half' section needs correcting. It should read "Half Digits" or some such.

There are 2 key problems:
(i) The slash divides the value in half; it does NOT subtract one half.
(ii) The slash can apply to multi-digit sequences, e.g. 108<slashed> = 54, etc.
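
The difference between the two readings can be made concrete. The sketch below assumes Python's unicodedata module for the numeric values currently assigned to the ten half digits (0F2A..0F33); the function names halved and ucd_value are hypothetical.

```python
import unicodedata
from fractions import Fraction

def halved(value):
    """Reading in (i) and (ii): the slash halves the whole number."""
    return Fraction(value, 2)

def ucd_value(ch):
    """Numeric value the character database assigns to one character."""
    return Fraction(unicodedata.numeric(ch)).limit_denominator(16)

print(halved(108))  # the multi-digit example from (ii)

# Compare both readings for the encoded half digits.
for cp in range(0x0F2A, 0x0F34):
    ch = chr(cp)
    digit = (cp - 0x0F2A + 1) % 10  # HALF ONE .. HALF NINE, then HALF ZERO
    print(f"{cp:04X} {unicodedata.name(ch)}: "
          f"database {ucd_value(ch)}, halved {halved(digit)}")
```

The two columns agree only for HALF ONE, which is where the "minus one half" and "divided by two" readings happen to coincide.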

As the character names stand, they are not wrong, since they say HALF FIVE etc. rather than FIVE MINUS ONE HALF. The exception is 0F33 TIBETAN DIGIT HALF ZERO, which is a "divide by zero" error that gives the lie to the bad semantic definition. Depending on how this mess is cleaned up, these could be candidates for deprecation. The right way to represent these cases is with a separate combining slash mark. These half digits could then be deprecated as display forms. But again, a combining slash of variable scope is awkward for both shaping and computation, which is why the corrupted definition was invented in the first place: to avoid such awkward shaping and processing requirements. I expect it will take another 5 to 10 years for other code points in the combining marks block to force this kind of mechanism into being, at which point this can be corrected "at no extra cost" to implementers.