L2/08-251
Date/Time: Sun Jul 6 09:33:06 CDT 2008
Contact: mattias.ellert@fysast.uu.se
Name: Mattias Ellert
Report Type: Submission (FAQ, Tech Note, Case Study)
Opt Subject: Proposal for Latin Cyrillic disunification

Proposal for Latin Cyrillic disunification

In the first release of ISO 10646 / Unicode there were several cross-script unifications between Latin and Cyrillic. Over time most of these have been judged inappropriate, and several disunifications have been made.

In Unicode 1.1, Ӕ U+04D4, ӕ U+04D5, Ә U+04D8, ә U+04D9, Ӡ U+04E0, ӡ U+04E1, Ө U+04E8 and ө U+04E9 were disunified from Æ U+00C6, æ U+00E6, Ə U+018F, ə U+0259, Ʒ U+01B7, ʒ U+0292, Ɵ U+019F and ɵ U+0275 respectively. At that point, however, the Cyrillic letters were added with decompositions to the Latin letters in the Unicode data file. In Unicode 2.1.5 these decompositions were removed, completing the disunification.

In Unicode 5.1, Ԛ U+051A, ԛ U+051B, Ԝ U+051C and ԝ U+051D were disunified from Q U+0051, q U+0071, W U+0057 and w U+0077 respectively.

The reason for these cross-script unifications was to remove the possibility of duplicate encodings for the same (not so frequent) characters, which admittedly is a security concern. However, this possibility already exists in abundance for very frequent characters, e.g. A U+0041, Α U+0391 and А U+0410. Because of this, security-sensitive applications flag mixed-script usage as an error. For example, most web browsers display a mixed-script hostname component in its xn-- form. This causes difficulties when the mixed-script encoding is mandated by the standard because of a cross-script unification, since such exceptions are not easily codified and are usually completely overlooked by mainstream applications. This was also the reason why the last set of disunifications was approved in Unicode 5.1.

There are still a few, admittedly rare, characters that suffer from Latin Cyrillic cross-script unification.
These are unified in the opposite direction from the previous examples, i.e. the encoded Cyrillic character is used in a Latin context. The characters are:

LATIN CAPITAL LETTER TONE THREE
LATIN SMALL LETTER TONE THREE
LATIN CAPITAL LETTER TONE FOUR
LATIN SMALL LETTER TONE FOUR

These have been unified with З U+0417, з U+0437, Ч U+0427 and ч U+0447 respectively, because they share identical glyphs. However, in this case the argument for unification is very weak since, unlike in the previous examples, the letter as used in the Latin script does not represent the same or a similar phonetic value as its Cyrillic counterpart. Instead, these four letters form a set of tone characters together with the already encoded Ƨ U+01A7, ƨ U+01A8, Ƽ U+01BC and ƽ U+01BD. These eight characters are letterized versions (capital and small) of the numbers 2, 3, 4 and 5. That the numbers 3 and 4 resemble the Cyrillic З and Ч is really not a good reason for a cross-script unification (though it admittedly was handy in the days of lead typesetting, but that is another story).

If encoded, these letters should collate together with the already encoded members of the set:

01A8 ; [.1428.0020.0002.01A8] # LATIN SMALL LETTER TONE TWO
01A7 ; [.1428.0020.0008.01A7] # LATIN CAPITAL LETTER TONE TWO
XXXX ; [.1429.0020.0002.XXXX] # LATIN SMALL LETTER TONE THREE
XXXX ; [.1429.0020.0008.XXXX] # LATIN CAPITAL LETTER TONE THREE
XXXX ; [.142A.0020.0002.XXXX] # LATIN SMALL LETTER TONE FOUR
XXXX ; [.142A.0020.0008.XXXX] # LATIN CAPITAL LETTER TONE FOUR
01BD ; [.142C.0020.0002.01BD] # LATIN SMALL LETTER TONE FIVE
01BC ; [.142C.0020.0008.01BC] # LATIN CAPITAL LETTER TONE FIVE
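
As a rough illustration of the ordering these weights are meant to produce, the following Python sketch parses the eight entries above and builds simplified three-level sort keys (primary, then secondary, then tertiary). It is only a sketch and not an implementation of the Unicode Collation Algorithm. The function names parse_table and sort_key are hypothetical, and the letters are identified by name because the proposed ones still carry the placeholder code point XXXX.

```python
import re

# Entries copied from the collation table in this proposal. "XXXX" is the
# placeholder used there for the not-yet-assigned code points.
TABLE = """\
01A8 ; [.1428.0020.0002.01A8] # LATIN SMALL LETTER TONE TWO
01A7 ; [.1428.0020.0008.01A7] # LATIN CAPITAL LETTER TONE TWO
XXXX ; [.1429.0020.0002.XXXX] # LATIN SMALL LETTER TONE THREE
XXXX ; [.1429.0020.0008.XXXX] # LATIN CAPITAL LETTER TONE THREE
XXXX ; [.142A.0020.0002.XXXX] # LATIN SMALL LETTER TONE FOUR
XXXX ; [.142A.0020.0008.XXXX] # LATIN CAPITAL LETTER TONE FOUR
01BD ; [.142C.0020.0002.01BD] # LATIN SMALL LETTER TONE FIVE
01BC ; [.142C.0020.0008.01BC] # LATIN CAPITAL LETTER TONE FIVE
"""

# One table line: code point, three hexadecimal weights, a fourth field, name.
ENTRY = re.compile(
    r"^\w{4} ; \[\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.\w{4}\] # (.+)$"
)


def parse_table(table):
    """Map each character name to its (primary, secondary, tertiary) weights."""
    weights = {}
    for line in table.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        primary, secondary, tertiary, name = match.groups()
        weights[name] = (int(primary, 16), int(secondary, 16), int(tertiary, 16))
    return weights


def sort_key(names, weights):
    """Simplified multi-level key for a sequence of character names.

    All primary weights are compared first, then all secondary weights,
    then all tertiary weights, so case only matters as a tie-breaker.
    """
    elements = [weights[name] for name in names]
    return tuple(tuple(element[level] for element in elements) for level in range(3))


if __name__ == "__main__":
    weights = parse_table(TABLE)
    # Sort the single letters, starting from reversed table order.
    for name in sorted(reversed(list(weights)), key=lambda n: sort_key([n], weights)):
        print(name)
```

Under these weights the tone letters sort in numeric order (two, three, four, five), with the small letter before the capital letter of each pair, which is the grouping the proposal asks for.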