Criteria for Disunifying Diacritics L2/05-172

Author: Asmus Freytag
Date: 2005-08-01
Action: For review at UTC#104
Status: Proposed as possible liaison contribution from UTC to WG2

Summary

This document presents a proposed set of criteria for deciding when to encode separate script-specific diacritics. These criteria are proposed to be added to the WG2 document on Principles and Procedures. In making the case for the proposed criteria, this document provides additional background information, as well as a discussion of the purpose of the 'common' diacritical marks and a very brief discussion of security-related issues as they apply to the encoding of diacritics.

Background

The Unicode Standard states about the Combining Diacritical Marks block: “The combining diacritical marks in this block are intended for general use with any script.” This text is sometimes misread as a normative directive that these diacritical marks should be used with all scripts.

It is indeed true that in the Unicode Standard (and 10646) every diacritic mark may be applied to any base character, but this does not imply, or require, that such application lead to a graphically meaningful result, or that any particular combination of base character and diacritic will be supported by applications. It is merely a general principle about the use of the standard, namely that such sequences are not illegal as they would be in ISO 6937.

Note that Unicode/10646 has encoded many script-specific combining marks, even where they bear a superficial graphical similarity to a generic diacritical mark. Recently encoded examples include U+0659 ARABIC ZWARAKAY, U+065A ARABIC VOWEL SIGN SMALL V ABOVE, U+065B ARABIC VOWEL SIGN INVERTED SMALL V ABOVE, and U+065C ARABIC VOWEL SIGN DOT BELOW, which were deliberately disunified from U+0304 COMBINING MACRON, U+030C COMBINING CARON, U+0302 COMBINING CIRCUMFLEX ACCENT, and U+0323 COMBINING DOT BELOW, respectively.

U+135F ETHIOPIC COMBINING GEMINATION MARK was not unified with U+0308 COMBINING DIAERESIS. Similarly, U+05C5 HEBREW MARK LOWER DOT was not unified with U+0323 COMBINING DOT BELOW.

And the long-encoded U+05B5 HEBREW POINT TSERE, U+0738 SYRIAC DOTTED ZLAMA HORIZONTAL, as well as the more recently encoded U+0CBC KANNADA SIGN NUKTA, have all been distinguished from U+0324 COMBINING DIAERESIS BELOW despite the fact that in appearance they simply consist of two side-by-side dots placed below a character. This superficial graphic similarity was not considered sufficient reason to justify a unification.
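The distinction between these look-alike marks can be inspected programmatically. The following is a minimal sketch, assuming Python's standard unicodedata module, whose values reflect whichever version of the Unicode Character Database ships with the interpreter. The helper names describe and describe_all are illustrative only and are not part of any standard.

```python
import unicodedata

# Code points named in the paragraph above: three script-specific marks
# and the generic mark they were not unified with.
TWO_DOTS_BELOW = [0x05B5, 0x0738, 0x0CBC, 0x0324]


def describe(code_point):
    """Return (U+XXXX label, character name, canonical combining class)."""
    ch = chr(code_point)
    label = f"U+{code_point:04X}"
    return (label, unicodedata.name(ch), unicodedata.combining(ch))


def describe_all(code_points):
    """Describe each code point in order (illustrative helper)."""
    return [describe(cp) for cp in code_points]


if __name__ == "__main__":
    for label, name, ccc in describe_all(TWO_DOTS_BELOW):
        print(f"{label}  ccc={ccc:<3}  {name}")
```

The combining classes reported for the Hebrew point and the Kannada nukta differ from that of the generic mark, which is one example of the kind of property difference that a unification would have obscured.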

Examining these cases leads to the formulation of the following criteria, which should be included in WG2's Principles and Procedures document.

Criteria for Disunifying Combining Diacritics

A number of criteria may be considered when deciding whether a proposed combining diacritical mark for a particular script should be unified with an existing encoded combining diacritical mark. Among criteria which would favor a decision to disunify when encoding are:

the mark has been borrowed from another script, but has been significantly modified to fit with the ductus of the borrowing script
the mark forms part of a set of marks in the script (for example a set of tone marks), but only some members of the set can be unified with existing marks
the mark has a specific function unrelated to the generic diacritical mark (e.g. use of the mark as a vowel sign as opposed to the use of a similar-shaped mark as a diacritic). In such a case the two uses might also require explicit differences in their character properties.
the range of glyphic appearance is markedly different from the generic diacritical mark
the layout behavior is different and requires different support

The more of these criteria are satisfied, and the stronger the degree to which each is satisfied, the stronger the case for encoding a script-specific diacritical mark. This is not a matter of a rule that deterministically yields a 'yes/no' decision, but is rather a question of 'degree', which can then form a basis for a proper judgment of the encoding question.

In general, these criteria are not much different from those for assigning script-specific punctuation in particular, or for disunifying characters in general.

Purpose of the 'Common' Combining Diacritical Marks

The combining diacritical marks in the two blocks in the standard for generic combining marks are primarily for use with the European scripts derived from Greek -- including Latin and Cyrillic. These scripts are alphabets, and share a general typographical model, including the common application of diacritic marks to indicate accents and pronunciation modifications of letters. Adaptations of these scripts for specialized notational systems (e.g., phonetic alphabets or Western mathematics) and for orthographies of non-European languages, also make heavy use of these combining diacritics.

While other scripts, notably the Arabic script, also make very heavy use of diacritical marks and other kinds of annotation marks, scripts not directly derived from Greek have their own history of diacritic development. Therefore the various dots and other marks used with them cannot automatically be identified with "common" combining diacritical marks based on graphic form alone. The exception to this rule would be a case where a common diacritical mark has been explicitly borrowed (usually from the Latin script) for use in an unrelated script.

In the general case it would be wrong to presume that the application of common diacritical marks will make much sense (or be reasonably supported by applications) when applied to characters from different typographical traditions, such as CJK ideographs or Sumero-Akkadian syllables, for example. The determination of what are "reasonable" combinations should be guided, in large part, by established typographical practice for each script. The proposed criteria are intended to help guide this decision making process.

Security Issues

A concern has been raised that disunifying characters introduces additional possibilities for creating strings that look confusably similar or even identical, but contain different character codes. The process of using a look-alike string to trick users into revealing passwords etc. is called 'spoofing.' On the one hand, limiting disunifications that are based on function, not appearance, tends to limit the possibilities for spoofing. On the other hand, the number of confusably similar looking, but otherwise not unifiable characters in Unicode and 10646 is already very large.

This means that while security concerns must be duly considered when deciding to disunify characters, a blanket prohibition on encoding characters, in particular diacritics, that could be confused by some users with already existing characters is not very useful.

On the contrary, judicious encoding of script-specific diacritical marks could actually be helpful, as it allows a security-conscious implementation to insist that a string be composed solely of characters from a given script. If a mixed-script string is encountered (a very common spoofing strategy), it could then be either flagged or disallowed. Any use of script-specific diacritical marks with, say, Latin letters could then be positively identified as attempted spoofing, and appropriate security measures taken.
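The single-script check described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the SCRIPT_RANGES table is hypothetical and heavily abbreviated, and the function names script_of, scripts_in, and is_single_script are invented for this example. A real implementation would rely on the Script property of the Unicode Character Database rather than a hand-written table.

```python
# Hypothetical, heavily abbreviated script table for illustration only.
# A real implementation would use the Script property from the Unicode
# Character Database instead of a handful of hand-picked ranges.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),      # A-Z
    (0x0061, 0x007A, "Latin"),      # a-z
    (0x0300, 0x036F, "Inherited"),  # generic combining diacritical marks
    (0x05D0, 0x05EA, "Hebrew"),     # Hebrew letters
    (0x0621, 0x063A, "Arabic"),     # a range of Arabic letters
    (0x0656, 0x065F, "Arabic"),     # script-specific Arabic marks
]


def script_of(ch):
    """Look up the (assumed) script of one character; 'Unknown' if absent."""
    cp = ord(ch)
    for lo, hi, script in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return script
    return "Unknown"


def scripts_in(text):
    """Set of scripts used in text; generic marks take no script of their own."""
    return {script_of(ch) for ch in text} - {"Inherited"}


def is_single_script(text):
    """True if every character is known and belongs to at most one script."""
    found = scripts_in(text)
    return "Unknown" not in found and len(found) <= 1


if __name__ == "__main__":
    samples = {
        "Latin a + U+0323 COMBINING DOT BELOW": "a\u0323",
        "Latin a + U+065C ARABIC VOWEL SIGN DOT BELOW": "a\u065C",
        "Arabic beh + U+065C ARABIC VOWEL SIGN DOT BELOW": "\u0628\u065C",
    }
    for label, text in samples.items():
        verdict = "ok" if is_single_script(text) else "flag as mixed-script"
        print(f"{label}: {verdict}")
```

In this sketch the generic mark is treated as inheriting the script of its base letter, so it passes with any script, while the script-specific Arabic mark on a Latin base is detected as a mixed-script sequence. Characters missing from the toy table are conservatively treated as failures.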

For further details see UTS #36, Unicode Security Considerations.