Re: Public Review Technical Report #33, "Unicode conformance model"

From: Philippe Verdy (
Date: Thu Jan 11 2007 - 20:10:09 CST

Section 4, on areas of conformance, may need further additions, notably regarding the definition of "Unicode coverage" (described in subsection 4.5 for fonts); the same can be said of section 5, on levels of support (notably subsection 5.1, on repertoire coverage).

The main issue is that there are still no explicit coverage rules for the conjuncts and contextual forms that are absolutely required to properly cover the languages using this repertoire, because everything is based only on the repertoire of single abstract characters.

There is some information in the UTS chapters describing each script, but really, the UTS should provide a list of strings that need special treatment, notably for rendering.

Unicode has introduced the concept of "named character sequences", but the associated repository is still far short of what is really needed to demonstrate correct coverage of a script, even when an implementation conforms with regard to the set of code points that it supports correctly.
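To make the gap concrete, here is a minimal sketch in Python of the difference between code point coverage and sequence coverage. Everything in it is an assumption made for illustration: the function names, the "font" sets and the list of required sequences are hypothetical and do not come from any Unicode data file or real font. The only factual part is the Devanagari conjunct used as the example, KA + VIRAMA + SSA (U+0915 U+094D U+0937).

```python
# Hypothetical illustration: repertoire coverage versus sequence coverage.
# None of the sets below describe a real font or a real Unicode data file.

KA, VIRAMA, SSA = "\u0915", "\u094D", "\u0937"  # Devanagari letters and virama


def covers_repertoire(text, supported_code_points):
    """True if every single code point of `text` is in the supported set."""
    return all(ch in supported_code_points for ch in text)


def missing_sequences(required_sequences, supported_sequences):
    """Return the required sequences that have no dedicated support."""
    return [seq for seq in required_sequences if seq not in supported_sequences]


# An imaginary font: it has a glyph for each code point but no conjunct forms.
font_code_points = {KA, VIRAMA, SSA}
font_sequences = set()

# An imaginary list of sequences a language needs rendered as a single unit.
required = [KA + VIRAMA + SSA]

print(covers_repertoire(required[0], font_code_points))  # code point check passes
print(missing_sequences(required, font_sequences))       # yet the conjunct is missing
```

The first check is all that a repertoire-based conformance claim verifies; the second is the kind of check that would need a published list of required sequences to be meaningful.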

More complete resources are found not in Unicode itself but in other specifications, notably the OpenType font specifications, even though these definitions would have broader use than just rendering. Some parts, however, are described in Unicode itself: [Bidi], joining types.

But shaping rules do affect other aspects of text processing, especially selection, line break properties and identifier properties, which help to give or preserve the semantics of a substring made of multiple code points. The most interesting aspect behind this is the concept of unbreakability: if a substring is unbreakable, it is because it carries a specific semantic that is unique to the group and is not the mere sum or aggregation of the properties of its individual character components.

As soon as such an entity gets its own semantic identity, there comes a time when this substring may have a distinct visual representation (which will then require a specific encoding for this identity to be maintained). The problem is that many implementations (fonts, for example) may still conform to Unicode, to its character model, and to the common ISO/IEC 10646 repertoire, without being usable even in the script for which they were designed. This generates incompatibilities, or pushes end users to create alternate representations, with different encodings, for substrings that have the same intended identity.

So how can this be solved? Conformance levels are clearly the culprits, because they are lagging far behind what is really needed to provide correct coverage. The current solutions therefore come not from the UTS itself but from other areas, such as existing proprietary implementations, or national standards bodies that have produced reference implementations (for example reference fonts, especially for complex scripts like the Indic ones). Conforming to Unicode is then not enough, and users have to seek other conformance levels and cite them by reference (for example, conformance to Unicode *and* to the Inscript input method, or to Unicode *and* TIS-620, sometimes also with a reference date).

There is still no single unifying standard that encompasses all the rules needed to properly support the languages that are theoretically encoded and standardized in Unicode.
