L2/02-381

Contribution to the SC22 ad-hoc meeting on characters

Source: The Unicode Consortium
Date: August 15, 2002

The Unicode Consortium seeks a dialogue with SC22 in a number of areas where the interests of character encoding and programming languages intersect. These can be summarized as:

UTF-16 data type

Identifiers

Character Properties

The following sections present more detail and background on each of these areas.

1. UTF-16 data type

See document L2/02-107, appended at the end.

2. Identifiers

Identifier definitions are used in formal programming languages, but also in other areas, such as domain names and markup languages. As each of these areas extends to the use of internationalized identifiers, a similar set of issues emerges. The repertoire of the Universal Character Set (Unicode/10646) is growing dynamically and in large chunks. From a start with 35,000 characters ten years ago, the UCS now contains over 94,000 characters. For anyone defining identifier syntax, it has been a moving target.

There are three main strategies for expanding an identifier definition to cope with the Universal Character Set:

Strategy 1: Limited, fixed list

A limited, fixed list of characters is allowed in identifiers. No extensions are made, so future character additions are disallowed.

The first strategy maximizes interoperability, at the expense of excluding some user groups whose (human readable) languages require characters that are not in the limited list.

Strategy 2: Fixed list with periodic extensions

A fixed list of characters is allowed in identifiers, with periodic extensions of the list as new characters are encoded.

For consistency, the periodic extensions should follow a generalized rule, i.e. newly added characters should be treated the same for identifiers as existing characters with the same character properties. The easy way to do this is by reference to the derived Unicode character properties Identifier_Start and Identifier_Extend. Even when existing work is leveraged in this way, the main drawbacks of this method are its high maintenance costs, coupled with the problems of interoperating between different versions.
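As an illustration only, Strategy 2 can be sketched in Python, with the standard `unicodedata` module standing in for the Unicode Character Database. The two category sets below approximate Identifier_Start and Identifier_Extend by General_Category; that approximation and the function name `is_identifier` are assumptions made for this sketch, not the normative derivation.

```python
import unicodedata

# Assumed approximation of Identifier_Start: letters and letter numbers.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Assumed approximation of Identifier_Extend: the above plus marks,
# decimal digits and connector punctuation (such as "_").
EXTEND_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier(text: str) -> bool:
    """Check text against the property-derived rule (hypothetical helper)."""
    if not text:
        return False
    if unicodedata.category(text[0]) not in START_CATEGORIES:
        return False
    return all(unicodedata.category(ch) in EXTEND_CATEGORIES for ch in text[1:])


# The answer depends on which version of the database is in use.
print("database version:", unicodedata.unidata_version)
print(is_identifier("na\u00efve_1"))  # letters, "_" and a digit
print(is_identifier("1abc"))          # a digit cannot start an identifier
```

Because the rule is stated in terms of properties rather than an enumerated list, a periodic extension amounts to moving to a newer version of the database. The interoperability problem remains: two implementations built on different versions can disagree about the same identifier.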

Strategy 3: Fixed exclusion list

All characters except a (small) fixed list are allowed in identifiers, including, by implication, future additions.

This third approach is being considered by W3C because it would require no maintenance and would allow maximal expressiveness of human readable identifiers, while achieving very good interoperability between versions. However, as a consequence of allowing unassigned characters, it will be possible to create identifiers that are not "words" in any language.
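A minimal sketch of Strategy 3 in Python follows. The exclusion list shown (ASCII white space and syntax characters) and the name `allowed_by_exclusion` are hypothetical; an actual formal language would choose its own list.

```python
# Hypothetical exclusion list: ASCII white space and syntax characters.
EXCLUDED = frozenset(" \t\r\n!\"#$%&'()*+,-./:;<=>?@[\\]^`{|}~")


def allowed_by_exclusion(text: str) -> bool:
    """Accept any non-empty string that avoids the excluded characters."""
    return bool(text) and not any(ch in EXCLUDED for ch in text)


print(allowed_by_exclusion("total_count"))
print(allowed_by_exclusion("a+b"))
# U+0378 was an unassigned code point in Unicode 3.2; the rule accepts it
# anyway, which is the "not a word in any language" consequence noted above.
print(allowed_by_exclusion("x\u0378"))
```

No character database is consulted at all, which is why the rule needs no maintenance and gives the same answer regardless of the Unicode version an implementation was built against.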

For purposes of matching, allowing unassigned characters is not a tremendous problem, but some issues must be considered. Case insensitive comparisons of identifiers can only be performed on characters for which the case mappings are defined. However, bicameral scripts are the exception, and almost all new characters to be added can be expected to be caseless. A similar issue applies to combining characters and normalization. This issue can be avoided in environments where all identifiers are required to be normalized before parsing, so that a downlevel parser never needs to normalize uplevel characters (Early Uniform Normalization).
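The matching issues can be illustrated with a short Python sketch. The choice of normalization form NFC, the use of `str.casefold()` for case insensitive comparison, and both function names are assumptions made for illustration; a full caseless match for arbitrary input needs additional normalization steps.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Normalize once, early, so later stages can compare code points."""
    return unicodedata.normalize("NFC", name)


def same_identifier_caseless(a: str, b: str) -> bool:
    """Simplified case insensitive match on normalized identifiers."""
    return normalize_identifier(a).casefold() == normalize_identifier(b).casefold()


composed = "caf\u00e9"      # "e with acute" as a single code point
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(composed == decomposed)  # False: different code point sequences
print(normalize_identifier(composed) == normalize_identifier(decomposed))  # True
print(same_identifier_caseless("CAF\u00c9", decomposed))  # True
```

If every producer applies `normalize_identifier` before identifiers reach a parser, the parser itself only ever compares code point sequences and does not need normalization data for characters newer than its own tables.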

What the Unicode Consortium can contribute:

The Unicode Technical Committee (UTC) analyses all new characters for the purpose of assigning a diverse set of character properties (see section 3). This work is tightly coupled with the task of identifying new characters for encoding, as the character properties have a bearing on the encoding decision. The Unicode Character Database should therefore be considered as the source for the data for Strategy 2. Unicode also provides a definition of normalization (http://www.unicode.org/reports/tr15).

Responsibility of programming languages committees:

Decide which of the strategies above is appropriate for some or all programming languages. Further, tailor identifier syntaxes to account for the peculiarities of given formal languages ("_", "-", "@", etc.). Decide whether non-spacing marks should be allowed in identifiers, and address the resulting issues of canonical equivalence and of normalization of identifiers.

SC22 should further consider issues of stability of identifier definitions, of interoperability between versions of a language, between fixed lists, based on different versions of the Universal Character Set, and between different languages.

Summary

None of the Unicode members expects the UTC to be the forum that would establish the particular identifier syntax rules that apply in C or C++ or COBOL (or XML, ASN.1, SQL, or anything else). Unicode fully recognizes that this is a concern of each relevant standardizer. What the UTC establishes are consistent, extensible rules for Identifier_Start and Identifier_Extend properties for all Unicode characters, based on in-depth knowledge of how these characters are used in their respective scripts. Those rules can be adapted and customized as required for particular formal syntaxes.

Recommendations for programming languages (and markup languages, such as XML) should be based on that Unicode analysis -- and then should make principled decisions regarding whether identifier syntax should be permanently pegged at some particular release version of Unicode (e.g. Unicode 3.0), should accept new repertoire as it is added in future versions, or should simply take the position that all characters are allowed except for a deliberate exception list (also tied to a particular release version of Unicode). There are tradeoffs in identifier and maintenance stability, as well as interoperability considerations as discussed above.

But what the formal language committees should not be doing is wading through 94,000+ Unicode characters trying to establish all their properties and sorting them into categories for recommended Identifier_Start and Identifier_Extend classes, since that is the primary expertise of the UTC. Instead, programming language committees are encouraged to leverage the large body of existing work by pointing to the Unicode table(s), and then reviewing the particular extensions, limitations, or customizations (treatment of "_", "-", "@", other syntax characters) that apply to their particular standards.

3. Character Properties

Character properties can be very complex and multifaceted. They are tied closely to issues considered in the encoding of characters (see the attached document on the way the UTC defines and updates character properties). Since the Universal Character Set is large (94,000+ characters) and still growing, the task of correctly defining and maintaining character properties is decidedly non-trivial.

Some of the legacy properties dealt with in legacy formal languages and runtime environments (C, C++, also POSIX, etc.), such as isLetter(), isPunct(), etc., are not readily extensible to the Universal Character Set. In part, this is because they make distinctions that are natural for some commonly used scripts, but don't translate well to other types of scripts. This is related to the problem of identifier classes as well; see section 2 on Identifiers.

The functions for which the legacy properties were created in the first place include: to assist in identifier parsing; text element and line breaking behavior; parsing numbers; calculating display cells; etc. These are partially extensible to the Western European and East Asian cases, but break down when applied to the Universal Character Set as a whole.
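The breakdown can be seen in a small Python example. Python's `str.isalpha()` is used here as a stand-in for a legacy isLetter()-style predicate; that equivalence is an assumption made for illustration, as is the choice of sample word.

```python
import unicodedata

# The Hindi word for "Hindi": three consonant letters plus two dependent
# vowel signs and a virama, all of which are ordinary parts of the word.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# A single yes/no "letter" test rejects the word as a whole ...
print(word.isalpha())  # False

# ... because the vowel signs and the virama are marks (Mc, Mn), not letters.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

A process that used the binary predicate to find word or identifier boundaries would split this word at every mark, whereas the finer-grained General_Category values make the distinction available to the algorithm.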

An alternative approach is to define appropriate behavior that depends on character properties. This requires a much deeper and more sophisticated collection of properties -- not simply extension of a small set that functioned adequately for ASCII and JIS. Such a richer set of properties is maintained by the UTC using an open process, and made available in stable and referenceable form in the Unicode Character Database (see the attached document on the UCD). The set of properties in the Unicode Character Database is derived from the intersection of character coding expertise with demands from a wide variety of users in the IT industry. The process used for maintaining and extending the Unicode Character Database is very responsive to the needs of implementers, both for the types of properties defined and for their assignments.

When SC22 standards need to adapt to the UCS and its requirements, they should make reference to the appropriate values from the Unicode Character Database. This would absolve the SC22 committees of the whole tricky problem of trying to manage and extend character properties, and instead allow them to focus on the issues of how the properties interact with their traditional uses in formal languages (e.g. the definition of identifiers) and how properties for the UCS get surfaced in extensions to their traditional library standard APIs (e.g. ctype.h). While the Unicode Character Database is a de facto standard, with its own (perhaps peculiar) logic for development, control, publication, feedback, etc., it enjoys widespread acceptance in the IT industry, and its maintaining committee, the UTC, has a demonstrated track record of being able to attract relevant expertise and being able to manage such a complex set of data. By referencing this database instead of duplicating it, the formal language standards can leverage the work and ensure that classification of characters is done in a consistent manner.

The Unicode Character Database

The Unicode Character Database (UCD) is a formal part of the Unicode Standard. It consists of a number of data files that define character properties and their names, along with documentation files that explain the organization of the database and the format and meaning of the data in the files. When new characters or properties are added, or any other changes are made, a new version is created. The procedures followed by the Unicode Technical Committee in approving updates to the Unicode Standard and Unicode Character Database can be found at: http://www.unicode.org/unicode/consortium/utc-procedures.html. In addition to the open development process, each update of the Unicode Character Database is subject to a beta review open to the general public.

Referencing the Unicode Character Database

Properties and property values have defined names and abbreviations, such as

Property: General_Category (gc)
Property Value: Uppercase_Letter (Lu)

To reference a given property and property value, it is sufficient to state:

The property value "Uppercase_Letter" from the "General_Category" property, as defined in Unicode 3.2.0,

and then cite that version of the standard, using the standard citation format that is provided for each version of the Unicode Standard:

Unicode 3.2.0 (March, 2002)

The Unicode Consortium. The Unicode Standard, Version 3.2.0, defined by: The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27/) and the Unicode Standard Annex #28: Unicode 3.2 (http://www.unicode.org/reports/tr28/)
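To show what a version-specific reference means in practice, the following Python sketch queries the General_Category property in two versions of the database. It relies on `unicodedata.ucd_3_2_0`, the Unicode 3.2.0 snapshot that ships with CPython; the lookup table and the helper `has_general_category` are hypothetical names introduced for this example.

```python
import unicodedata

# The Unicode 3.2.0 snapshot of the database that ships with CPython.
ucd_3_2 = unicodedata.ucd_3_2_0

# Hypothetical helper table: property value names -> abbreviations (subset).
GENERAL_CATEGORY = {"Uppercase_Letter": "Lu", "Lowercase_Letter": "Ll"}


def has_general_category(ch, value_name, db=unicodedata):
    """True if ch has the named General_Category value in database db."""
    return db.category(ch) == GENERAL_CATEGORY[value_name]


print(ucd_3_2.unidata_version)  # 3.2.0
print(has_general_category("A", "Uppercase_Letter", ucd_3_2))  # True

# U+1E9E LATIN CAPITAL LETTER SHARP S was encoded after Unicode 3.2.0,
# so the two versions of the database give different answers for it.
sharp_s = "\u1e9e"
print(has_general_category(sharp_s, "Uppercase_Letter", ucd_3_2))  # False
print(has_general_category(sharp_s, "Uppercase_Letter"))           # True
```

Citing the version alongside the property name is what makes the first two answers stable: a standard that says "as defined in Unicode 3.2.0" gets the same classification no matter which later version an implementation also carries.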

Version numbers for the Unicode Standard consist of three fields: the major version, the minor version, and the update version. For historical reasons, the numbering within each of these fields is not necessarily consecutive. The differences among the fields are as follows:

major	-	significant additions to the standard, published as a book.
minor	-	character additions or more significant normative changes, published as a Technical Report.
update	-	any other changes to normative or important informative portions of the standard that could change program behavior. These changes are reflected in new Unicode Character Database files and an update page.
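A program that records which version it references might represent the three fields as follows. This is a minimal Python sketch; the type `UnicodeVersion` and the function `parse_version` are hypothetical names.

```python
from typing import NamedTuple


class UnicodeVersion(NamedTuple):
    major: int   # significant additions, published as a book
    minor: int   # character additions or significant normative changes
    update: int  # other changes that could change program behavior


def parse_version(text: str) -> UnicodeVersion:
    """Parse a version string of the form 'major.minor.update'."""
    major, minor, update = (int(part) for part in text.split("."))
    return UnicodeVersion(major, minor, update)


cited = parse_version("3.2.0")
print(cited)
# Tuples compare field by field, so versions order naturally even though
# the numbering within a field is not necessarily consecutive.
print(parse_version("3.0.1") < parse_version("3.1.0") < cited)
```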

Character Properties

The following sections give some background on character properties.

The Unicode Standard views character semantics as inherent to the definition of a character, and conformant processes are required to take these into account when interpreting characters. The assignment of character semantics for the Unicode Standard is based on character behavior. For other character set standards, it is left to the implementer, or to unrelated secondary standards, to assign character semantics to characters. In contrast, the Unicode Standard supplies a rich set of character attributes, called properties, for each character contained in it. Many properties are specified in relation to processes or algorithms that interpret them, in order to implement the discovered character behavior.

Character Behavior in Context

The interpretation of some properties (such as the case of a character) is largely independent of context, whereas the interpretation of others (such as directionality) is applicable to a character sequence as a whole, rather than to the individual characters that compose the sequence.

Other examples that require context include the classification of neutrals in script assignments or title casing. The line breaking rules of UAX#14 Line Breaking Properties [LineBreaking] involve character pairs and triples, and in certain cases, longer sequences. The glyph(s) defined by a combining character sequence are the result of contextual analysis in the display shaping engine. Isolated character properties typically only tell part of the story.
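The limits of isolated properties can be illustrated with the bidirectional class, using Python's `unicodedata.bidirectional()`. The sample string is an arbitrary choice for this sketch.

```python
import unicodedata

# Latin letter, space, Hebrew letter alef, exclamation mark.
sample = "A \u05d0!"

for ch in sample:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
# The letters carry strong classes (L and R). The space (WS) and the
# exclamation mark (ON) are neutral: the per-character property does not
# say which direction they take. That is resolved from the surrounding
# characters by the bidirectional algorithm.
```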

Relation of Character Properties to Algorithms

When modeling character behavior with computer processes, formal character properties are assigned in order to achieve the expected results. Such modeling depends heavily on algorithms. In some cases, a given character property is specified in close conjunction with a detailed specification of an algorithm. In other cases, algorithms are implied but not specified, or several algorithms can make use of the same general character property. The last case may require occasional differences in character property assignment to make all algorithms work correctly. This can usually be achieved by overriding specific properties for specific algorithms.

When assigning character properties for use with a given algorithm, it may be tempting to assign somewhat arbitrary values to some characters, as long as the algorithm happens to produce the expected results. Proceeding in this way hides the nature of the character and limits the re-use of character properties by related processes. Therefore, instead of tweaking the properties simply to make a particular algorithm easier, the Unicode Standard pays careful attention to the underlying essential linguistic identity of the character. However, not all aspects of a character's identity are relevant in all circumstances, and some characters can be used in many different ways, depending on context or circumstance. Because of this, the formal character properties alone are not sufficient to describe the complete range of desirable or acceptable character behaviors.

L2/02-107

To:	US JTC 1 TAG
From:	L2
Date:	2002-02-19
Re:	German NB proposal on UTF-16 datatype

L2 has considered the proposal from the German NB for a new work item to add a UTF-16 data type to the C standard (SC 22 N 3356, L2/02-007). This was discussed in a meeting with C language committee members during the L2 meeting on 2002-02-12. On the basis of that discussion, L2 recommends that the US JTC 1 TAG adopt the following as the US position:

The U.S. NB supports this new work item. Adding a UTF-16 datatype and string literal support to the C standard would greatly benefit implementers of Unicode / 10646 in making use of the C standard.
In particular, the following additions would be technically advantageous:
1. UTF-16 datatype. Exactly 16 bits, to explicitly hold a Unicode / 10646 UTF-16 code unit.
2. UTF-16 string type. Linked explicitly with the UTF-16 datatype, so that static string initialization with UTF-16 data would be easy and explicit.
3. UTF-32 datatype. Exactly 32 bits, to explicitly hold a Unicode/10646 code point (without the cross-platform size ambiguity of wchar_t).
4. UTF-32 string type (optional). Linked explicitly with the UTF-32 datatype. This might be useful, but for most implementations is less important than having the UTF-16 string type.
Regarding the terminology to be associated with any such new datatypes for C, usage of "UTF-16" and "UTF-32" is preferred. The exact form of the names for new datatypes would, of course, be up to the C committee to determine, but names along the lines of "utf16_t", "utf32_t" or the like would be satisfactory.
- It is advisable to avoid any terminological usage involving "UCS-2" and "UCS-4". The term "UCS-2" would be misleading, since it is the fixed-width 16-bit form of 10646, limited only to the BMP, whereas all significant implementations are now moving to the variable-width UTF-16, to get all-plane support for 10646. Use of "UCS-4" is not parallel, and just induces a cognitive matching problem of converting from 4 octets to 32 bits -- which is the more normal concept for a 32-bit datatype. Furthermore, the "16" and "32" are more normal concepts for C programmers dealing with datatype sizes.
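The difference between the two proposed code unit sizes can be illustrated as follows. The sketch is in Python rather than C, and the helper names `utf16_code_units` and `utf32_code_units` are hypothetical; it only shows the values that arrays of 16-bit and 32-bit units would hold.

```python
import struct


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text in UTF-16."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def utf32_code_units(text: str) -> list:
    """Return the 32-bit code units (code points) of text in UTF-32."""
    data = text.encode("utf-32-le")
    return list(struct.unpack(f"<{len(data) // 4}I", data))


# U+0041 is in the BMP; U+10400 lies outside it.
sample = "A\U00010400"

print([hex(u) for u in utf16_code_units(sample)])  # ['0x41', '0xd801', '0xdc00']
print([hex(u) for u in utf32_code_units(sample)])  # ['0x41', '0x10400']
```

The character outside the BMP occupies two 16-bit units (a surrogate pair) but one 32-bit unit, which is why the 16-bit type holds a code unit rather than a character and why "UCS-2" would be a misleading name for it.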
The U.S. does not suggest adding any corresponding APIs for the standard libraries, to match already existing APIs relevant to char and wchar_t string types. Simply making the datatype additions listed in (2) above would meet the essential requirements that vendors have on the language to make their Unicode porting and other tasks simpler and more uniform. API support for Unicode semantics is, at this point at least, more appropriately provided by various third-party add-on libraries.
The U.S. considers it important that other language standards, and in particular, C++, take these issues into account, so that if a new datatype or datatypes are added to C, interoperability with other languages can be maintained as well.