L2/07-115

Unicode Properties in Character Proposals (draft)

This is a draft of a document concerning character properties which was requested by UTC in action item 110-A99:

110-A99 Make up a set of questions for determining character properties, particularly punctuation. (Cf. 110-A098)

Introduction

Characters in The Unicode Standard have a number of properties, some of which are obvious and easily discovered, and some of which are not. Some properties are assigned automatically (such as Derived Age); others are assigned with ease because they are implicit in the character name or in other information easily supplied by a proposal author. For general information on character properties, see The Unicode Standard, Chapter 4 (PDF).

For reference, a more-or-less complete list of properties can be found online here:

http://unicode.org/Public/UNIDATA/UCD.html#Properties

The questions and discussion below have been developed to get proposal authors and committee members thinking about, and providing in proposals, the property information that will be needed at the time new characters are published in the standard. For each character in a proposal, the proposal author should think about the character in context, and answer questions about how the character interacts with other characters.

Basic Information

The most basic information required about characters includes Name, Codepoint, and other identity information, such as whether a character goes by more than one name, or can be cross-referenced to another character.

The codepoints are typically assigned by the committees (WG2 and UTC), but if the proposal is for an entire script, it is probably already on the roadmap, and therefore a particular range of codepoints may already have been pre-selected. In other cases, those proposing characters can make recommendations about where the characters should be encoded, but it isn't necessary to do so.

If there are alternative names for a character or characters in the proposal, these should also be discussed, as well as other information about the meanings of names, and similarities in behavior to other characters that are already encoded in the standard.

General Category and Other Properties

Each character is assigned a "General Category". These are documented in the Unicode Character Database (UCD). Typical categories include things such as "letter", "combining mark", "symbol" and so forth. This category must be specified, or suggested, for each character in a proposal.

The file UnicodeData.txt contains a number of properties in a specified layout. It is most helpful if a proposal includes, for each character proposed, a line emulating an entry in that file. For example, the following line is the UnicodeData.txt entry for the Greek upper-case Gamma:

	0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;

The properties in UnicodeData.txt are documented here: http://www.unicode.org/Public/UNIDATA/UCD.html
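As an illustration of the layout, the following minimal Python sketch splits a UnicodeData.txt-style line into its 15 semicolon-separated fields. The short field names and the function name parse_unicode_data_line are assumptions made for readability; they are not official property identifiers.

```python
# Illustrative field labels for the 15 fields of a UnicodeData.txt entry.
# The labels are this sketch's own shorthand, not official property names.
FIELDS = [
    "code_point", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
]


def parse_unicode_data_line(line: str) -> dict:
    """Split one entry into a field-name -> value mapping (empty string if unset)."""
    parts = line.strip().split(";")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


entry = parse_unicode_data_line(
    "0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;"
)
for field, value in entry.items():
    print(f"{field:28} {value or '(empty)'}")
```

A proposal author could run a draft line through a check like this to confirm that it has the right number of fields before submitting it.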

The discussion below relates to these properties as well as other extended properties that are documented in other files.

Some scripts have case (A/a); if so, it will be necessary to know:

Can the character be used in identifiers? Normally only modern-use letters, marks, and numbers are permitted in identifiers (used, for example, in programming languages, user names, international domain names, etc.).

If allowable in identifiers, can it start an identifier, or would it only be used as a non-first character? (Most characters that are allowed in identifiers can be the first character.) Any special handling or considerations should be spelled out.
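One way to see how an already-encoded, comparable character is treated is to ask a programming language. The sketch below uses Python's own identifier rule, which is derived from the XID_Start and XID_Continue properties (plus the underscore). It reflects one language's rule only, and the function name identifier_status is a hypothetical helper.

```python
def identifier_status(ch: str) -> str:
    """Classify a single character under Python's identifier rule.

    Returns "start" if it may begin an identifier, "continue" if it is
    only allowed in a non-first position, and "none" otherwise.
    """
    if ch.isidentifier():
        return "start"
    if ("a" + ch).isidentifier():
        return "continue"
    return "none"


for ch in ["\u0393", "\u0301", "5", "!"]:
    print(f"U+{ord(ch):04X}", identifier_status(ch))
```

Here U+0393 (a letter) can start an identifier, U+0301 (a combining mark) and the digit 5 can only continue one, and the exclamation mark is not allowed at all.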

Is the character an ordinary letter of an alphabet or syllabary (non-CJK ideograph)? Or is it a stand-alone symbol? (For CJK ideographs, see the special section below.)

Is the character a white-space character, or does it cause visible separation between other characters?

Does the character have a numeric value?
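For a character that is already encoded and that the proposed character resembles, the answers to the last few questions can be looked up directly. The sketch below uses Python's standard unicodedata module. The helper name basic_profile is an assumption, and Python's isspace test is only an approximation of the White_Space property.

```python
import unicodedata


def basic_profile(ch: str) -> dict:
    """Collect a few basic properties of one already-encoded character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        # str.isspace() approximates, but is not identical to, White_Space.
        "is_space": ch.isspace(),
        # None means the character has no numeric value.
        "numeric_value": unicodedata.numeric(ch, None),
    }


for ch in ["\u0393", "\u2003", "\u00BD", "\u2167"]:
    print(f"U+{ord(ch):04X}", basic_profile(ch))
```

The four sample characters are a letter (GREEK CAPITAL LETTER GAMMA), a space (EM SPACE), a fraction (VULGAR FRACTION ONE HALF), and a letter-like number (ROMAN NUMERAL EIGHT).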

Is it a "base letter" or does it combine with letters or symbols?

If it is a combining character:

If this is a punctuation character:

Line breaking behavior can be tricky, but many characters simply behave "just like" some other characters. Is there a character already in the standard that behaves similarly, or identically, to this character in terms of line breaking?

To determine proper line-breaking behavior, one can think of a line of text in a graphic window. As a window is re-sized to be narrower, and words are made to automatically wrap to the next line, how does this character behave?

Can the character be normalized to (or mapped to) another character, or some combination of other characters, either already-encoded, or not-yet encoded in the standard?
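To see what such mappings look like for characters already in the standard, the following sketch prints the decomposition mapping and the normalized forms of three existing characters, using Python's standard unicodedata module. The choice of sample characters is illustrative only.

```python
import unicodedata

samples = {
    "\u00C5": "LATIN CAPITAL LETTER A WITH RING ABOVE (canonical decomposition)",
    "\u212B": "ANGSTROM SIGN (canonical singleton)",
    "\uFB01": "LATIN SMALL LIGATURE FI (compatibility decomposition)",
}

for ch, note in samples.items():
    mapping = unicodedata.decomposition(ch) or "(none)"
    nfd = unicodedata.normalize("NFD", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {note}")
    print("  decomposition mapping:", mapping)
    print("  NFD :", " ".join(f"U+{ord(c):04X}" for c in nfd))
    print("  NFKC:", " ".join(f"U+{ord(c):04X}" for c in nfkc))
```

A mapping tagged with a label such as <compat> is a compatibility mapping; an untagged mapping is canonical. A proposal should say which kind, if any, applies to the new character.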

Is the character a math or technical operator?

In the context of bidirectional text, how does the character behave? The main issues are directionality of "R" versus "AL". Symbols need to have their directionality specified as L, R, AL, or neutral; and some discussion of this may be required in the proposal, for each such symbol.

Also, special symbols need to be compared to the behavior of other special symbols in bidi, and the directional class of numbers needs to be specified.
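When comparing against characters already in the standard, their bidirectional classes can be listed with a few lines of Python. This is a minimal sketch using the standard unicodedata module, and the sample characters are chosen only for illustration.

```python
import unicodedata

samples = [
    ("A", "Latin letter"),
    ("\u05D0", "Hebrew letter alef"),
    ("\u0627", "Arabic letter alef"),
    ("1", "European digit"),
    ("\u0661", "Arabic-Indic digit one"),
    ("+", "plus sign"),
    ("!", "exclamation mark"),
    ("(", "left parenthesis"),
]

for ch, label in samples:
    bidi_class = unicodedata.bidirectional(ch)
    mirrored = bool(unicodedata.mirrored(ch))
    print(f"U+{ord(ch):04X} {label:24} class={bidi_class:3} mirrored={mirrored}")
```

The output shows the distinction between R (Hebrew) and AL (Arabic) for letters, the separate classes EN and AN for digits, and neutral or weak classes for symbols.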

What about shaping behavior? In scripts such as Arabic, the shaping classes and behavior need to be explicitly determined for each such letter.

Should the character belong to any of the special categories, such as hyphen, dash, diacritic?

Special Considerations for CJK Additions

Addition of CJK ideographs is usually handled by the Ideographic Rapporteur Group (IRG), but in rare cases, a proposal for CJK characters may be presented to UTC. If the character is a CJK ideograph, most of its properties should be identical to those of all other CJK ideographs, so a whole set of questions is already answered. However, there are some other questions:

CJK characters will also need to have a lot of associated data, as specified in the Unihan documentation. See: http://www.unicode.org/reports/tr38/ and http://www.unicode.org/Public/UNIDATA/Unihan.html for details.

Collation and Ordering Issues

Characters are often ordered in relation to other characters. For symbols, the default ordering often doesn't matter very much. However, for characters that are part of an alphabet or syllabary, the default order is often quite important. If you are proposing a whole script, the binary order of the proposal is often taken as the first approximation of an expected ordering. If the binary order differs from the expected "native" ordering, the reasons should be justified and spelled out. Otherwise, the characters in the proposal should simply be laid out in a logical, expected ordering. Where two or more orders occur with some frequency, it is helpful to discuss their differences, and to include both orders in the text of the proposal.

If you are proposing additional characters in a script that is already encoded, it is necessary to show where the characters should be sorted in relation to the other characters already encoded. For example, if new syllables in the Vai or Yi scripts are to be added, their binary order (where they are encoded) may be very different from where they should occur in the syllabic order. The proposal needs to specify exactly where the characters should be interpolated in sorting.
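The difference between binary order and an interpolated order can be shown with a toy example. In the sketch below, the existing order, the new character, and its placement are all hypothetical and chosen only to illustrate the idea; they are not actual collation data for any script.

```python
def interpolate(existing, placements):
    """Return a new order with each new character inserted right after its anchor."""
    order = list(existing)
    for new_char, anchor in placements:
        order.insert(order.index(anchor) + 1, new_char)
    return order


# Hypothetical data: an existing order and one new character (U+0180)
# that is assumed, for illustration, to sort immediately after "b".
existing = ["a", "b", "c", "d"]
placements = [("\u0180", "b")]

native = interpolate(existing, placements)
binary = sorted(native)  # Python orders strings by code point

# Sort some sample strings by the interpolated order instead of by code point.
rank = {ch: i for i, ch in enumerate(native)}
words = ["d\u0180", "\u0180a", "ca", "bd"]
by_native = sorted(words, key=lambda w: [rank[c] for c in w])
by_binary = sorted(words)

print("interpolated order:", native)
print("binary order      :", binary)
print("sorted (native)   :", by_native)
print("sorted (binary)   :", by_binary)
```

Stating each placement as "new character, sorts immediately after existing character" is one compact way for a proposal to specify the interpolation.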

Besides the primary and secondary order issues for the letters and digits, the proposal author also needs to provide some information about how "special" characters behave: whether they are simply ignored for collation, or have some special order. That is often the hardest part of coming up with collation table assignments. It may help to think in terms of how such symbols might behave like other characters already encoded.

Draft date: 2007-05-03