Draft Unicode Technical Report #31

Identifier and Pattern Syntax

Version	2 (draft1)
Authors	Mark Davis
Date	2004-01-23
This Version	http://www.unicode.org/reports/tr31/tr31-2.html
Previous Version	http://www.unicode.org/reports/tr31/tr31-1.html
Latest Version	http://www.unicode.org/reports/tr31/

Summary

This document describes specifications for recommended defaults for the use of Unicode in the definitions of identifiers and in pattern-based syntax. It incorporates the Identifier section of Unicode 4.0 (somewhat reorganized) and a new section on the use of Unicode in patterns. As a part of the latter, it presents recommended new properties for addition to the Unicode Character Database.

Feedback is requested both on the text of the new pattern section and on the contents of the proposed properties.

Status

This document has been approved by the Unicode Technical Committee for public review as a Draft Unicode Technical Report. Making this document available for public review does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1. Introduction

2. Default Identifier Syntax

3. Alternative Identifier Syntax

4. Pattern Syntax

Acknowledgements

References

Modifications

1. Introduction

A common task facing an implementer of the Unicode Standard is the provision of a parsing and/or lexing engine for identifiers. To assist in the standard treatment of identifiers in Unicode character-based parsers, a set of specifications is provided here as a recommended default for the definition of identifier syntax. These guidelines are no more complex than current rules in the common programming languages, except that they include more characters of different types.

In addition, this document provides a proposed definition of a set of properties for use in defining stable pattern syntax: syntax that is stable over future versions of the Unicode Standard.

Note to reviewers: Section 2 would eventually supersede Section 5.15 Identifiers from The Unicode Standard 4.0.

2. Default Identifier Syntax

The formal syntax provided here is intended to capture the general intent that an identifier consists of a string of characters that begins with a letter or an ideograph, and then includes any number of letters, ideographs, digits, or underscores. Each programming language standard has its own identifier syntax; different programming languages have different conventions for the use of certain characters from the ASCII range ($, @, #, _) in identifiers. To extend such a syntax to cover the full behavior of a Unicode implementation, implementers need only combine these specific rules with the sample syntax provided here.

Syntactic Rule

<identifier> := <identifier_start> (<identifier_start> | <identifier_extend>)*

Identifiers are defined by the following sets of character categories from the Unicode Character Database.

**Syntactic Classes for Identifiers**
Syntactic Class	Properties	Coverage
`<identifier_start>`	General Category = L or Nl, or Other_ID_Start = true	Uppercase letter, lowercase letter, titlecase letter, modifier letter, other letter, letter number, stability extensions
`<identifier_extend>`	General Category = Mn, Mc, Nd, Pc, or Cf	Nonspacing mark, spacing combining mark, decimal number, connector punctuation, formatting code

The innovations in the identifier syntax to cover the Unicode Standard include the following:

Incorporation of proper handling of combining marks
Allowance for layout and format control characters, which should be ignored when parsing identifiers
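The syntactic rule and the two syntactic classes above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. It assumes the standard `unicodedata` module, which reports General Category values for whatever Unicode version is bundled with the interpreter (not necessarily 4.0). It hard-codes Other_ID_Start from the four characters listed in Section 2.4, because that module does not expose the property. The function names are hypothetical.

```python
import unicodedata

# Other_ID_Start, hard-coded from the four characters that Section 2.4 lists
# for Unicode 4.0 (assumption: unicodedata does not expose this property).
OTHER_ID_START = {0x2118, 0x212E, 0x309B, 0x309C}

# General Category values from the table: L (all letter subcategories) or Nl.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# General Category values for <identifier_extend>.
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def is_identifier_start(ch: str) -> bool:
    """True if ch belongs to <identifier_start>."""
    return unicodedata.category(ch) in START_CATEGORIES or ord(ch) in OTHER_ID_START


def is_identifier_extend(ch: str) -> bool:
    """True if ch belongs to <identifier_extend>."""
    return unicodedata.category(ch) in EXTEND_CATEGORIES


def is_identifier(text: str) -> bool:
    """<identifier> := <identifier_start> (<identifier_start> | <identifier_extend>)*"""
    if not text or not is_identifier_start(text[0]):
        return False
    return all(is_identifier_start(c) or is_identifier_extend(c) for c in text[1:])
```

Under these default classes the underscore (General Category Pc) may continue an identifier but not start one, so `is_identifier("snake_case")` holds while `is_identifier("_private")` does not. A language that wants leading underscores would tailor the start class.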

2.1 Combining Marks

Combining marks must be accounted for in identifier syntax. A composed character sequence consisting of a base character followed by any number of combining marks must be valid for an identifier. This requirement results from the requirement for combining marks in the representation of many languages, and the conformance rules in Chapter 3 regarding interpretation of canonical-equivalent character sequences.

Enclosing combining marks (for example, U+20DD..U+20E0) are excluded from the syntactic definition of <identifier_extend>, because the composite characters that result from their composition with letters (for example, U+24B6 circled latin capital letter a) are themselves not normally considered valid constituents of these identifiers.
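The distinction between ordinary and enclosing combining marks can be checked directly against the General Category. The following sketch assumes Python's `unicodedata` module and uses a hypothetical helper name. It shows that nonspacing (Mn) and spacing (Mc) marks qualify, that the enclosing marks U+20DD..U+20E0 (Me) do not, and that a base letter plus combining mark is canonically equivalent to its precomposed form.

```python
import unicodedata


def is_identifier_mark(ch: str) -> bool:
    """Combining marks admitted by <identifier_extend>: Mn and Mc, but not Me."""
    return unicodedata.category(ch) in {"Mn", "Mc"}


# A composed character sequence: base letter followed by a combining mark.
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
# Its canonically equivalent precomposed form, U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# The enclosing marks cited in the text all carry General Category Me.
enclosing_categories = {unicodedata.category(chr(cp)) for cp in range(0x20DD, 0x20E1)}
```

Because both forms of the accented letter must be acceptable, an identifier syntax that admitted U+00E9 but rejected the sequence U+0065 U+0301 would treat canonically equivalent strings differently.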

2.2 Layout and Format Control Characters

The Unicode characters that are used to control joining behavior, bidirectional ordering control, and alternative formats for display are explicitly defined as not affecting breaking behavior. Unlike space characters or other delimiters, they do not serve to indicate word, line, or other unit boundaries. Accordingly, they should normally be ignored for the purposes of identifier definition. Implementations that cannot ignore characters in identifiers should exclude these characters.
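One possible reading of "ignored" is that format control characters are removed before identifiers are stored or compared. The sketch below takes that reading as an assumption. It identifies the characters by General Category Cf using Python's `unicodedata` module, and the helper names are hypothetical.

```python
import unicodedata


def strip_format_controls(identifier: str) -> str:
    """Drop General Category Cf characters (joiners, bidi marks, and so on)."""
    return "".join(ch for ch in identifier if unicodedata.category(ch) != "Cf")


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers while ignoring format control characters."""
    return strip_format_controls(a) == strip_format_controls(b)
```

For example, `same_identifier("na\u200cme", "name")` is true because ZERO WIDTH NON-JOINER is a format control, whereas a space is a delimiter and is never ignored.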

2.3 Specific Character Adjustments

Specific identifier syntaxes can be treated as tailorings of the generic syntax based on character properties. For example, SQL identifiers allow an underscore as an identifier part (but not as an identifier start); C identifiers allow an underscore as either an identifier part or an identifier start. Specific languages may also want to exclude the characters that have a decomposition_type other than canonical or none, or to exclude some subset of those, such as those with a decomposition_type equal to font.
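The tailorings described above might be sketched as follows. The example assumes Python's `unicodedata` module, whose `decomposition()` function returns the decomposition mapping with an optional `<type>` tag. All helper names and the two underscore sets are hypothetical and serve only to illustrate the SQL and C conventions mentioned in the text.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Hypothetical tailorings: extra characters admitted as identifier start.
SQL_EXTRA_START = frozenset()     # underscore only as an identifier part
C_EXTRA_START = frozenset({"_"})  # underscore as part or start


def is_tailored_start(ch: str, extra_start: frozenset) -> bool:
    """Generic start class plus the language-specific additions."""
    return ch in extra_start or unicodedata.category(ch) in START_CATEGORIES


def decomposition_type(ch: str) -> str:
    """Return 'none', 'canonical', or the tagged type such as 'font' or 'compat'."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    if mapping.startswith("<"):
        return mapping[1:mapping.index(">")]
    return "canonical"


def allowed_by_tailoring(ch: str, excluded_types=frozenset({"font"})) -> bool:
    """Exclude characters whose decomposition_type is in excluded_types."""
    return decomposition_type(ch) not in excluded_types
```

With the default `excluded_types`, U+1D400 MATHEMATICAL BOLD CAPITAL A is rejected because its decomposition is tagged `<font>`, while U+00E9 remains allowed because its decomposition is canonical.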

For programming language identifiers, normalization has a number of important implications. For a discussion of these issues, see Annex 7: Programming Language Identifiers in UAX #15, Unicode Normalization Forms [UAX15].

Note to reviewers: Would it be better to move that section into this UTR? Comments?

2.4 Backward Compatibility

Unicode General Category values are kept as stable as possible, but they can change across versions of the Unicode Standard. The Other_ID_Start property contains a small list of characters that qualified as <identifier_start> characters in some previous version of Unicode solely on the basis of their General Category properties, but that no longer qualify in the current version. In Unicode 4.0, this list consists of four characters:

U+2118 script capital p
U+212E estimated symbol
U+309B katakana-hiragana voiced sound mark
U+309C katakana-hiragana semi-voiced sound mark

The Other_ID_Start property is thus designed to ensure that the Unicode identifier specification is backward compatible: Any sequence of characters that qualified as an identifier in some version of Unicode will continue to qualify as an identifier in future versions.
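The effect of Other_ID_Start can be illustrated with a short check. The sketch assumes Python's `unicodedata` module, which reflects the Unicode version bundled with the interpreter. In that data none of the four characters has General Category L or Nl, so without the extra property they would stop qualifying as identifier starts. The function names are hypothetical.

```python
import unicodedata

# The four Other_ID_Start characters listed above for Unicode 4.0.
OTHER_ID_START = ["\u2118", "\u212E", "\u309B", "\u309C"]


def qualifies_by_general_category(ch: str) -> bool:
    """True if ch is an identifier start on General Category alone."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nl"


def is_identifier_start(ch: str) -> bool:
    """General Category test extended with the stability list."""
    return qualifies_by_general_category(ch) or ch in OTHER_ID_START
```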

3. Alternative Identifier Syntax

The downside of working with the syntactic classes defined above is the storage space needed for the detailed definitions, plus the fact that each new version of the Unicode Standard adds characters that an existing parser would not be able to recognize. In other words, the recommendations based on that table are not upwardly compatible.

One method to address this problem is to turn the question around. Instead of defining the set of code points that are allowed, define a small, fixed set of code points that are reserved for syntactic use and allow everything else (including unassigned code points) as part of an identifier. All parsers written to this specification would behave the same way for all versions of the Unicode Standard, because the classification of code points is fixed forever.

The drawback of this method is that it allows “nonsense” to be part of identifiers because the concerns of lexical classification and of human intelligibility are separated. Human intelligibility can, however, be addressed by other means, such as usage guidelines that encourage a restriction to meaningful terms for identifiers. For an example of such guidelines, see the XML 1.1 specification by the W3C [XML1.1].

By increasing the set of disallowed characters, a reasonably intuitive recommendation for identifiers can be achieved. This approach uses the full specification of identifier classes, as of a particular version of the Unicode Standard, and permanently disallows any characters not recommended in that version for inclusion in identifiers. All code points unassigned as of that version would be allowed in identifiers, so that any future additions to the standard would already be accounted for. This approach ensures both upwardly compatible identifier stability and a reasonable division of characters into those that do and do not make human sense as part of identifiers.

Some additional extensions to the list of disallowed code points can be made to further constrain “unnatural” identifiers. For example, one could include unassigned code points in blocks of characters set aside for future encoding as symbols, such as mathematical operators.

With or without such fine-tuning, such a compromise approach still incurs the expense of implementing large lists of code points. While they no longer change over time, it is a matter of choice whether the benefit of enforcing somewhat word-like identifiers justifies their cost.

Alternatively, one can use the properties described below, and allow all sequences of characters to be identifiers that are neither pattern syntax nor pattern whitespace. This has the advantage of simplicity and small tables, but allows many more “unnatural” identifiers.
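This last alternative can be sketched as follows. The range tables are copied from the proposed listing in Section 4.1 of this draft and would need to be replaced by the final property values. The function names are hypothetical, and treating the empty string as a non-identifier is an assumption of this sketch.

```python
# Ranges copied from the proposed Pattern_White_Space listing in Section 4.1.
PATTERN_WHITE_SPACE = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x2000, 0x200A), (0x200E, 0x200F), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]

# Ranges copied from the proposed Pattern_Syntax listing in Section 4.1.
PATTERN_SYNTAX = [
    (0x0021, 0x002F), (0x003A, 0x0040), (0x005B, 0x0060), (0x007B, 0x007E),
    (0x00A1, 0x00A7), (0x00A9, 0x00A9), (0x00AB, 0x00AC), (0x00AE, 0x00AE),
    (0x00B0, 0x00B1), (0x00B6, 0x00B7), (0x00BB, 0x00BB), (0x00BF, 0x00BF),
    (0x00D7, 0x00D7), (0x00F7, 0x00F7), (0x2010, 0x2027), (0x2030, 0x205E),
    (0x2190, 0x2BFF), (0x3001, 0x3003), (0x3008, 0x3020), (0x3030, 0x3030),
    (0xFD3E, 0xFD3F), (0xFE45, 0xFE46),
]


def _in_ranges(code_point: int, ranges) -> bool:
    return any(first <= code_point <= last for first, last in ranges)


def is_reserved(ch: str) -> bool:
    """True if ch is pattern whitespace or pattern syntax."""
    cp = ord(ch)
    return _in_ranges(cp, PATTERN_WHITE_SPACE) or _in_ranges(cp, PATTERN_SYNTAX)


def is_stable_identifier(text: str) -> bool:
    """Any non-empty run of code points that are neither syntax nor whitespace."""
    return bool(text) and not any(is_reserved(ch) for ch in text)
```

The tables never need updating, which is the point of the approach, but the result is permissive: `is_stable_identifier("42")` and `is_stable_identifier("\u0378")` (an unassigned code point at the time of writing) both hold, while `"a_b"` is rejected because LOW LINE falls in the range 005B..0060.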

4. Pattern Syntax

There are many circumstances where software interprets patterns that are a mixture of literal characters, whitespace, and syntax characters. Examples include regular expressions, Java collation rules, Excel or ICU number formats, and many others. These patterns have been very limited in the past, and forced to use clumsy combinations of ASCII characters for their syntax. As Unicode becomes ubiquitous, some of these will start to use non-ASCII characters for their syntax: first as more readable optional alternatives, then eventually as the standard syntax.

For forwards and backwards compatibility, it is very advantageous to have a fixed set of whitespace and syntax code points for use in patterns. This follows the recommendations that the Unicode Consortium made regarding completely stable identifiers, and the practice that is seen in XML 1.1 [XML1.1]. (In particular, the Consortium committed to not allocating characters suitable for identifiers in the range U+2190..U+2BFF, which is being used by XML 1.1.)

With a fixed set of whitespace and syntax code points, a pattern language can then have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. With this policy, the pattern language preserves the freedom to extend its syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results.

Example:

In version 1.3 of program X, '≈' is a reserved syntax character; that is, it doesn't perform an operation, but it must be quoted when used as a literal. In version 1.4, '≈' gets a real meaning, for example, uppercasing the subsequent characters. In this example, '\' quotes the next character; that is, it causes the character to be treated as a literal instead of a syntax character.

The pattern abc...\≈...xyz works on both versions 1.3 and 1.4, and refers to the literal character since it is quoted in both cases.
The pattern abc...≈...xyz works on version 1.4 and uppercases the following characters. On version 1.3, the engine (rightfully) has no idea what to do with ≈. Rather than silently fail (by ignoring ≈ or turning it into a literal), it has the opportunity to signal an error.
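The behavior in this example can be sketched with a toy tokenizer. Everything here is hypothetical: program X, its operator sets, and the token format are invented for illustration, and `is_syntax` uses only an abbreviated subset of the proposed Pattern_Syntax ranges (ASCII punctuation plus 2190..2BFF). The patterns in the usage omit the "..." placeholders, since FULL STOP is itself a syntax character.

```python
APPROX = "\u2248"  # ALMOST EQUAL TO, inside the 2190..2BFF block range

# Hypothetical operator sets for two versions of "program X".
OPERATORS_V1_3 = frozenset()          # APPROX is reserved but has no meaning yet
OPERATORS_V1_4 = frozenset({APPROX})  # APPROX now means "uppercase what follows"


def is_syntax(ch: str) -> bool:
    """Abbreviated subset of the proposed Pattern_Syntax ranges (illustration only)."""
    cp = ord(ch)
    return (0x21 <= cp <= 0x2F or 0x3A <= cp <= 0x40 or 0x5B <= cp <= 0x60
            or 0x7B <= cp <= 0x7E or 0x2190 <= cp <= 0x2BFF)


def tokenize(pattern: str, operators: frozenset) -> list:
    """Split a pattern into ('literal', ch) and ('operator', ch) tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Backslash quotes the next character, making it a literal.
            if i + 1 >= len(pattern):
                raise ValueError("dangling quote character at end of pattern")
            tokens.append(("literal", pattern[i + 1]))
            i += 2
            continue
        if is_syntax(ch):
            if ch not in operators:
                # Reserved but unknown: signal an error rather than guess.
                raise ValueError(f"unquoted syntax character U+{ord(ch):04X}")
            tokens.append(("operator", ch))
        else:
            tokens.append(("literal", ch))
        i += 1
    return tokens
```

With these definitions, `tokenize("abc\\" + APPROX + "xyz", ...)` yields the same all-literal token list for both operator sets, `tokenize("abc" + APPROX + "xyz", OPERATORS_V1_4)` yields an operator token, and the same call with `OPERATORS_V1_3` raises `ValueError`.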

This document provides a recommended set of code points that can be used for such pattern whitespace and syntax characters. Particular pattern languages may, of course, override these recommendations (for example, adding or removing other characters for compatibility in ASCII). But providing a list of these in UCD properties gives pattern languages a stable, common basis for future expansion.

For stability, the property values will be absolutely invariant, not changing with successive versions of Unicode. Of course, this doesn't limit the ability of the Unicode Standard to add more symbol or whitespace characters, but the syntax and whitespace characters recommended for use in patterns would not change.

When generating rules or patterns, all whitespace and syntax code points that are to be literals would require quoting (using whatever quoting mechanism is available). For readability, it is recommended practice to quote or escape all literal whitespace and default ignorable code points as well.

Example: consider the following, where the items in angle brackets indicate literal characters.

a<SPACE>b => x<ZERO WIDTH SPACE>y + z;

Since <SPACE> is a Pattern_White_Space character, it would require quoting. Since <ZERO WIDTH SPACE> is a default ignorable character, it should also be quoted for readability. So if, in this example, \uXXXX is used as a hexadecimal escape that is resolved before quoting is processed, and single quotes are used for quoting, this might be expressed as:

'a\u0020b' => 'x\u200By' + z;
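A generator that follows this practice might look like the sketch below. It is an assumption-laden illustration: the whitespace ranges are copied from the proposed listing in Section 4.1, default ignorable code points are approximated by General Category Cf (Python's `unicodedata` module does not expose Default_Ignorable_Code_Point), and the function names are hypothetical. The sketch handles only code points up to U+FFFF and does not handle a quote character inside the literal.

```python
import unicodedata

# Ranges copied from the proposed Pattern_White_Space listing in Section 4.1.
PATTERN_WHITE_SPACE = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x2000, 0x200A), (0x200E, 0x200F), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]


def needs_escape(ch: str) -> bool:
    """Pattern whitespace, or (approximation) a default ignorable code point."""
    cp = ord(ch)
    if any(first <= cp <= last for first, last in PATTERN_WHITE_SPACE):
        return True
    # Assumption: General Category Cf stands in for Default_Ignorable_Code_Point.
    return unicodedata.category(ch) == "Cf"


def quote_literal(text: str) -> str:
    """Wrap text in single quotes, writing hard-to-see characters as \\uXXXX."""
    body = "".join(f"\\u{ord(ch):04X}" if needs_escape(ch) else ch for ch in text)
    return "'" + body + "'"


# Rebuild the rule from the example above.
rule = quote_literal("a b") + " => " + quote_literal("x\u200by") + " + z;"
```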

The two pattern properties proposed for the next appropriate version of the UCD are Pattern_White_Space and Pattern_Syntax. The contents are presented here for review; they would be removed once incorporated into the [UCD]. The contents were derived as follows:

The proposed Pattern_White_Space characters were originally derived from White_Space by removing some characters that appeared inappropriate for patterns, and adding LRM and RLM. However, once we settle on their contents, they would be immutable from then on.
- The LRM and RLM are added so as to allow easier use of Arabic and Hebrew in Patterns. For example, a rule like:
  
  X / W => Y* / Z ;
  
  becomes almost unreadable when some of the W..Z are right-to-left (RTL) characters (e.g. Arabic or Hebrew) and others are left-to-right (LTR) characters. However, by surrounding the RTL strings by LRM (or the LTR characters by RLM), the rules can be made readable.
- The compatibility characters are removed.
The proposed Pattern_Syntax code points were derived from the following set, then some script-specific characters were removed, along with some other characters that appeared inappropriate for patterns.
- [[:gc=s:] | [:gc=p:] | [\u2190-\u2BFF]]
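The starting set can be approximated with the following sketch, which assumes Python's `unicodedata` module and therefore reflects the General Category values of the Unicode version bundled with the interpreter. The function name is hypothetical. The manual removals described above are not reproduced, so the result is a superset of the proposed property.

```python
import unicodedata


def in_seed_set(code_point: int) -> bool:
    """Membership in [[:gc=s:] | [:gc=p:] | [\\u2190-\\u2BFF]]."""
    if 0x2190 <= code_point <= 0x2BFF:
        # Whole block range, including unassigned code points.
        return True
    # General Category S* (symbols) or P* (punctuation).
    return unicodedata.category(chr(code_point))[0] in ("S", "P")
```

For example, `in_seed_set(0x0E3F)` is true because THAI CURRENCY SYMBOL BAHT is a currency symbol, yet that script-specific character does not appear in the proposed Pattern_Syntax listing.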

4.1 Proposed Pattern Properties

0009..000D ; Pattern_White_Space # <CHARACTER TABULATION>..<CARRIAGE RETURN (CR)>
0020       ; Pattern_White_Space # SPACE
0085       ; Pattern_White_Space # <NEXT LINE (NEL)>
00A0       ; Pattern_White_Space # NO-BREAK SPACE
2000..200A ; Pattern_White_Space # EN QUAD..HAIR SPACE
200E..200F ; Pattern_White_Space # LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
2028       ; Pattern_White_Space # LINE SEPARATOR
2029       ; Pattern_White_Space # PARAGRAPH SEPARATOR
202F       ; Pattern_White_Space # NARROW NO-BREAK SPACE
205F       ; Pattern_White_Space # MEDIUM MATHEMATICAL SPACE
3000       ; Pattern_White_Space # IDEOGRAPHIC SPACE

# Latin-1

0021..002F ; Pattern_Syntax # EXCLAMATION MARK..SOLIDUS
003A..0040 ; Pattern_Syntax # COLON..COMMERCIAL AT
005B..0060 ; Pattern_Syntax # LEFT SQUARE BRACKET..GRAVE ACCENT
007B..007E ; Pattern_Syntax # LEFT CURLY BRACKET..TILDE
00A1..00A7 ; Pattern_Syntax # INVERTED EXCLAMATION MARK..SECTION SIGN
00A9       ; Pattern_Syntax # COPYRIGHT SIGN
00AB..00AC ; Pattern_Syntax # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK..NOT SIGN
00AE       ; Pattern_Syntax # REGISTERED SIGN
00B0..00B1 ; Pattern_Syntax # DEGREE SIGN..PLUS-MINUS SIGN
00B6..00B7 ; Pattern_Syntax # PILCROW SIGN..MIDDLE DOT
00BB       ; Pattern_Syntax # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
00BF       ; Pattern_Syntax # INVERTED QUESTION MARK
00D7       ; Pattern_Syntax # MULTIPLICATION SIGN
00F7       ; Pattern_Syntax # DIVISION SIGN

# General punctuation, may include currently unassigned code points

2010..2027 ; Pattern_Syntax # HYPHEN..HYPHENATION POINT
2030..205E ; Pattern_Syntax # PER MILLE SIGN..<unassigned>

# Whole blocks
#   Arrows, Mathematical Operators, Miscellaneous Technical,
#   Control Pictures, Optical Character Recognition
#   Enclosed Alphanumerics, Box Drawing, Block Elements,
#   Geometric Shapes, Miscellaneous Symbols, Dingbats
#   Miscellaneous Mathematical Symbols-A, Supplemental Arrows-A,
#   Braille Patterns, Supplemental Arrows-B, Miscellaneous Mathematical Symbols-B, 
#   Supplemental Mathematical Operators, Miscellaneous Symbols and Arrows
#   NOTE: may include currently unassigned code points

2190..2BFF ; Pattern_Syntax # LEFTWARDS ARROW..<unassigned-2BFF>

# CJK Symbols and Punctuation

3001..3003 ; Pattern_Syntax # IDEOGRAPHIC COMMA..DITTO MARK
3008..3020 ; Pattern_Syntax # LEFT ANGLE BRACKET..POSTAL MARK FACE
3030       ; Pattern_Syntax # WAVY DASH

# Arabic Presentation Forms-A (should have been encoded elsewhere)

FD3E..FD3F ; Pattern_Syntax # ORNATE LEFT PARENTHESIS..ORNATE RIGHT PARENTHESIS

# CJK Compatibility Forms

FE45..FE46 ; Pattern_Syntax # SESAME DOT..WHITE SESAME DOT
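A listing in this format is straightforward to load. The sketch below is illustrative only: it parses lines of the form `XXXX..YYYY ; Property # comment` into range tables, using a few lines copied from the listing above as sample input. The function names are hypothetical.

```python
SAMPLE = """\
0009..000D ; Pattern_White_Space # <CHARACTER TABULATION>..<CARRIAGE RETURN (CR)>
0020       ; Pattern_White_Space # SPACE

# Latin-1

0021..002F ; Pattern_Syntax # EXCLAMATION MARK..SOLIDUS
2190..2BFF ; Pattern_Syntax # LEFTWARDS ARROW..<unassigned-2BFF>
"""


def parse_property_lines(text: str) -> dict:
    """Map each property name to a list of (first, last) code point ranges."""
    properties = {}
    for raw_line in text.splitlines():
        # Everything after '#' is a comment; blank lines are skipped.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        code_field, name = (part.strip() for part in line.split(";"))
        if ".." in code_field:
            first, last = code_field.split("..")
        else:
            first = last = code_field
        properties.setdefault(name, []).append((int(first, 16), int(last, 16)))
    return properties


def has_property(properties: dict, name: str, ch: str) -> bool:
    """True if ch falls in one of the ranges recorded for the named property."""
    return any(first <= ord(ch) <= last for first, last in properties.get(name, []))
```

Because the property values are intended to be invariant, a table built this way could be generated once and compiled into a pattern engine without later updates.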

Note to Reviewers: should the above Arabic Presentation Forms-A and CJK Compatibility Forms be retained?

Acknowledgements

Thanks to Eric Muller for feedback on this document.

References

[Feedback]	Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[UCD]	Unicode Character Database. http://www.unicode.org/ucd For an overview of the Unicode Character Database and a list of its associated files.
[Unicode]	The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1.
[UAX15]	UAX #15, Unicode Normalization Forms http://www.unicode.org/reports/tr15/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/versions/ For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.
[XML1.1]	Extensible Markup Language (XML) 1.1 http://www.w3.org/TR/xml11/

Modifications

The following summarizes modifications from the previous version of this document.

2	Modified Pattern White Space to remove compatibility characters. Added example explaining use of Pattern White Space.
1	First version: incorporated section from Unicode 4.0 on Identifiers plus new section on patterns.

Copyright © 2000-2004 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.