Unicode Technical Report #18

Unicode Regular Expression Guidelines

Version	6
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	2002-04-21
This Version	http://www.unicode.org/reports/tr18/tr18-6
Previous Version	http://www.unicode.org/reports/tr18/tr18-5.1
Latest Version	http://www.unicode.org/reports/tr18
Tracking Number	6

Summary

This document describes guidelines for how to adapt regular expression engines to use Unicode. The document is in an initial phase and has not gone through the editing process. We welcome review feedback and suggestions on the content.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Technical Report. It is a stable document and may be used as reference material or cited as a normative reference from another document.

A Unicode Technical Report (UTR) may contain either informative material or normative specifications, or both. Each UTR may specify a base version of the Unicode Standard. In that case, conformance to the UTR requires conformance to that version or higher.

A list of current Unicode Technical Reports is found on [Reports]. For more information about versions of the Unicode Standard, see [Versions]. Please mail corrigenda and other comments to the author(s).

1 Introduction
- 1.1 Notation
2 Basic Unicode Support: Level 1
3 Extended Unicode Support: Level 2
4 Tailored Support: Level 3
Annex A. Character Blocks
Annex B. Sample Collation Character Code
References
Acknowledgments
Modifications

1 Introduction

The following describes general guidelines for extending regular expression engines to handle Unicode. The following issues are involved in such extensions.

Unicode is a large character set—regular expression engines that are only adapted to handle small character sets will not scale well.
Unicode encompasses a wide variety of languages which can have very different characteristics than English or other western European text.

There are three fundamental levels of Unicode support that can be offered by regular expression engines:

Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or UTF-32.) This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. The results of regular expression matching at this level are independent of country or language. At this level, the user of the regular expression engine would need to write more complicated regular expressions to do full Unicode processing.
Level 2: Extended Unicode Support. At this level, the regular expression engine also accounts for grapheme clusters (what the end-user generally thinks of as a character), better word-break, and canonical equivalence. This is still a default level—independent of country or language—but provides much better support for end-user expectations than the raw level 1, without the regular-expression writer needing to know about some of the complications of Unicode encoding structure.
Level 3: Tailored Support. At this level, the regular expression engine also provides for tailored treatment of characters (including country- or language-specific behavior), for example, whereby the characters ch can behave as a single character (in Slovak or traditional Spanish). The results of a particular regular expression reflect the end users' expectations of what constitutes a character in their language, and what order the characters are in. However, there is a performance impact to support at this level.

One of the most important requirements for a regular expression engine is to document clearly what Unicode features are and are not supported. Even if higher-level support is not currently offered, provision should be made for the syntax to be extended in the future to encompass those features.

Note: Unicode is a constantly evolving standard: new characters will be added in the future. This means that a regular expression that tests for, say, currency symbols will have different results in Unicode 2.0 than in Unicode 2.1 (where the Euro currency symbol was added.)

At any level, efficiently handling properties or conditions based on a large character set can take a lot of memory. A common mechanism for reducing the memory requirements — while still maintaining performance — is the two-stage table, discussed in Section 5.1 Transcoding to Other Standards [Chap5] of The Unicode Standard. For example, the Unicode character properties can be stored in memory in a two-stage table with only 7 or 8Kbytes. Accessing those properties only takes a small amount of bit-twiddling and two array accesses.
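The two-stage table idea can be sketched as follows. This is a minimal illustration, not the layout used by any particular engine: the block size of 256 code points, the function names `build_two_stage` and `lookup`, and the use of Python's `unicodedata.category` as the property source are all assumptions made for the example. Identical blocks are stored once in the second stage, and a lookup costs one shift, one mask, and two array accesses.

```python
import unicodedata

BLOCK_SHIFT = 8                  # assumed block size: 256 code points
BLOCK_SIZE = 1 << BLOCK_SHIFT
BLOCK_MASK = BLOCK_SIZE - 1


def build_two_stage(prop, limit=0x110000):
    """Build (stage1, stage2) for a per-code-point property function."""
    stage1 = []      # one entry per block: index into stage2
    stage2 = []      # the distinct blocks of property values
    seen = {}        # block contents -> index in stage2
    for start in range(0, limit, BLOCK_SIZE):
        block = tuple(prop(chr(cp)) for cp in range(start, start + BLOCK_SIZE))
        index = seen.get(block)
        if index is None:
            index = len(stage2)
            seen[block] = index
            stage2.append(block)
        stage1.append(index)
    return stage1, stage2


def lookup(stage1, stage2, cp):
    """Two array accesses plus a shift and a mask."""
    return stage2[stage1[cp >> BLOCK_SHIFT]][cp & BLOCK_MASK]


stage1, stage2 = build_two_stage(unicodedata.category)
```

Because large stretches of the code space share the same property values, the number of distinct blocks in `stage2` is much smaller than the number of entries in `stage1`.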

1.1 Notation

In order to describe regular expression syntax, we will use an extended BNF form:

`x y`	the sequence consisting of x then y
`x*`	zero or more occurrences of x
`x?`	zero or one occurrence of x
`x | y`	either x or y
`( x )`	for grouping
`"XYZ"`	terminal character(s)

The following syntax for character ranges will be used in successive examples.

Note: This is only a sample syntax for the purposes of examples in this paper. (Regular expression syntax varies widely: the issues discussed here would need to be adapted to the syntax of the particular implementation. In general, the syntax here is similar to that of Perl Regular Expressions [Perl].)

LIST := "[" NEGATION? ITEM (SEP? ITEM)* "]"
ITEM := CODE_POINT
       := <character> "-" CODE_POINT // range
       := ESCAPE CODE_POINT

NEGATION := "^"
SEP := ""  // no separator = union 
    := "|" // union
ESCAPE := "\"

Code_point refers to any Unicode code point from U+0000 to U+10FFFF, although typically the only ones of interest will be those representing characters. Whitespace is allowed between any elements, but to simplify the presentation the many occurrences of " "* are omitted.

Examples:

`[a-z | A-Z | 0-9]`	Match ASCII alphanumerics
`[a-z A-Z 0-9]`
`[a-zA-Z0-9]`
`[^a-z A-Z 0-9]`	Match anything but ASCII alphanumerics
`[\] \- \ ]`	Match the literal characters ], -, and space

2 Basic Unicode Support: Level 1

Regular expression syntax usually allows for an expression to denote a set of single characters, such as [a-z,A-Z,0-9]. Since there are a very large number of characters in the Unicode standard, simple list expressions do not suffice.

2.1 Hex notation

The character set used by the regular expression writer may not be Unicode, so there needs to be some way to specify arbitrary Unicode characters. The most standard notation for listing hex Unicode characters within strings is by prefixing with "\u". Making this change results in the following addition:

<character> := <simple_character>
<character> := ESCAPE UTF16_MARK
               HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR

UTF16_MARK := "u"

Examples:

`[\u3040-\u309F \u30FC]`	Match Hiragana characters, plus prolonged sound sign
`[\u00B2 \u2082]`	Match superscript and subscript 2

Note: instead of [...\u3040...], one possible alternate syntax is [...\x{3040}...], as in Perl 5.6.
Note: more advanced regular expression engines can also offer the ability to use the Unicode character name in braces for readability. For control characters (marked with "<control>" in the Unicode Character Database), the Unicode 1.0 name can be used. Examples:
- \N{WHITE SMILING FACE} instead of \u263A
- \N{GREEK SMALL LETTER ALPHA} instead of \u03B1
- \N{FORM FEED} instead of \u000C
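As an illustration of hex and name notation in an existing engine, the sketch below uses Python's `re` module, which interprets `\uXXXX` escapes inside patterns, and `unicodedata.lookup`, which resolves character names. The variable and function names are chosen for this example only, and Python's syntax differs in detail from the sample syntax used in this document.

```python
import re
import unicodedata

# Raw strings keep the backslash, so the re module itself interprets \uXXXX.
hiragana = re.compile(r"[\u3040-\u309F\u30FC]")   # Hiragana plus prolonged sound sign
super_sub_two = re.compile(r"[\u00B2\u2082]")     # superscript and subscript 2


def by_name(name):
    """Return the character with the given Unicode character name."""
    return unicodedata.lookup(name)
```

For example, `by_name("GREEK SMALL LETTER ALPHA")` returns the same character as the escape `\u03B1`.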

2.2 Properties

Since Unicode is a large character set, a regular expression engine needs to provide for the recognition of whole categories of characters; otherwise the listing of characters becomes impractical and error-prone. Engines should be extended using the Unicode character properties. For example, what the regular expression means by a digit should match any of the Unicode digits, and so on.

The official data mapping Unicode characters (and code points) to properties is the Unicode Character Database [UCD]. See also Chapter 4: Character Properties [Chap4] in The Unicode Standard.

The recommended names for UCD properties and property values are in PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue]. There are both abbreviated names and longer, more descriptive names. It is strongly recommended that both names be recognized, and that loose matching of property names be used, whereby the case distinctions, whitespace, hyphens, and underbar are ignored.
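Loose matching of property names might be implemented along the following lines. This is a sketch only: the `ALIASES` table is a hypothetical, deliberately tiny stand-in for the data in PropertyAliases.txt and PropertyValueAliases.txt, and the names `loose_key` and `resolve` are invented for the example.

```python
def loose_key(name):
    """Drop case distinctions, whitespace, hyphens, and underbars."""
    return "".join(
        ch for ch in name.casefold() if not ch.isspace() and ch not in "-_"
    )


# Hypothetical excerpt: canonical name -> accepted short and long names.
_NAMES = {
    "Lu": ["Lu", "Uppercase_Letter"],
    "Nd": ["Nd", "Decimal_Number"],
    "White_Space": ["WSpace", "White_Space"],
}

ALIASES = {
    loose_key(alias): canonical
    for canonical, aliases in _NAMES.items()
    for alias in aliases
}


def resolve(name):
    """Return the canonical name for a loosely written one, or None."""
    return ALIASES.get(loose_key(name))
```

With this table, `resolve("uppercase letter")`, `resolve("UPPERCASE-LETTER")`, and `resolve("lu")` all return `"Lu"`.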

Note: it may be a useful implementation technique to load the Unicode tables that support properties and other features on demand, to avoid unnecessary memory overhead for simple regular expressions that don't use those properties.

Where a regular expression is expressed as much as possible in terms of higher-level semantic constructs such as Letter, it makes it practical to work with the different alphabets and languages in Unicode. Here is an example of a syntax addition that permits properties:

ITEM := "\p{" NEGATION? PROP_SPEC "}"

PROP_SPEC := <binary_unicode_property>
PROP_SPEC := <unicode_property> (":" | "=") <unicode_property_value>
PROP_SPEC := <script_or_category_property_value>

Examples:

`[\p{L} \p{Nd}]`	Match all letters and decimal digits
`[\p{letter} \p{decimal number}]`	Match all letters and decimal digits
`\p{^L}`	Match anything that is not a letter
`\p{^letter}`	Match anything that is not a letter
`\p{Line_Break:Alphabetic}`	Match anything that has the Line Break property value of Alphabetic
`\p{Whitespace}`	Match anything that has the binary property Whitespace

Some properties are binary: they are either true or false for a given code point. In that case, only the property name is required. Others have multiple values, so for uniqueness both the property name and the property value need to be included. For example, Alphabetic is both a binary property and a value of the Line_Break enumeration, so \p{Alphabetic} would mean the binary property, and \p{Line Break:Alphabetic} or \p{Line_Break=Alphabetic} would mean the enumerated property. There are two exceptions to this: the properties Script and General Category commonly have the property name omitted. Thus \p{Not_Assigned} is equivalent to \p{General_Category = Not_Assigned}, and \p{Greek} is equivalent to \p{Script:Greek}.

General Category Property

The most basic overall character property is the General Category, which is a basic categorization of Unicode characters into: Letters, Punctuation, Symbols, Marks, Numbers, Separators, and Other. These property values each have a single letter abbreviation, which is the uppercase first character except for separators, which use Z. The official data mapping Unicode characters to the General Category value is in UnicodeData.txt [UData].

Each of these categories has different subcategories. For example, the subcategories for Letter are uppercase, lowercase, titlecase, modifier, and other (in this case, other includes uncased letters such as Chinese). By convention, the subcategory is abbreviated by the category letter (in uppercase), followed by the first character of the subcategory in lowercase. For example, Lu stands for Uppercase Letter.

Note: Since it is recommended that the property syntax be lenient as to spaces, casing, hyphens and underbars, any of the following should be equivalent: \p{Lu}, \p{lu}, \p{uppercase letter}, \p{Uppercase Letter}, \p{Uppercase_Letter}, and \p{uppercaseletter}
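The category and subcategory convention makes a prefix test sufficient for matching either level. The sketch below assumes Python's `unicodedata.category`, which returns the two-letter abbreviation for a character in the Unicode version bundled with the interpreter; the helper name `in_category` is invented for the example.

```python
import unicodedata


def in_category(ch, spec):
    """True if ch's General Category equals spec ("Lu") or falls under it ("L")."""
    return unicodedata.category(ch).startswith(spec)
```

Here `in_category("A", "L")` and `in_category("A", "Lu")` are both true, while `in_category("5", "L")` is false.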

The General Category property values are listed below. For more information on the meaning of these values, see UnicodeData.html [UDataDoc].

Abb.	Long form
L	Letter
Lu	Uppercase Letter
Ll	Lowercase Letter
Lt	Titlecase Letter
Lm	Modifier Letter
Lo	Other Letter
M	Mark
Mn	Non-Spacing Mark
Mc	Spacing Combining Mark
Me	Enclosing Mark
N	Number
Nd	Decimal Digit Number
Nl	Letter Number
No	Other Number

Abb.	Long form
S	Symbol
Sm	Math Symbol
Sc	Currency Symbol
Sk	Modifier Symbol
So	Other Symbol
P	Punctuation
Pc	Connector Punctuation
Pd	Dash Punctuation
Ps	Open Punctuation
Pe	Close Punctuation
Pi	Initial Punctuation
Pf	Final Punctuation
Po	Other Punctuation

Abb.	Long form
Z	Separator
Zs	Space Separator
Zl	Line Separator
Zp	Paragraph Separator
C	Other
Cc	Control
Cf	Format
Cs	Surrogate
Co	Private Use
Cn	Not Assigned
-	Any*
-	Assigned*

The last two properties are not part of the General Category, but are generally useful.

Any matches all code points. This could also be captured with \u0000-\u10FFFF, except for reasons we will get into later. In some regular expression languages, \p{Any} may be expressed by a period, but that may exclude newline characters.
Assigned is equivalent to \p{^Cn}, and matches all assigned characters (for the target version of Unicode). It also includes all private use characters. It is useful for avoiding confusing double negatives. Note that Cn includes noncharacters, so Assigned excludes them.

Script Property

A regular-expression mechanism may choose to offer the ability to identify characters on the basis of other Unicode properties besides the General Category. In particular, Unicode characters are also divided into scripts as described in UTR #24: Script Names [ScriptDoc] (for the data file, see Scripts.txt [ScriptData]). Using a property such as \p{Greek} allows people to test letters for whether they are Greek or not.

The script property values generally only pertain to letters. Other characters such as punctuation and accents are found in two special Script names from UTR #24: \p{Common} and \p{Inherited}. In general, programs should only use specific script values in conjunction with both Common and Inherited. That is, to sift out characters clearly not appropriate for Greek, one would use:

[\p{Greek}\p{Common}\p{Inherited}]

Since Common also includes all code points from U+0000 to U+10FFFF that are not in the other script categories, including unassigned characters, one may want to refine this more, such as by using:

[\p{Greek}\p{Common}\p{Inherited} - \p{Not Assigned}]

Other Properties

Other useful properties are described in the documentation for the Unicode Character Database, cited above. The binary properties include:

Alphabetic, Ideographic
Lowercase, Uppercase
- Note: these are larger classes of characters than the General Category Lowercase_Letter and Uppercase_Letter
White_Space, Bidi_Control, Join_Control
ASCII_Hex_Digit, Hex_Digit
Noncharacter_Code_Point
ID_Start, ID_Continue, XID_Start, XID_Continue
NF*_NO, NF*_MAYBE

The enumerated properties include:

Decomposition_Type
Numeric_Type
East_Asian_Width
Line_Break

A full list of the available properties is in PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].

Blocks

Unicode blocks can sometimes also be a useful enumerated property. However, there are some very significant caveats to the use of Unicode blocks for the identification of characters: see Annex A. Character Blocks. If blocks are used, some of the names can collide with Script names, so they should be distinguished, such as in \p{Greek Block} or \p{Block=Greek}.

2.3 Subtraction and Intersection

With a large character set, character properties are essential. In addition, there needs to be a way to "subtract" characters from what is already in the list. For example, one may want to include all letters but Q and W without having to list every character in \p{letter} that is neither Q nor W. Here is an example of a syntax change that handles this, by allowing subtraction of any further items in the set.

ITEM := "[" ITEM "]" // for grouping
SEP := " "           // no separator = union 
    := "|"           // union
    := "-"           // removal
    := "&"           // intersection

Note that "-" between characters still means a range, not a removal.

Examples:

`[\p{L} - QW]`	Match all letters but Q and W
`[\p{N} - [\p{Nd} - 0-9]]`	Match all non-decimal numbers, plus 0-9.
`[\u0000-\u007F - ^\p{letter}]`	Match all letters in the ASCII range, by subtracting non-letters.
`[\p{greek} - \N{GREEK SMALL LETTER ALPHA}]`	Match Greek letters except alpha
`[\p{assigned} - a-f A-F 0-9]`	Match all assigned characters except for hex digits.
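The semantics of removal and intersection can be modeled directly with sets of code points. The sketch below is illustrative only: it restricts the universe to code points below U+0250 to keep the sets small, and the helper names (`prop`, `chars`, `span`) are assumptions made for this example rather than part of any engine.

```python
import unicodedata

LIMIT = 0x0250                     # assumed small universe for the example
UNIVERSE = set(range(LIMIT))


def prop(prefix):
    """Code points whose General Category starts with prefix."""
    return {cp for cp in UNIVERSE if unicodedata.category(chr(cp)).startswith(prefix)}


def chars(text):
    return {ord(ch) for ch in text}


def span(first, last):
    return set(range(ord(first), ord(last) + 1))


letters = prop("L")

# [\p{L} - QW]
letters_but_qw = letters - chars("QW")

# [\p{N} - [\p{Nd} - 0-9]]
numbers = prop("N") - (prop("Nd") - span("0", "9"))

# [\u0000-\u007F - ^\p{letter}]  (negation is complement within the universe)
ascii_letters = set(range(0x80)) - (UNIVERSE - letters)

# Intersection ("&") gives the same result as the double removal above.
ascii_letters_by_intersection = set(range(0x80)) & letters
```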

Note: with Perl-style syntax, one would use look-ahead to get the same effect as difference or intersection. For example, [\u0000-\u03FF - aeiouy] would be expressed as (?=[\x{0000}-\x{03FF}])[^aeiouy]. This looks ahead to see if the next character matches [\u0000-\u03FF], then checks that the character is not an English vowel.
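The look-ahead technique carries over to other engines with Perl-like syntax. As a small illustration, Python's `re` module accepts the same construction with `\u` escapes in place of `\x{...}`; the pattern name is chosen for this example.

```python
import re

# Look ahead for a character in U+0000..U+03FF, then require a non-vowel.
non_vowel_below_0400 = re.compile(r"(?=[\u0000-\u03FF])[^aeiouy]")

# "h", e-acute, "l", "l", "o", space, capital omega, Cyrillic small ya
sample = "h\u00E9llo \u03A9\u044F"
matches = non_vowel_below_0400.findall(sample)
```

In the sample, "o" is rejected by the negated class and the Cyrillic letter is rejected by the look-ahead, so neither appears in `matches`.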

2.4 Simple Word Boundaries

Most regular expression engines allow a test for word boundaries (such as by "\b" in Perl). They generally use a very simple mechanism for determining word boundaries: a word boundary is between any pair of characters where one is a <word_character> and the other is not. A basic extension of this to work for Unicode is to make sure that the class of <word_character> includes all the Letter values from the Unicode character database, from UnicodeData.txt [UData].

Level 2 provides more general support for word boundaries between arbitrary Unicode characters.
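A minimal sketch of the simple Level 1 rule follows, assuming that <word_character> is exactly the set of characters whose General Category is a Letter value (real engines usually also include digits and connector punctuation). The function names are invented for the example, and positions outside the string are treated as non-word characters.

```python
import unicodedata


def is_word_character(ch):
    """Assumed word-character class: any General Category Letter value."""
    return unicodedata.category(ch).startswith("L")


def simple_word_boundaries(text):
    """Offsets where exactly one neighbor is a word character."""
    boundaries = []
    for i in range(len(text) + 1):
        before = i > 0 and is_word_character(text[i - 1])
        after = i < len(text) and is_word_character(text[i])
        if before != after:
            boundaries.append(i)
    return boundaries
```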

2.5 Simple Loose Matches

The only loose matching that most regular expression engines offer is caseless matching. If the engine does offer this, then it must account for the large range of cased Unicode characters outside of ASCII. In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase. Level 1 implementations need to handle these cases. For example, the Greek U+03C3 "σ" small sigma, U+03C2 "ς" small final sigma, and U+03A3 "Σ" capital sigma must all match.

Some caseless matches may match one character against two: for example, U+00DF "ß" matches the two characters "SS". However, because many implementations are not set up to handle this, at Level 1 only simple case matches are necessary. To correctly implement a caseless match, see UAX #21: Case Mappings [Case]. The data file supporting caseless matching is CaseFolding.txt [CaseData].
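A simple, one-to-one caseless comparison can be approximated as below. Python's standard library exposes only full case folding (`str.casefold`), so this sketch keeps a folded result only when it is a single character and otherwise falls back to `str.lower`; that fallback is an approximation of the simple mappings in CaseFolding.txt, not a reading of the data file, and the function names are invented for the example.

```python
def simple_fold(ch):
    """Approximate simple case folding for one character."""
    folded = ch.casefold()
    # Multi-character foldings (such as sharp s) are outside simple matching.
    return folded if len(folded) == 1 else ch.lower()


def simple_caseless_equal(a, b):
    """Compare two strings character by character under simple folding."""
    return len(a) == len(b) and all(
        simple_fold(x) == simple_fold(y) for x, y in zip(a, b)
    )
```

Under this comparison the three sigmas match one another, while U+00DF does not match "SS" because the lengths differ.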

If the implementation containing the regular expression engine also offers case conversions, then these should also be done in accordance with UAX #21. The relevant data files are SpecialCasing.txt [SpecialCasing] and UnicodeData.txt [UData]. A level 1 implementation might not offer anything but simple, default, case-insensitive conversions.

2.6 End Of Line

Most regular expression engines also allow a test for line boundaries. This presumes that lines of text are separated by line (or paragraph) separators. To follow the same approach with Unicode, the end-of-line or start-of-line testing should include not only CRLF, LF, CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028). Formfeed (U+000C) also normally indicates an end-of-line. For more information, see UAX #13, Unicode Newline Guidelines [NewLine].

These characters should be uniformly handled in determining logical line numbers, start-of-line, end-of-line, and arbitrary-character implementations. Logical line number is useful for compiler error messages and the like. Regular expressions often allow for SOL and EOL patterns, which match certain boundaries. Often there is also a "non-line-separator" arbitrary character pattern that excludes line separator characters.

Logical line number
- The line number is increased by one for each occurrence of:
  \u2028 | \u2029 | \u000D\u000A | \u000A | \u000C | \u000D | \u0085
Logical beginning of line (often "^")
- SOL is at the start of a file or string, and also immediately following any occurrence of:
  \u2028 | \u2029 | \u000D\u000A | \u000A | \u000C | \u000D | \u0085
- Note that there is no empty line within the sequence \u000D\u000A.
Logical end of line (often "$")
- EOL is at the end of a file or string, and also immediately preceding any occurrence of:
  \u2028 | \u2029 | \u000D\u000A | \u000A | \u000C | \u000D | \u0085
- Note that there is no empty line within the sequence \u000D\u000A.
Arbitrary character pattern (often ".")
- should not match any of
  \u2028 | \u2029 | \u000A | \u000C | \u000D | \u0085
- Note that ^.*$ (an empty line pattern) should not match the empty string within the sequence \u000D\u000A, but should match the empty string within the sequence \u000A\u000D.
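The line-separator rules above can be sketched with an ordinary alternation, provided the two-character sequence CRLF is listed before the lone CR so that it is consumed as a single separator. The names `LINE_SEPARATOR`, `split_lines`, and `line_number_at` are assumptions made for this example.

```python
import re

# CRLF must come before the lone CR so it counts as one separator.
LINE_SEPARATOR = re.compile("\u2028|\u2029|\r\n|\n|\f|\r|\u0085")


def split_lines(text):
    """Split text into logical lines, dropping the separators."""
    return LINE_SEPARATOR.split(text)


def line_number_at(text, offset):
    """1-based logical line number of the position at offset."""
    return 1 + sum(1 for m in LINE_SEPARATOR.finditer(text) if m.end() <= offset)
```

Splitting "a\r\nb" yields two lines with no empty line in between, whereas "a\n\rb" yields three lines, the middle one empty.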

2.7 Surrogates

UTF-16 uses pairs of Unicode code units to express codepoints above FFFF₁₆. (See Section 3.7 Surrogates, or for an overview see Forms of Unicode [Forms]). While surrogate pairs could be used to identify code points above FFFF₁₆, that mechanism is clumsy. It is much more useful to provide specific syntax for specifying Unicode code points, such as the following:

<character> := <simple_character>
<character> := ESCAPE UTF32_MARK
               HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
               HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR

UTF32_MARK := "U"

Examples:

[\U00100000] Match surrogate private use character

Note: this is one reason why a property for all characters \p{Any} is useful — it makes it unnecessary to use [\u0000-\U0010FFFF] to encompass all characters, and is independent of whether supplementary characters are supported or not.
Note: If the alternate syntax [...\x{3040}...] is used instead of \u, then this does not require any change, since the delimiter provides the length.

The second implication is that surrogate pairs (or their equivalents in other encoding forms) need to be handled internally as single values. In particular, [\u0000-\U00010000] will match all the following sequences of code units:

Code Point	UTF-8 Code Units	UTF-16 Code Units	UTF-32 Code Units
`7F`	`7F`	`007F`	`0000007F`
`80`	`C2 80`	`0080`	`00000080`
`7FF`	`DF BF`	`07FF`	`000007FF`
`800`	`E0 A0 80`	`0800`	`00000800`
`FFFF`	`EF BF BF`	`FFFF`	`0000FFFF`
`10000`	`F0 90 80 80`	`D800 DC00`	`00010000`
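The rows of the table can be reproduced with Python's built-in codecs, and the surrogate pair for a supplementary code point can be computed directly. The function names here are invented for the example; `bytes.hex` with a separator requires Python 3.8 or later.

```python
def code_units(cp):
    """Return (UTF-8, UTF-16, UTF-32) code units for cp as hex strings."""
    ch = chr(cp)
    utf8 = ch.encode("utf-8").hex(" ").upper()
    utf16 = ch.encode("utf-16-be").hex(" ", 2).upper()
    utf32 = ch.encode("utf-32-be").hex().upper()
    return utf8, utf16, utf32


def to_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

An engine that treats each code point as a single value can then apply a range such as the one above without regard to how many code units the underlying encoding form uses.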

3 Extended Unicode Support: Level 2

Level 1 support works well in many circumstances. However, it does not handle more complex languages or extensions to the Unicode Standard very well. Particularly important cases are canonical equivalence, word boundaries, grapheme cluster boundaries, and loose matches. (For more information about boundary conditions, see The Unicode Standard, Section 5.15.)

Level 2 support matches user expectations for sequences of Unicode characters much more closely. It is still locale-independent and easily implementable. However, the implementation may be slower when supporting Level 2, and some expressions may require Level 1 matches. Thus some sort of syntax is usually required to turn Level 2 support on and off.

3.1 Canonical Equivalents

There are many instances where a character can be equivalently expressed by two different sequences of Unicode characters. For example, [ä] should match both "ä" and "a\u0308". (See UAX #15: Unicode Normalization [Norm] and Sections 2.5 and 3.9 of The Unicode Standard for more information.) There are two main options for implementing this:

Before (or during) processing, translate text (and pattern) into a normalized form. This is the simplest to implement, since there are available code libraries for doing normalization.

Expand the regular expression internally into a more generalized regular expression that takes canonical equivalence into account. For example, the expression [a-z,ä] can be internally turned into [a-z,ä] | (a \u0308). While this can be faster, it may also be substantially more difficult to generate expressions capturing all of the possible equivalent sequences.

Note: Combining characters are required for many characters. Even when text is in Normalization Form C, there may be combining characters in the text.
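The first option, normalizing both pattern and text before matching, might look like the following sketch. It assumes Python's `unicodedata.normalize` and the `re` module; the wrapper name `search_canonical` is invented for the example, and normalizing a pattern string wholesale is safe only for simple patterns like the ones shown.

```python
import re
import unicodedata


def search_canonical(pattern, text, form="NFC"):
    """Search after bringing pattern and text into the same normalization form."""
    return re.search(
        unicodedata.normalize(form, pattern),
        unicodedata.normalize(form, text),
    )


# Without normalization, the precomposed class does not match the
# decomposed sequence "a" + U+0308.
plain = re.search("[\u00E4]", "a\u0308")
normalized = search_canonical("[\u00E4]", "a\u0308")
```

Here `plain` is `None`, while `normalized` is a match covering the whole (normalized) text.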

3.2 Default Grapheme Clusters

One or more Unicode characters may make up what the user thinks of as a character. To avoid ambiguity with the computer use of the term character, this is called a grapheme cluster. For example, "G" + acute-accent is a grapheme cluster: it is thought of as a single character by users, yet is actually represented by two Unicode characters.

Note: default grapheme clusters were previously referred to as "locale-independent graphemes". The term cluster has been added to emphasize that the term grapheme is used differently in linguistics. For simplicity and to align with UTS #10: Unicode Collation Algorithm [Collation], the terms "locale-independent" and "locale-dependent" have also been changed to "default" and "tailored" respectively.

These default grapheme clusters are not the same as tailored grapheme clusters, which are covered in Level 3, Tailored Grapheme Clusters. The default grapheme clusters are determined according to the rules in UTR #29: Text Boundaries [Boundaries].

Regular expression engines should provide some mechanism for easily matching against grapheme clusters, since they are more likely to match user expectations for many languages. One mechanism for doing that is to have explicit syntax for clusters, as in the following. This syntax can also be used for tailored grapheme clusters (Tailored Grapheme Clusters).

ITEM := "\g{" CODE_POINT + "}"

Examples:

`[a-z\g{x\u0323}]`	Match a-z, and x with an under-dot (used in American Indian languages)
`[a-z\g{aa}]`	Match a-z, and aa (treated as a single character in Danish).
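As a rough illustration of what a default grapheme cluster is, the sketch below groups each base character with any combining marks that follow it. This covers only the simplest case and is not an implementation of the rules in UTR #29 (it ignores Hangul syllables and CRLF, among other things); the function name is invented for the example.

```python
import unicodedata


def rough_clusters(text):
    """Group each character with the marks (General Category M*) after it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the preceding cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters
```

For the string "x" + U+0323 followed by "yz", this yields three clusters, the first of which contains two code points.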

3.3 Default Words

The simple Level 1 support using simple <word_character> classes is only a very rough approximation of user word boundaries. A much better method takes into account more context than just a single pair of letters. A general algorithm can take care of character and word boundaries for most of the world's languages. For more information, see UTR #29: Text Boundaries [Boundaries].

Note: Word-break boundaries and line-break boundaries are not generally the same; line breaking has a much more complex set of requirements to meet the typographic requirements of different languages. See UAX #14: Line Breaking Properties [LineBreak] for more information. However, line breaks are not generally relevant to general regular expression engines.

A fine-grained approach to languages such as Chinese or Thai, languages that do not have spaces, requires information that is beyond the bounds of what a Level 2 algorithm can provide.

3.4 Default Loose Matches

At Level 1, caseless matches do not need to handle cases where one character matches against two. Level 2 includes caseless matches where one character may match against two (or more) characters. For example, 00DF "ß" will match against the two characters "SS".

To correctly implement a caseless match and case conversions, see UAX #21: Case Mappings [Case]. For ease of implementation, a complete case folding file is supplied at CaseFolding.txt [CaseData].

If the implementation containing the regular expression engine also offers case conversions, then these should also be done in accordance with UAX #21, with the full mappings. The relevant data files are SpecialCasing.txt [SpecialCasing] and UnicodeData.txt [UData].
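Full case folding, where one character may fold to several, is available in Python as `str.casefold`. The comparison below is a minimal sketch of a Level 2 caseless test on whole strings; it does not combine folding with normalization, and the function name is invented for the example.

```python
def caseless_equal(a, b):
    """Compare two strings under full case folding."""
    return a.casefold() == b.casefold()
```

With full folding, U+00DF compares equal to "SS", so `caseless_equal("Stra\u00DFe", "STRASSE")` is true even though the strings differ in length.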

4 Tailored Support: Level 3

All of the above deals with a default specification for a regular expression. However, a regular expression engine also may want to support tailored specifications, typically tailored for a particular language or locale. This may be important when the regular expression engine is being used by less sophisticated users instead of programmers. For example, the order of Unicode characters may differ substantially from the order expected by users of a particular language. The regular expression engine has to decide, for example, whether the list [a-ä] means:

the Unicode characters in binary order between 0061₁₆ and 00E4₁₆ (including 'z', '{', and '¼'), or
the letters in that order in the users' locale (which does not include 'z' in English, but does include it in Swedish).
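The difference between the two readings can be made concrete with a small sketch. The `SWEDISH_ALPHABET` string below is a hypothetical, simplified lowercase ordering used only for illustration; a real implementation would draw on collation data rather than a fixed string, and the function names are invented for the example.

```python
# Hypothetical simplified ordering: a-z followed by a-ring, a-diaeresis, o-diaeresis.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00E5\u00E4\u00F6"


def binary_range(first, last):
    """Characters between first and last in code point order."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}


def tailored_range(first, last, alphabet):
    """Characters between first and last in the given alphabet order."""
    i, j = alphabet.index(first), alphabet.index(last)
    return set(alphabet[i:j + 1])


binary = binary_range("a", "\u00E4")
swedish = tailored_range("a", "\u00E4", SWEDISH_ALPHABET)
```

Both sets contain 'z', but only the binary range contains '¼', and only the tailored range contains U+00E5, which sorts before U+00E4 in this ordering although its code point is higher.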

If both tailored and default regular expressions are supported, then a number of different mechanisms are affected. There are two main alternatives for control of tailored support:

coarse-grained support: the whole regular expression (or the whole script in which the regular expression occurs) can be marked as being tailored.
fine-grained support: any part of the regular expression can be marked in some way as being tailored.

Marking locales is generally specified by means of the common ISO 639 and 3166 tags, such as "en_US". For more information on these tags, see Online Data [Online].

Level 3 support may be considerably slower than Level 2, and some scripts may require either Level 1 or Level 2 matches instead. Thus it is usually required to have some sort of syntax that will turn Level 3 support on and off. Because tailored regular expression patterns are usually quite specific to the locale, and will generally not work across different locales, the syntax should also specify the particular locale or other tailoring customization that the pattern was designed for.

4.1. Tailored Properties

Some Unicode character properties, such as punctuation, may in a few cases vary from language to language or from country to country. For example, whether a curly quotation mark is opening or closing punctuation may vary. For those cases, the mapping of the properties to sets of characters will need to be dependent on the locale or other tailoring.

4.2 Tailored Grapheme Clusters

Tailored grapheme clusters may be somewhat different than the default grapheme clusters discussed in Level 2. They are coordinated with the collation ordering for a given language in the following way. A collation ordering determines a collation grapheme cluster, which is a sequence of characters that is treated as a unit by the ordering. For example, ch is a collation character for a traditional Spanish ordering. More specifically, a collation character is the longest sequence of characters that maps to a sequence of one or more collation elements where the first collation element has a primary weight and subsequent elements do not, and no completely ignorable characters are included.

The tailored grapheme clusters for a particular locale are the collation characters for the collation ordering for that locale. The determination of tailored grapheme clusters requires the regular expression engine to either draw upon the platform's collation data, or incorporate its own tailored data for each supported locale.

See UTS #10: Unicode Collation Algorithm [Collation] for more information about collation, and Annex B. Sample Collation Character Code for sample code.

4.3 Tailored Words

Semantic analysis may be required for correct word-break in languages that don't require spaces, such as Thai, Japanese, Chinese or Korean. This can require fairly sophisticated support if Level 3 word boundary detection is required, and usually requires drawing on platform OS services.

4.4 Tailored Loose Matches

In Level 1 and 2, caseless matches are described, but there are other interesting linguistic features that users may want to filter out. For example, V and W are considered equivalent in Swedish collations, and so [V] should match W in Swedish. In line with UTS #10: Unicode Collation Algorithm [Collation], the following four levels of equivalence are recommended:

exact match: bit-for-bit identity
tertiary match: disregard 4th level differences (language tailorings)
secondary match: disregard 3rd level differences such as upper/lowercase and compatibility variation (e.g. matching both half-width and full-width katakana).
primary match: disregard accents, case and compatibility variation; also disregard differences between katakana and hiragana.

If users are to have control over these equivalence classes, here is an example of how the sample syntax could be modified to account for this. The syntax for switching the strength or type of matching varies widely. Note that these tags switch behavior on and off in the middle of a regular expression; they do not match a character.

ITEM := \c{PRIMARY}   // match primary only
ITEM := \c{SECONDARY} // match primary & secondary only
ITEM := \c{TERTIARY}  // match primary, secondary, tertiary
ITEM := \c{EXACT}     // match all levels, normal state

Examples:

[\c{SECONDARY}a-m] Match a-m, plus case variants A-M, plus compatibility variants

Basic information for these equivalence classes can be derived from the data tables referenced by UTS #10: Unicode Collation Algorithm [Collation].
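The effect of the strength levels can be observed with java.text.Collator, whose PRIMARY, SECONDARY, TERTIARY and IDENTICAL strengths correspond roughly to the levels listed above. The sketch below is illustrative only: the class and method names are hypothetical, and the exact equivalences depend on the collation data shipped with the platform.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class LooseMatchSketch {

    /** True if a and b are equivalent when compared at the given strength. */
    static boolean matchesAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Case is ignored at SECONDARY strength but not at TERTIARY.
        System.out.println(matchesAt(Collator.SECONDARY, "a", "A"));
        System.out.println(matchesAt(Collator.TERTIARY, "a", "A"));
        // Accents are ignored at PRIMARY strength but not at SECONDARY.
        System.out.println(matchesAt(Collator.PRIMARY, "e", "\u00e9"));
        System.out.println(matchesAt(Collator.SECONDARY, "e", "\u00e9"));
    }
}
```

An engine supporting the \c{...} tags could switch between such comparisons as it moves through the pattern, using the strength currently in effect for each literal or set.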

4.5 Tailored Ranges

Tailored character ranges will include tailored grapheme clusters, as discussed above. This broadens the set of grapheme clusters — in traditional Spanish, for example, [b-d] would match against "ch".

Note: this is another reason why a property for all characters \p{Any} is needed—it is possible for a locale's collation to not have [\u0000-\U0010FFFF] encompass all characters.

Languages may also vary in whether they consider lowercase below uppercase or the reverse. This can have some surprising results: [a-Z] may not match anything if Z < a in that locale!
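The difference between a code point range and a tailored range can be sketched as follows. The class and method names are hypothetical, and the membership test shown (comparing the candidate against both endpoints with a collator) is only one plausible reading of a tailored range, not a definition taken from this report.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class TailoredRangeSketch {

    /** Range membership by collation order: low <= candidate <= high. */
    static boolean inCollationRange(Collator collator,
            String low, String high, String candidate) {
        return collator.compare(low, candidate) <= 0
            && collator.compare(candidate, high) <= 0;
    }

    /** Range membership by code point value: low <= candidate <= high. */
    static boolean inCodePointRange(int low, int high, int candidate) {
        return low <= candidate && candidate <= high;
    }
}
```

By code point, [a-Z] is empty because a (U+0061) is greater than Z (U+005A). Under an English collator, a sorts before Z, so the same range contains m. The outcome therefore depends on which ordering the engine applies.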

Annex A. Character Blocks

The Block property from the Unicode Character Database can be a useful property for quickly describing a set of Unicode characters. It assigns a name to segments of the Unicode codepoint space; for example, [\u0370-\u03FF] is the Greek block.

However, block names must be used with discretion; they are very easy to misuse since they only supply a very coarse view of the Unicode character allocation. For example:

Blocks are not at all exclusive. There are many mathematical operators that are not in the Mathematical Operators block; there are many currency symbols not in Currency Symbols, etc.
Blocks may include characters not assigned in the current version of Unicode. This can be both an advantage and a disadvantage. Like the General Category property, this allows an implementation to handle characters correctly that are not defined at the time the implementation is released. However, it also means that depending on the current properties of assigned characters in a block may fail. For example, all characters in a block may currently be letters, but this may not be true in the future.
Writing systems may use characters from multiple blocks: English uses characters from Basic Latin and General Punctuation, Syriac uses characters from both the Syriac and Arabic blocks, various languages use Cyrillic plus a few letters from Latin, etc.
Characters from a single writing system may be split across multiple blocks. See the table below. Moreover, presentation forms for a number of different scripts may be collected in blocks like Alphabetic Presentation Forms or Halfwidth and Fullwidth Forms.

Writing Systems	Blocks
Latin	Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, Latin Extended Additional, Diacritics
Greek	Greek, Greek Extended, Diacritics
Arabic	Arabic, Arabic Presentation Forms-A, Arabic Presentation Forms-B
Korean	Hangul Jamo, Hangul Compatibility Jamo, Hangul Syllables, CJK Unified Ideographs, CJK Unified Ideographs Extension A, CJK Compatibility Ideographs, CJK Compatibility Forms, Enclosed CJK Letters and Months, Small Form Variants
Diacritics	Combining Diacritical Marks, Combining Marks for Symbols, Combining Half Marks
Yi	Yi Syllables, Yi Radicals
Chinese	CJK Unified Ideographs, CJK Unified Ideographs Extension A, CJK Compatibility Ideographs, CJK Compatibility Forms, Enclosed CJK Letters and Months, Small Form Variants, Bopomofo, Bopomofo Extended
others	IPA Extensions, Spacing Modifier Letters, Cyrillic, Armenian, Hebrew, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Ethiopic, Cherokee, Unified Canadian Aboriginal Syllabics, Ogham, Runic, Khmer, Mongolian, CJK Radicals Supplement, Kangxi Radicals, Ideographic Description Characters, CJK Symbols and Punctuation, Hiragana, Katakana, Kanbun, Alphabetic Presentation Forms, Halfwidth and Fullwidth Forms, General Punctuation, Superscripts and Subscripts, Currency Symbols, Letterlike Symbols, Number Forms, Arrows, Mathematical Operators, Miscellaneous Technical, Control Pictures, Optical Character Recognition, Enclosed Alphanumerics, Box Drawing, Block Elements, Geometric Shapes, Miscellaneous Symbols, Dingbats, Braille Patterns, High Surrogates, High Private Use Surrogates, Low Surrogates, Private Use, Specials

For that reason, Script values are generally preferred to Block values.
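The following sketch shows two of the pitfalls above using the JDK's Character.UnicodeBlock and Character.UnicodeScript classes. The helper class and its method names are hypothetical; the results reflect the Unicode data shipped with the Java runtime.

```java
// Hypothetical helper class, for illustration only.
final class BlockVersusScriptSketch {

    /** True if the code point lies in the Greek block (U+0370..U+03FF). */
    static boolean inGreekBlock(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
            == Character.UnicodeBlock.GREEK;
    }

    /** True if the code point has the Greek script value. */
    static boolean isGreekScript(int codePoint) {
        return Character.UnicodeScript.of(codePoint)
            == Character.UnicodeScript.GREEK;
    }

    /** True if the code point lies in the Currency Symbols block. */
    static boolean inCurrencySymbolsBlock(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
            == Character.UnicodeBlock.CURRENCY_SYMBOLS;
    }

    /** True if the code point has General Category Sc (currency symbol). */
    static boolean isCurrencySymbol(int codePoint) {
        return Character.getType(codePoint) == Character.CURRENCY_SYMBOL;
    }
}
```

U+03B1 GREEK SMALL LETTER ALPHA is in both the Greek block and the Greek script, while U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA is in the Greek script but in the Greek Extended block. Similarly, U+0024 DOLLAR SIGN is a currency symbol by General Category but sits in the Basic Latin block. In java.util.regex the same distinction appears as \p{InGreek} (block) versus \p{IsGreek} (script).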

Annex B. Sample Collation Character Code

The following provides sample code for doing Level 3 collation character detection. This code is meant to be illustrative, and has not been optimized. Although written in Java, it could be easily expressed in any programming language that allows access to the Unicode Collation Algorithm mappings.

import java.text.CollationElementIterator;
import java.text.RuleBasedCollator;

/**
 * Return the end of a collation character.
 * @param s         the source string
 * @param start     the position in the string to search
 *                  forward from
 * @param collator  the collator used to produce collation elements.
 * This can either be a custom-built one, or produced from
 * the factory method Collator.getInstance(someLocale)
 * (cast to RuleBasedCollator).
 * @return          the end position of the collation character,
 *                  as an index into s
 */
static int getLocaleCharacterEnd(String s,
  int start, RuleBasedCollator collator) {
    int lastPosition = start;
    CollationElementIterator it
      = collator.getCollationElementIterator(
          s.substring(start));
    it.next(); // discard first collation element
    int primary;

    // accumulate characters until we get to a non-zero primary

    do {
        // getOffset() is relative to the substring, so add start
        lastPosition = start + it.getOffset();
        int ce = it.next();
        if (ce == CollationElementIterator.NULLORDER) break;
        primary = CollationElementIterator.primaryOrder(ce);
    } while (primary == 0);
    return lastPosition;
}

References

[Boundaries]	UTR #29: Text Boundaries http://www.unicode.org/reports/tr29/ (At the time of this writing, UTR #29 was in proposed draft stage.)
[Case]	UAX #21: Case Mappings http://www.unicode.org/reports/tr21/
[CaseData]	http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
[Chap4]	http://www.unicode.org/uni2book/ch04.pdf
[Chap5]	http://www.unicode.org/uni2book/ch05.pdf
[Collation]	UTS #10: Unicode Collation Algorithm http://www.unicode.org/reports/tr10/
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues.
[Forms]	Davis, Mark. "Forms of Unicode" http://www-4.ibm.com/software/developer/library/utfencodingforms/
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[LineBreak]	UAX #14: Line Breaking Properties http://www.unicode.org/reports/tr14/
[NewLine]	UAX #13: Unicode Newline Guidelines http://www.unicode.org/reports/tr13/
[Norm]	UAX #15: Unicode Normalization http://www.unicode.org/reports/tr15/
[Online]	http://www.unicode.org/onlinedat/online.html
[Perl]	http://www.perl.com/pub/q/documentation
[Prop]	http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt
[PropValue]	http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[ScriptData]	http://www.unicode.org/Public/UNIDATA/Scripts.txt
[ScriptDoc]	UTR #24: Script Names http://www.unicode.org/reports/tr24/
[SpecialCasing]	http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
[UCD]	http://www.unicode.org/ucd/
[UData]	http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
[UDataDoc]	http://www.unicode.org/Public/UNIDATA/UnicodeData.html
[Versions]	Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Acknowledgments

Thanks to Kent Karlsson, Jarkko Hietaniemi, Gurusamy Sarathy, Tom Watson and Kento Tamura for their feedback on the document.

Modifications

The following summarizes modifications from the previous version of this document.

Fixed 16-bit reference, moved Supplementary characters support (surrogates) to level 1.
Generally changed "locale-independent" to "default", "locale-dependent" to "tailored" and "grapheme" to "grapheme cluster"
Changed syntax slightly to be more like Perl
Added explicit table of General Category values
Added clarifications about scripts and blocks
Added descriptions of other properties, and a pointer to the default names
Referred to TR 29 for grapheme cluster and word boundaries
Removed old annex B (word boundary code)
Removed spaces from anchors
Added references, modification sections
Rearranged property section
Minor editing

Copyright © 2000-2002 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
