WORKING DRAFT Unicode Technical Report #18

Unicode Regular Expression Guidelines

Revision 3
Author Mark Davis (mark@unicode.org)
Date 1998-10-06
This Version http://www.unicode.org/unicode/reports/tr18-3
Previous Version n/a
Latest Version http://www.unicode.org/unicode/reports/tr18

Summary

This document describes guidelines for how to adapt regular expression engines to use Unicode. The document is in its initial phase and has not yet gone through the editing process. We welcome review feedback and suggestions on the content.

Status of this document

This document is an unpublished, preliminary working draft. It is posted for general review. At its next meeting, the Unicode Technical Committee (UTC) may reject this document, review it for suitability to progress to draft status, and/or further amend it. Please mail any comments to the author.


Introduction

The following describes general guidelines for extending regular expression engines to handle Unicode, and the issues involved in such extensions.

There are three fundamental levels of Unicode support that can be offered by regular expression engines:

Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic 16-bit logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, or UTF-16LE.) This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. The results of regular expression matching at this level are independent of locale. At this level, the user of the regular expression engine would need to write more complicated regular expressions to do full Unicode processing.

Level 2: Extended Unicode Support. At this level, the regular expression engine also accounts for graphemes (what the end-user thinks of as a character), canonical equivalence, and surrogates. This is still a locale-independent level, but provides much better support for end-user expectations than the raw level 1, without the regular-expression writer needing to know about some of the complications of Unicode encoding structure.

Level 3: Locale-Sensitive Support. At this level, the regular expression engine also provides for locale-sensitive treatment of characters, for example, whereby the characters "ch" can behave as a single character. The results of a particular regular expression reflect the end-user's expectations of what constitutes a character in their language, and what order the characters are in. However, there is a performance impact to support at this level.

Note: Unicode is a constantly evolving standard: new characters will be added in the future. This means that a regular expression that tests for, say, currency symbols will have different results in Unicode 2.0 than in Unicode 2.1 (where the Euro currency symbol was added).

At any level, efficiently handling properties or conditions based on a large character set can take a lot of memory. A common mechanism for reducing the memory requirements--while still maintaining performance--is the two-stage table, discussed in The Unicode Standard, Section 5.7. For example, the Unicode character properties can be stored in memory in a two-stage table with only 7 or 8 Kbytes. Accessing those properties takes only a small amount of bit-twiddling and two array accesses.
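
As an illustration, such a lookup might be implemented as in the following Java sketch, assuming stage1 and stage2 were precomputed from the Unicode character database (the names and the block size of 256 are illustrative):

final class PropertyTable {
    private final short[] stage1;  // 256 entries: one per high byte
    private final byte[] stage2;   // shared 256-entry blocks of property values

    PropertyTable(short[] stage1, byte[] stage2) {
        this.stage1 = stage1;
        this.stage2 = stage2;
    }

    // Two array accesses plus a little bit-twiddling, as described above.
    byte propertyOf(char c) {
        int block = stage1[c >>> 8];              // which shared block?
        return stage2[(block << 8) | (c & 0xFF)]; // offset within that block
    }
}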

Notation

In order to describe regular expression syntax, we will use a modified BNF form:

x y
the sequence consisting of x then y
x*
zero or more occurrences of x
x?
zero or one occurrence of x
x | y
either x or y
( x )
for grouping
XYZ
terminal character

The following syntax for character ranges will be used in successive examples. This is only a sample syntax for the purposes of examples in this paper. (Regular expression syntax varies widely: the issues discussed here would need to be adapted to the syntax of the particular implementation.)

<list> := LIST_START NEGATION? <item> (LIST_SEP? <item>)* LIST_END
<item> := <character>
<item> := <character> - <character>
<item> := ESCAPE <character>

LIST_START := "["
LIST_END := "]"
NEGATION := "^"
LIST_SEP := ","
ESCAPE := "\" 

Examples:

[a-z,A-Z,0-9] Match ASCII alphanumerics
[^a-z,A-Z,0-9] Match any character but ASCII alphanumerics
[\],\-,\^,\,] Match the literal characters ], -, ^, and comma

The comma between items is not really necessary, but can improve readability.
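
To make the sample syntax concrete, here is a minimal Java sketch that builds a character set from a <list> expression, assuming 16-bit code values. Only the grammar given above is handled (not, for example, the \u notation of section 1.1), and the class name and error handling are illustrative:

import java.util.BitSet;

final class CharList {
    static BitSet parse(String list) {
        if (!list.startsWith("[") || !list.endsWith("]"))
            throw new IllegalArgumentException("not a list: " + list);
        BitSet set = new BitSet(0x10000);
        int i = 1;
        int end = list.length() - 1;
        boolean negated = list.charAt(i) == '^';     // NEGATION?
        if (negated) i++;
        while (i < end) {
            char lo = literalAt(list, i);            // <character>
            i += width(list, i);
            char hi = lo;
            if (i < end && list.charAt(i) == '-') {  // <character> - <character>
                i++;
                hi = literalAt(list, i);
                i += width(list, i);
            }
            set.set(lo, hi + 1);
            if (i < end && list.charAt(i) == ',') i++;  // LIST_SEP?
        }
        if (negated) set.flip(0, 0x10000);
        return set;
    }

    // ESCAPE <character> denotes the character itself.
    private static char literalAt(String s, int i) {
        return s.charAt(i) == '\\' ? s.charAt(i + 1) : s.charAt(i);
    }
    private static int width(String s, int i) {
        return s.charAt(i) == '\\' ? 2 : 1;
    }
}

With this sketch, CharList.parse("[a-z,A-Z,0-9]").get('Q') yields true.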


Level 1: Basic Unicode Support

Regular expression syntax usually allows for an expression to denote a set of single characters, such as [a-z,A-Z,0-9]. Since there are a very large number of characters in the Unicode standard, simple list expressions do not suffice.

1.1 Hex notation

The character set used by the regular expression writer may not be Unicode, so there should be some way to specify arbitrary Unicode characters. The most standard notation for expressing hex Unicode characters within strings is to prefix them with "\u". Making this change results in the following addition:

<character> := <simple_character>
<character> := ESCAPE UNICODE_MARK HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR

UNICODE_MARK := "u" 

Examples:

[\u3040-\u309F,\u30FC] Match Hiragana characters, plus prolonged sound sign
[\u00B2,\u2082] Match superscript and subscript 2
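
As an aside, Java's java.util.regex package happens to accept the same \u notation inside character classes, so the first example above can be exercised directly (illustrative only):

import java.util.regex.Pattern;

public class HexNotation {
    public static void main(String[] args) {
        // Hiragana block plus the prolonged sound sign U+30FC, as above.
        Pattern hiragana = Pattern.compile("[\\u3040-\\u309F\\u30FC]+");
        // "hiragana" written in Hiragana, followed by U+30FC.
        String text = "\u3072\u3089\u304C\u306A\u30FC";
        System.out.println(hiragana.matcher(text).matches()); // true
    }
}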

1.2 Categories

Since Unicode is a large character set, a regular expression engine needs to provide for the recognition of categories of characters. These categories should be based on the types found in the Unicode character database at ftp://ftp.unicode.org/Public/UNIDATA.

For example, what the regular expression means by a "digit" should match any of the Unicode digits, and so on. The basic Unicode character categories are described in README.TXT, and consist of: Letters, Punctuation, Symbols, Marks, Numbers, Separators, and Other. These each have a single-letter abbreviation, which is the uppercase first character, except for Separators, which use Z. The data mapping Unicode characters to these categories is found in UnicodeData-Latest.txt.

Each of these categories has different subcategories. For example, the subcategories for Letter are uppercase, lowercase, titlecase, modifier, and other (in this case, other includes uncased letters, such as those used for Chinese). By convention, the subcategory is abbreviated by the category letter (in uppercase), followed by the first character of the subcategory in lowercase. For example, Lu stands for Letter, uppercase.

Unicode characters are also divided into blocks, as described in BLOCKS.TXT. For example, \u0370-\u03FF is the Greek block. (Note: The blocks may include characters not assigned in the current version of Unicode.)

A regular-expression mechanism may choose not to offer all of these, but it should be extended to cover more than just the categories applicable to English. Expressing a regular expression as much as possible in terms of higher-level semantic constructs such as Letter makes it easier to work with different alphabets and languages. Here is an example of a syntax addition that permits categories:

<item> := CATEGORY_START NEGATION? <unicode_category> CATEGORY_END

CATEGORY_START := "{"
CATEGORY_END := "}" 

Examples:

[{L},{Nd}] Match all letters and decimal digits
[A,{^L}] Match all non-letters, plus A
[a-ä,^{^L}] Match all letters in the binary range a-ä

These categories could either be spelled out completely, or abbreviated as they are in these examples. There are additional character properties from the Unicode properties that may be useful to add, although they are not as important as the character categories. The properties listed in PROPLST2.TXT may also be useful.
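
For reference, these categories are directly queryable in some environments. In Java, for example, Character.getType reports the subcategory, and java.util.regex accepts the abbreviations in \p{...} form; the following sketch is illustrative only:

import java.util.regex.Pattern;

public class Categories {
    public static void main(String[] args) {
        // Subcategory via the character database: U+00C4 is Lu.
        System.out.println(Character.getType('\u00C4')
                == Character.UPPERCASE_LETTER);                   // true
        // The sample [{L},{Nd}] corresponds to [\p{L}\p{Nd}] here.
        Pattern letterOrDigit = Pattern.compile("[\\p{L}\\p{Nd}]");
        System.out.println(letterOrDigit.matcher("7").matches()); // true
    }
}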

One additional special category is {ALL}, which matches all characters. This could also be captured with \u0000-\uFFFF, except for reasons we will discuss later (see Level 3). The category {^Cn} is also generally useful: it matches all characters assigned in the target version of Unicode, and so can be used to exclude unassigned characters, as we will see in the next section.

See The Unicode Standard, Version 2.0, Chapter 4 for more information.

1.3 Subtraction

With a large character set, character categories are essential. In addition, there needs to be a way to "subtract" characters from what is already in the list. For example, one may want to include all letters but Q and W without having to list every character in {L} that is neither Q nor W. Here is an example of a syntax change that handles this, by allowing the negation to occur on any sequence of items:

Old:

<list> := LIST_START NEGATION? <item> (LIST_SEP? <item>)* LIST_END 

New:

<list> := LIST_START <item1> (LIST_SEP? <item1>)* LIST_END
<item1> := NEGATION? <item> 

This syntax can be used to determine a set of characters. The polarity starts out positive--items are successively added to the set. With a "^" character, the polarity switches to negative, and items are successively removed. That is, [A,B,^C,D,^E,F] is equivalent to the following set of characters (where "diff" stands for set difference):

((((({A} union {B}) diff {C}) diff {D}) union {E}) union {F})

A "^" at the very front of the list is implicitly preceded by the set of all characters:

[^A,B,C] == [\u0000-\uFFFF,^A,B,C]
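
The polarity rule might be implemented as in the following Java sketch, where an Item (an illustrative type) is either the character set denoted by one <item> or a bare "^" marker:

import java.util.BitSet;
import java.util.List;

final class SetBuilder {
    static final class Item {
        final BitSet chars;  // null means this item is a "^" marker
        Item(BitSet chars) { this.chars = chars; }
    }

    static BitSet build(List<Item> items) {
        BitSet result = new BitSet(0x10000);
        boolean positive = true;   // the polarity starts out positive
        boolean first = true;
        for (Item item : items) {
            if (item.chars == null) {
                // A "^" at the very front is implicitly preceded by
                // the set of all characters.
                if (first) result.set(0, 0x10000);
                positive = !positive;
            } else if (positive) {
                result.or(item.chars);      // union
            } else {
                result.andNot(item.chars);  // set difference
            }
            first = false;
        }
        return result;
    }
}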

Examples:

[{L}^Q,W] Match all letters but Q and W
[{N}^{Nd}^0-9] Match all non-decimal numbers, plus 0-9.
[\u0000-\u007F^{^L}] Match all letters in the ASCII range
[{Greek}^{Cn}] Match (modern) Greek characters

Another way to get much the same effect is to have syntax that allows subtraction between character lists, such as:

[{L}] - [Q,W] Match all letters but Q and W

1.4 Equivalence Classes

Case matches need to account for the large range of Unicode characters outside of ASCII. The information in the Unicode character database can be used to produce case-less matches.

Because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase. It is also possible for new characters to produce the reverse situation, where two different Unicode characters have the same lowercase. To correctly implement a caseless match, the safest approach is to map each character ch to lowercase(uppercase(ch)) before comparing. (Of course, this can be done more efficiently internally by generating the required table of values once and caching that table.)
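
Here is a minimal Java sketch of this approach, with the folding table built eagerly for simplicity. Note that it performs only one-to-one mappings; full case mappings such as "\u00DF" to "SS" are not handled:

final class CaseFold {
    private static final char[] FOLD = new char[0x10000];
    static {
        // Fold each character through uppercase then lowercase, caching
        // all 64K results so the double mapping is paid only once.
        for (int c = 0; c < 0x10000; c++) {
            FOLD[c] = Character.toLowerCase(Character.toUpperCase((char) c));
        }
    }

    static boolean caselessEquals(String a, String b) {
        if (a.length() != b.length()) return false;
        for (int i = 0; i < a.length(); i++) {
            if (FOLD[a.charAt(i)] != FOLD[b.charAt(i)]) return false;
        }
        return true;
    }
}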

1.5 Words

Most regular expression engines allow a test for word boundaries. They generally use a very simple mechanism for determining word boundaries: a word boundary is between any pair of characters where one is a <word_character> and the other is not. A basic extension of this to work for Unicode is to make sure that the class of <word_character> includes all the Letter values from the Unicode character database, in UnicodeData-Latest.txt.
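
A minimal Java sketch of such an extended test follows; including digits and underscore in the word class mirrors traditional engines and is an assumption of the sketch:

final class WordBoundary {
    static boolean isWordChar(char c) {
        return Character.isLetter(c) || Character.isDigit(c) || c == '_';
    }

    // A boundary falls at position i when exactly one side is a word character.
    static boolean isBoundary(String s, int i) {
        boolean before = i > 0 && isWordChar(s.charAt(i - 1));
        boolean after = i < s.length() && isWordChar(s.charAt(i));
        return before != after;
    }
}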

1.6 End Of Line

Most regular expression engines also allow a test for line boundaries. This presumes that lines of text are separated by line (or paragraph) separators. To follow the same approach with Unicode, the end-of-line or start-of-line testing should include not only CRLF, LF, CR, but also PS (U+2029) and LS (U+2028). See Unicode Technical Report #13, Unicode Newline Guidelines for more information.
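
For illustration, the extended test might look like the following Java sketch, which treats CRLF as a single terminator:

final class LineBoundary {
    // LF, CR, plus the Unicode separators LS (U+2028) and PS (U+2029).
    static boolean isLineTerminator(char c) {
        return c == '\n' || c == '\r' || c == '\u2028' || c == '\u2029';
    }

    // True if a line starts at position i; CRLF counts as one terminator.
    static boolean atLineStart(String s, int i) {
        if (i == 0) return true;
        if (s.charAt(i - 1) == '\r' && i < s.length() && s.charAt(i) == '\n')
            return false;  // do not split the CRLF pair
        return isLineTerminator(s.charAt(i - 1));
    }
}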


Level 2: Extended Unicode Support

Level 1 support works well in many circumstances. However, it fails for more complex languages. Particularly important cases are word boundaries, grapheme boundaries, and canonical equivalence. (For more information about boundary conditions, see The Unicode Standard, Version 2.0, Section 5.13.)

At a slightly greater cost, Level 2 support comes much closer to matching user expectations for sequences of Unicode characters. It is still, however, locale-independent.

2.1 Characters

One or more Unicode characters may make up what the user thinks of as a character (also known as a user character or grapheme). For example, "G" + acute-accent may be thought of as a single character by users, yet is actually represented by two Unicode characters. Locale-independent character boundaries that provide for graphemes can be determined by the following grammar:

<grapheme> := ( <base_character> | <hangul_conjoined> | <indic_cluster> )
              <combining_mark>*
<base_character> := <non_combining_mark> | <high_surrogate> <low_surrogate>
<hangul_conjoined> := <hangul_initial>+ <hangul_medial>* <hangul_final>*
<indic_cluster> := <simple_sequence> ( <virama> <simple_sequence> )*
<simple_sequence> := <base_character> <non_spacing_mark>*

The above characterization of grapheme does allow for some nonsensical sequences of characters, such as <Latin A, Devanagari Virama, Greek Sigma>. A more complicated grammar could sift out these odd combinations, but it is generally not worth the effort, since (a) such combinations just don't occur in normal text, and (b) if they do occur, it doesn't really matter how they are grouped as graphemes.

In the definitions of <base_character> and <non_combining_mark>, we include character codes that may not be defined to represent characters in the current version of Unicode. Since they have no definition, the default choice is to treat them as undistinguished base characters. In future versions of Unicode, their status may change.
In particular, there are no surrogates defined in Unicode 2.1 (or even planned for Unicode 3.0): once they are added, the grammar will need to be adapted to provide for them. At this point, the only provision that needs to be made is to keep surrogate pairs from breaking across grapheme or word boundaries.

The characters included in most of the terminals in the above expression can be derived from the Unicode character database properties:

<non_combining_mark> := [{^M}]
<non_spacing_mark> := [{Mn}{Me}]
<combining_mark> := [{M}] 

Some of them need to be explicitly listed:

<virama> := [\u094D...]
<hangul_initial> := [\u1100-\u115F]
<hangul_medial> := [\u1160-\u11A7]
<hangul_final> := [\u11A8-\u11FF]
<high_surrogate> := [\uD800-\uDBFF]
<low_surrogate> := [\uDC00-\uDFFF] 
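
The grammar can be approximated in code. The following Java sketch finds the next grapheme boundary, keeping surrogate pairs and Hangul jamo runs together and attaching trailing combining marks; the <indic_cluster> production is omitted for brevity, and the jamo ranges are taken from the terminals above:

final class Graphemes {
    // Returns the index just past the grapheme starting at position i.
    static int nextBoundary(String s, int i) {
        char c = s.charAt(i++);
        if (Character.isHighSurrogate(c) && i < s.length()
                && Character.isLowSurrogate(s.charAt(i))) {
            i++;  // <high_surrogate> <low_surrogate>
        } else if (c >= '\u1100' && c <= '\u115F') {
            // <hangul_initial>+ <hangul_medial>* <hangul_final>*
            while (i < s.length() && s.charAt(i) >= '\u1100' && s.charAt(i) <= '\u115F') i++;
            while (i < s.length() && s.charAt(i) >= '\u1160' && s.charAt(i) <= '\u11A7') i++;
            while (i < s.length() && s.charAt(i) >= '\u11A8' && s.charAt(i) <= '\u11FF') i++;
        }
        while (i < s.length() && isCombiningMark(s.charAt(i))) i++;  // <combining_mark>*
        return i;
    }

    private static boolean isCombiningMark(char c) {
        int t = Character.getType(c);
        return t == Character.NON_SPACING_MARK
                || t == Character.ENCLOSING_MARK
                || t == Character.COMBINING_SPACING_MARK;
    }
}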

See below for a discussion of locale-dependent character boundaries.

2.2 Words

The Level 1 support using simple <word_character> classes is only a very rough approximation of user word boundaries. A much better method takes into account more context than just a single pair of letters. A general algorithm can take care of character and word boundaries for most of the world's languages.

Note that word boundaries and line-break boundaries are not generally the same.

A simple test for word boundaries is to divide graphemes into two classes, <letter_grapheme> and <non_letter_grapheme>, where the distinction between the two is based on whether the first character of the grapheme is a Letter or not. Then the test for word boundaries is very similar to the Level 1 word boundary check.

See The Unicode Standard, Version 2.0, Chapter 5 for more information, and extensions to handle Chinese, Japanese and Korean.

2.3 Canonical Equivalents

There are many instances where a character can be equivalently expressed by two different sequences of Unicode characters. For example, [ä] should match both "ä" and "a" followed by "\u0308". (See The Unicode Standard, Version 2.0, Sections 2.5 and 3.9 for more information.) There are two main alternative implementations for handling this.
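
One possible implementation, sketched below in Java, is to bring both the pattern's literal characters and the target text to the same canonical form before comparing; java.text.Normalizer is used purely for illustration:

import java.text.Normalizer;

public class CanonicalMatch {
    public static void main(String[] args) {
        String composed = "\u00E4";     // a-umlaut, precomposed
        String decomposed = "a\u0308";  // a + combining diaeresis
        // Normalizing both sides to the same canonical form (NFC here)
        // makes the two representations compare equal.
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true
    }
}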


Level 3: Locale-Sensitive Support

All of the above deals with a locale-independent specification for a regular expression. However, a regular expression engine also may want to support locale-dependent specifications. This is especially important when the regular expression engine is being used by less sophisticated users instead of programmers.

For example, the order of Unicode characters may differ substantially from the order expected by users of a particular language. The regular expression engine has to decide, for example, whether the list [a-ä] means the characters whose binary code values lie between those of "a" and "ä", or the characters that sort between "a" and "ä" in the current locale's collation order.

If both locale-dependent and locale-independent regular expressions are supported, then both boundaries and sets of characters are affected, and there are a number of possibilities for how the two can be combined.

Locales are best specified by means of the ISO 639 and ISO 3166 tags, such as "en_US". For more information on these tags, see http://www.unicode.org/unicode/onlinedat/online.html.

3.1 Boundaries

Boundary determinations may be affected by locales. For example, locale conventions for Thai require a more sophisticated word-boundary detection.

A collation ordering defines what a collation character is: a string that is treated as a unit by the ordering. For example, "ch" is a collation character for a traditional Spanish ordering. This is the characterization of a grapheme that is most useful for regular expressions if they are to be accessible to end-users. See the Unicode Standard Collation Algorithm for more information.

In both of these cases, the regular expression engine probably needs to access platform services to make these determinations.
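
In Java, for example, java.text.BreakIterator is such a platform service; the following sketch is illustrative, and whether dictionary-based Thai breaking is applied depends on the platform:

import java.text.BreakIterator;
import java.util.Locale;

public class LocaleBoundaries {
    public static void main(String[] args) {
        // "sawatdee" (hello) -- Thai is written without spaces between words.
        String thai = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35";
        BreakIterator words = BreakIterator.getWordInstance(new Locale("th"));
        words.setText(thai);
        for (int b = words.first(); b != BreakIterator.DONE; b = words.next()) {
            System.out.println("boundary at " + b);
        }
    }
}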

3.2 Equivalence Classes

There are many instances where the user wants a match that is general in some fashion. For example, one might want to match case variants of "a", or match any accented "a"s. In Level 1, we described caseless matches, but there are other interesting linguistic features that users may want to filter out. In line with the Unicode Standard Collation Algorithm, at least the following four levels are recommended: primary, secondary, tertiary, and exact.

Here is an example of how the sample syntax could be modified to account for this, using the level notation of the Unicode Standard Collation Algorithm.

<item> := {PRIMARY}   // match primary only: set to disregard accents, case...
<item> := {SECONDARY} // match primary & secondary only: set to disregard case...
<item> := {TERTIARY}  // match primary, secondary, tertiary.
<item> := {EXACT}     // match all levels (subject to language tailoring) 

Examples:

[{SECONDARY}a-m] Match a-m, plus case variants A-M, plus compatibility variants
[{PRIMARY}a^{SECONDARY}a] Match any accented a's and A's, but exclude unaccented a and A

Basic information for these equivalence classes can be derived from the data tables referenced by Unicode Standard Collation Algorithm.
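
For illustration, these levels map naturally onto the strength settings of a platform collation service, as in the following Java sketch:

import java.text.Collator;
import java.util.Locale;

public class StrengthMatch {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.FRENCH);
        c.setStrength(Collator.PRIMARY);    // disregard accents and case
        System.out.println(c.equals("a", "\u00C1"));  // true: A-acute == a
        c.setStrength(Collator.SECONDARY);  // accents count, case does not
        System.out.println(c.equals("a", "\u00C1"));  // false: accents differ
    }
}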

3.3 Character Ranges

Some of the Unicode character categories, such as punctuation, are not normative, and may vary from language to language or from country to country. For example, whether a curly quotation mark is opening or closing punctuation may vary. For those cases, the mapping of the categories to sets of characters will need to be locale-dependent.

Locale-dependent character ranges will include the characters that would collate between the upper and lower bounds, according to the current locale collation conventions. This broadens the set of graphemes--in traditional Spanish, for example, [b-d] would match against "ch". Similarly, V and W are considered equivalent in Swedish collations, and so [V] will match "W" in Swedish, even with an exact match. This requires the regular expression engine to either draw upon the platform's collation, or incorporate its own.
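
A sketch of such a range test in Java, using a platform Collator; whether a traditional Spanish tailoring (with "ch" inside [b-d]) is available depends on the platform:

import java.text.Collator;
import java.util.Locale;

public class CollatedRange {
    // A grapheme is in the range if it sorts at or after the lower bound
    // and at or before the upper bound under the given collation.
    static boolean inRange(Collator c, String grapheme, String lo, String hi) {
        return c.compare(lo, grapheme) <= 0 && c.compare(grapheme, hi) <= 0;
    }

    public static void main(String[] args) {
        Collator spanish = Collator.getInstance(new Locale("es"));
        System.out.println(inRange(spanish, "c", "b", "d"));  // true
    }
}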

Note: this is one reason why a category for all characters {ALL} is needed--it is possible for a locale's collation to not have [\u0000-\uFFFF] encompass all characters.

The Equivalence Classes mentioned above may further affect this match, by setting the requested strength for the collation. Languages may also vary in whether they consider lowercase to sort below uppercase or the reverse. This can have some surprising results: [a-Z] may not match anything if "Z" < "a"!

The expression [{PRIMARY}a-b] will have the effect of setting the collation strength to disregard case (that is, tertiary differences). That will end up matching both "a" and "A", no matter how the locale orders them. See the Unicode Standard Collation Algorithm for more information.

3.4 Collation Character Specification

This suggests one further addition to the syntax. The issue arises when we want to specify a set of graphemes in which one of the graphemes can only be expressed as a sequence of multiple Unicode characters. Here is an example of a syntax extension to support this:

<item> := GRAPHEME_START <character> GRAPHEME_END

GRAPHEME_START := "<"
GRAPHEME_END := ">" 

This item will only match if it is a collation character for the current locale. Otherwise it will be ignored.

Examples:

[c,<ch>,l,<ll>] Match Spanish characters c, ch, l, and ll. If not in traditional Spanish, only match c and l.

Note: If commas or other list separators were required in listing characters instead of just optional, this syntax would not be necessary: [c,ch,l,ll] would be sufficient.


Summary

In summary, regular expression engines can offer three levels of Unicode support: basic, locale-independent support (Level 1); extended support for graphemes, canonical equivalence, and surrogates (Level 2); and locale-sensitive support (Level 3).

One of the most important requirements is to document clearly what Unicode features are and are not supported. Secondly, even if higher-level support is not currently offered, provision should be made for the syntax to be extended in the future to encompass those features.


Copyright

Copyright © 1998 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained in or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/techreports.html