Draft Unicode Technical Report #18
Unicode Regular Expression Guidelines

Revision 4

Authors Mark Davis (mark@unicode.org)

Date 1999-10-04

This Version http://www.unicode.org/unicode/reports/tr18-4

Previous Version http://www.unicode.org/unicode/reports/tr18-3

Latest Version http://www.unicode.org/unicode/reports/tr18

Summary

This document describes guidelines for how to adapt regular expression engines to use Unicode. The document is in initial phase, and has not gone through the editing process. We welcome review feedback and suggestions on the content.

Status of this document

This document contains informative material which has been considered and approved by the Unicode Technical Committee for publication as a Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative material or as normative specification.

The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions/ for more information.

Please mail corrigenda and other comments to the authors.

Introduction
- Notation
§1. Basic Unicode Support: Level 1
§2. Extended Unicode Support: Level 2
§3. Locale-Sensitive Support: Level 3
Annex A. Character Blocks
Annex B. Sample Word Boundary Code
Annex C. Sample Collation Character Code

Introduction

The following describes general guidelines for extending regular expression engines to handle Unicode. The following issues are involved in such extensions.

Unicode is a large character set—regular expression engines that are only adapted to handle small character sets will not scale well.
Unicode encompasses a wide variety of languages which can have very different characteristics than English or other western European text.

There are three fundamental levels of Unicode support that can be offered by regular expression engines:

Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic 16-bit logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or UTF-32.) This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. The results of regular expression matching at this level is independent of locale. At this level, the user of the regular expression engine would need to write more complicated regular expressions to do full Unicode processing.
Level 2: Extended Unicode Support. At this level, the regular expression engine also accounts for graphemes (what the end-user thinks of as a character), better word-break, canonical equivalence, and surrogates. This is still a locale-independent level, but provides much better support for end-user expectations than the raw level 1, without the regular-expression writer needing to know about some of the complications of Unicode encoding structure.
Level 3: Locale-Sensitive Support. At this level, the regular expression engine also provides for locale-sensitive treatment of characters, for example, whereby the characters ch can behave as a single character. The results of a particular regular expression reflect the end-users expectations of what constitutes a character in their language, and what order the characters are in. However, there is a performance impact to support at this level.

One of the most important requirements for a regular expression engine is to document clearly what Unicode features are and are not supported. Even if higher-level support is not currently offered, provision should be made for the syntax to be extended in the future to encompass those features.

Note: Unicode is a constantly evolving standard: new characters will be added in the future. This means that a regular expression that tests for, say, currency symbols will have different results in Unicode 2.0 than in Unicode 2.1 (where the Euro currency symbol was added.)

At any level, efficiently handling properties or conditions based on a large character set can take a lot of memory. A common mechanism for reducing the memory requirements — while still maintaining performance — is the two-stage table, discussed in The Unicode Standard, Section 5.7. For example, the Unicode character properties can be stored in memory in a two-stage table with only 7 or 8Kbytes. Accessing those properties only takes a small amount of bit-twiddling and two array accesses.

Notation

In order to describe regular expression syntax, we will use an extended BNF form:

x y	the sequence consisting of x then y
x*	zero or more occurences of x
x?	zero or one occurence of x
x \| y	either x or y
( x )	for grouping
XYZ	terminal character

The following syntax for character ranges will be used in successive examples. This is only a sample syntax for the purposes of examples in this paper. (Regular expression syntax varies widely: the issues discussed here would need to be adapted to the syntax of the particular implementation.)

<list> := LIST_START NEGATION? <item> (LIST_SEP? <item>)* LIST_END
<item> := <character>
       | <character> "-" <character> // range
       | ESCAPE <character>

LIST_START := "["
NEGATION := "^"
LIST_END := "]"
LIST_SEP := ","
ESCAPE := "\"

Examples:

`[a-z,A-Z,0-9]`	Match ASCII alphanumerics
`[^a-z,A-Z,0-9]`	Match anything but ASCII alphanumerics
`[\],\-,\,]`	Match the literal characters ], -, ','

The comma between items is not really necessary, but can improve readability.

§1. Basic Unicode Support: Level 1

Regular expression syntax usually allows for an expression to denote a set of single characters, such as [a-z,A-Z,0-9]. Since there are a very large number of characters in the Unicode standard, simple list expressions do not suffice.

§1.1 Hex notation

The character set used by the regular expression writer may not be Unicode, so there needs to be some way to specify arbitrary Unicode characters. The most standard notation for listing hex Unicode characters within strings is by prefixing with "\u". Making this change results in the following addition:

<character> := <simple_character>
<character> := ESCAPE UTF16_MARK HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR

UTF16_MARK := "u"

Examples:

`[\u3040-\u309F,\u30FC]`	Match Hiragana characters, plus prolonged sound sign
`[\u00B2,\u2082]`	Match superscript and subscript 2

Note: instead of [...\u3040...], one possible alternate syntax is [...\x{3040}...], as in Perl 5.6.

§1.2 Categories

Since Unicode is a large character set, a regular expression engine needs to provide for the recognition of whole categories of characters; otherwise the listing of characters becomes impractical. Engines should be extended using the general category types found in ftp://ftp.unicode.org/Public/UNIDATA.

For example, what the regular expression means by a digit should match to any of the Unicode digits, etc. The basic Unicode character categories are available in the Unicode Character Database (UCD) described in UnicodeCharacterDatabase.html and UnicodeData.html, and consist of: Letters, Punctuation, Symbols, Marks, Numbers, Separators, and Other. These each have a single letter abbreviation, which is the uppercase first character except for separators, which use Z. The official data mapping Unicode characters to these categories is found on UnicodeData.txt (for a preview of Unicode 3.0.0, see UnicodeData-3.0.0.beta.txt).

Each of these categories has different subcategories. For example, the subcategories for Letter are uppercase, lowercase, titlecase, modifier, and other (in this case, other includes uncased letters such as Chinese). By convention, the subcategory is abbreviated by the category letter (in uppercase), followed by the first character of the subcategory in lowercase. For example, Lu stands for Letter, uppercase.

Where a regular expression is expressed as much as possible in terms of higher-level semantic constructs such as Letter, it makes it practical to work with the different alphabets and languages in Unicode. Here is an example of a syntax addition that permits categories:

<item> := CATEGORY_START <unicode_category> CATEGORY_END

CATEGORY_START := "{"
CATEGORY_END := "}"

Examples:

[{L},{Nd}] Match all letters and decimal digits

Two additional special categories that are generally useful are:

`{ALL}`	This matches all code points. This could also be captured with `\u0000-\uFFFF`, except for reasons we will get into later. In some regular expression languages, [{ALL}] is expressed by a period.
`{ASSIGNED}`	This is equivalent to all character but `{Cn}`, and matches all assigned characters (in the target version of Unicode).
`{UNASSIGNED}`	This is equivalent to `{Cn}`, and matches all unassigned characters (in the target version of Unicode). This can be used to exclude unassigned characters, as we will see in the next section.

A regular-expression mechanism may choose to offer the ability to identify characters on the basis of other Unicode properties besides the general category. In particular, Unicode characters are also divided into blocks as described in blocks.txt. For example, [\u0370-\u03FF] is the Greek block, and could be matched by syntax like [{Greek}]. However, there are some significant caveats to the use of Unicode blocks for the identification of characters: see Annex A. Character Blocks. See also Chapter 4, Character Properties for more information.

Note: instead of using {-L}, one alternate syntax is to use {xL}; another is to use case differences to provide the semantic of negation as in Perl. In that style, one could write {lo} for uppercase letter, and {LO} for any character but a lowercase letter.

§1.3 Subtraction

With a large character set, character categories are essential. In addition, there needs to be a way to "subtract" characters from what is already in the list. For example, one may want to include all letters but Q and W without having to list every character in {L} that is neither Q nor W. Here is an example of a syntax change that handles this, by allowing subtraction of any character set from another.

Old:	<list> := LIST_START <item> (LIST_SEP? <item>)* LIST_END
New:	<item> := <character> \| <character> "-" <character> // range \| ESCAPE <character> \| <list> ("-" <list>)+ // subtraction

Examples:

`[[{L}]-[Q,W]]`	Match all letters but Q and W
`[[{N}]-[{Nd}]0-9]`	Match all non-decimal numbers, plus 0-9.
`[[\u0000-\u007F]-[{-L}]]`	Match all letters in the ASCII range, by subtracting non-letters.
`[[\u0370-\u03FF]-[{UNASSIGNED}]]`	Match currently assigned modern Greek characters
`[^a-f,A-F,0-9]]`	Match all characters except for hex digits. Equivalent to `[{ALL}]-[a-f,A-F,0-9]`
`[[{ASSIGNED}]-[a-f,A-F,0-9]]`	Match all assigned characters except for hex digits.

It may also be useful to add syntax for an intersection of character ranges. This is often clearer than using set difference, such as using [[\u0370-\u03FF]&[{ASSIGNED}]] instead of [\u0370-\u03FF]-[{UNASSIGNED}].

Note: with Perl-style syntax, one would use lookahead to get the same effect as difference or intersection. For example, [[\u0000-\u03FF]-[aeiouy]] would be expressed as (?=[\u0000-\u03FF])[^aeiouy]. This looks ahead to see if the next character matches [\u0000-\u03FF], then checks that the character is not an English vowel.

§1.4 Equivalence Classes

The only equivalence classes that most regular expression engines offer is caseless matching. If the engine does offers this, then it must account for the large range of cased Unicode characters outside of ASCII. In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase. To correctly implement a caseless match and case conversions, see UTR #21: Case Mappings. For example, a caseless match should be able to equate the following:

single German character 00DF "ß" with the two characters "SS", and
both the Greek 03C3 "σ" small sigma and 03C2 "ς" small final sigma with the 03A3 "Σ" capital sigma.

For ease of implemenation, a complete case folding file is supplied at http://www.unicode.org/unicode/reports/tr21/CaseFolding.txt.

§1.5 Simple Word Boundaries

Most regular expression engines allow a test for word boundaries (such as by "\b" in Perl). They generally use a very simple mechanism for determining word boundaries: a word boundary is between any pair of characters where one is a <word_character> and the other is not. A basic extension of this to work for Unicode is to make sure that the class of <word_character> includes all the Letter values from the Unicode character database, on UnicodeData-Latest.txt.

Level 2 provides more general support for word boundaries between arbitrary Unicode characters.

§1.6 End Of Line

Most regular expression engines also allow a test for line boundaries. This presumes that lines of text are separated by line (or paragraph) separators. To follow the same approach with Unicode, the end-of-line or start-of-line testing should include not only CRLF, LF, CR, but also PS (U+2029) and LS (U+2028). See Unicode Technical Report #13, Unicode Newline Guidelines for more information.

§2. Extended Unicode Support: Level 2

Level 1 support works well in many circumstances. However, it does not handle more complex languages or extensions to the Unicode Standard very well. Particularly important cases are surrogates, word boundaries, grapheme boundaries, and canonical equivalence. (For more information about boundary conditions, see The Unicode Standard, Section 5-15.)

At a slightly greater cost, Level 2 support matches much more what user expectations are for sequences of Unicode characters. It is still, however, locale independent and easily implementable.

§2.1 Surrogates

The standard form of Unicode is UTF-16. It uses pairs of Unicode code units to express codepoints above FFFF₁₆. (See Section 3.7 Surrogates, or for an overview see Forms of Unicode). While there are no surrogate characters in Unicode 3.0 (outside of private use characters), future versions of Unicode will contain them. This has two implications. First, while surrogate pairs can be used to identify code points above FFFF₁₆, that mechanism is very clumsy. It is much more useful to provide specific syntax for specifying Unicode code points, such as the following:

<character> := <simple_character>
<character> := ESCAPE UTF32_MARK HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR

UTF32_MARK := "v"

Examples:

[\v100000] Match surrogate private use character

Note: this is one reason why a category for all characters {ALL} is useful—it makes it unnecessary to use [\u0000-\v10FFFF] to encompass all characters.
Note: If the alternate syntax [...\x{3040}...] is used instead of \u, then this does not require any change, since the delimiter provides the length.

The second implication is that, surrogate pairs (or their equivalents in other encoding forms) need to be handled internally as single values. In particular, [\u0000-\v10000] will match all the following sequence of code units (and more):

Code Point	UTF-8 Code Units	UTF-16 Code Units	UTF-32 Code Units
`7F`	`7F`	`007F`	`0000007F`
`80`	`C2 80`	`0080`	`00000080`
`7FF`	`DF BF`	`07FF`	`000007FF`
`800`	`E0 A0 80`	`0800`	`08000000`
`FFFF`	`EF BF BF`	`FFFF`	`0000FFFF`
`10000`	`F0 90 80 80`	`D800 DC00`	`00010000`

§2.2 Graphemes

One or more Unicode characters may make up what the user thinks of as a character (also known as a user character or grapheme). For example, "G" + acute-accent may be thought of as a single character by users, yet is actually represented by two Unicode characters. (This is not the same as locale-dependent character boundaries, which are covered in Level 3, Boundaries.) Locale-independent character boundaries that provide for graphemes can be determined by the following grammar:

<grapheme> := <base_character>? <combiner>*
<combiner> := <combining_mark> 
            | <virama> <letter>
            | <hangul_medial>
            | <hangul_final>
            | <extras>

The only time a grapheme does not begin with a base character is when a combining mark is at the start of text, or preceded by a control or format character. The above characterization of grapheme does allow for some nonsensical sequences of characters, such as a <Latin A, Devenagari Virama, Greek Sigma>. A more complicated grammar could sift out these odd combinations, but it is generally not worth the effort, since:

such combinations just don't occur in normal text, and
if they do occur, it doesn't really matter how they are grouped as graphemes

The characters included in most of the terminals in the above expression can be derived from the Unicode character database properties, while some of them need to be explicitly listed:

<combining_mark> := [{M}]
<base_character> := [[{ASSIGNED}]-[{M},{C}]]
<letter> := [{L}]
<virama> := [\u094D...] // characters with canonical class = 9 in the UCD.
<hangul_medial> := [\u1160-\u11A7]
<hangul_final> := [\u11A8-\u11FF]
<extras> := [\uFF9E,\uFF9F]

In particular, if graphemes are handled properly, then [g-h] will match against the three character sequence "g\u0308\u0300".

§2.3 Words

The Level 1 support using simple <word_character> classes is only a very rough approximation of user word boundaries. A much better method takes into account more context than just a single pair of letters. A general algorithm can take care of character and word boundaries for most of the world's languages.

A very straightforward implementation of the Level 2 word-boundaries test is to break characters into three classes instead of two.

<letter> := [{L}{N}{Mc}]
<ignore> := [[{Mn}{Me}{Cf}{Cc}] - [\u0009,\u000A,\u000B,\u000C,\u000D]]
<other> := [[{ALL}] - <letter> - <ignore>]

The <ignore> characters include both non-spacing marks as well as format characters such as the Right-Left Mark that should be ignored during word-break processing. Given this, then word-breaks occur only in the following two situations; otherwise there are no breaks. (The '÷' mark below shows a wordbreak position.)

<letter><ignore>* ÷ <other>
<other><ignore>* ÷ <letter>.

An implementation generally only needs to looks at the character preceding the possible break position and the character following it. Only if there is an <ignore> character preceding the possible break position does it need to scan backwards. See Chapter 5, Implementation Guidelines for more information, and Annex B. Sample Word Boundary Code for sample code. Notice that this code will never break except on grapheme boundaries.

For example, the following shows a test case for correct word-break boundaries:

String:	#\u031Fh\u031Fg\u0308#\u031F\u0338&\u031Fg\u0308
Boundaries:	#\u031F	h\u031Fg\u0308	#\u031F\u0338&\u031F	g\u0308

Word-break boundaries and line-break boundaries are not generally the same; line breaking has a much more complex set of requirements to meet the typographic requirements of different languages. See UTR #14: Line Breaking Properties for more information. However, line breaks are not generally relevant to general regular expression engines.

Level 2 word-break support is not suited for a fine-grained approach to languages such as Chinese or Thai, which require information that is beyond the bounds of what a simple Level 2 algorithm can provide.

§2.4 Canonical Equivalents

There are many instances where a character can be equivalently expressed by two different sequences of Unicode characters. For example, [ä] should match both "ä" and "a\u0308". (See Unicode Technical Report #15: Unicode Normalization and Section 2-5 and 3.9 for more information.) There are two main options for implementing this:

Before processing, translate text (and pattern) into a normalized form. This is the simplest to implement, since there are available code libraries for doing normalization; or
Expand the regular expression internally into a more generalized regular expression that takes canonical equivalence into account. For example, the expression [a-z,ä] can be internally turned into [a-z,ä] | (a \u0308). While this can be faster, it may also be substantially more difficult to generate expressions capturing all of the possible equivalent sequences.

§3. Locale-Sensitive Support: Level 3

All of the above deals with a locale-independent specification for a regular expression. However, a regular expression engine also may want to support locale-dependent specifications. This is especially important when the regular expression engine is being used by less sophisticated users instead of programmers.

For example, the order of Unicode characters may differ substantially from the order expected by users of a particular language. The regular expression engine has to decide, for example, whether the list [a-ä] means:

the Unicode characters in binary order between 0061₁₆ and 00E5₁₆ (including 'z', 'Z', '[', and '¼') or
the letters in that order in the users' locale (which does not include 'z' in English, but does include it in Swedish).

If both locale dependent and locale-independent regular expressions are supported, then both boundaries and sets of characters are affected. There are a number of possibilities:

coarse-grained support: the whole regular expression (or the whole script in which the regular expression occurs) can be marked as being locale-dependent.
fine-grained support: any part of the regular expression can be marked in some way as being locale-dependent.
ultra fine-grained support: any part of the regular expression can be marked in some way as being locale-dependent, and upon a particular locale.

Marking locales is generally specified by means of the common ISO 639 and 3166 tags, such as "en_US". For more information on these tags, see http://www.unicode.org/unicode/onlinedat/online.html.

§3.1 Locale Boundaries

Boundary determinations may be affected by locales. Semantic analysis may be required for correct word-break in languages that don't require spaces, such as Thai, Japanese, Chinese or Korean. This can require fairly sophisticated support if Level 3 boundary detection is required.

Locale-based grapheme boundaries may make somewhat different determinations than the locale-insensitive method discussed in Level 2. These boundaries are coordinated with the collation ordering for a given language. A collation ordering determines what is a collation character, which is a sequence of characters that is treated as a unit by the ordering. For example, ch is a collation character for a traditional Spanish ordering. Collation characters can be determined algorithmically. They are those character sequences that map to a sequence of one or more collation elements, where the first collation element has a primary weight and any subsequent elements do not. See the UTR #10: Unicode Collation Algorithm for more information about collation, and Annex C. Sample Collation Character Code for sample code.

In both of these cases, the regular expression engine may need to access platform services to make these determinations.

§3.2 Locale Equivalence Classes

There are many instances where the user wants a match that is general in some fashion. For example, one might want to match case variants of the letter a, or match any accented a. In Level 1, we described caseless matches, but there are other interesting linguistic features that users may want to filter out. In line with the UTR #10: Unicode Collation Algorithm, at least the following four levels are recommended:

exact match: bit-for-bit identity
tertiary match: disregard 4th level differences (language tailorings)
secondary match: disregard 3rd level differences such as upper/lowercase and compatibility variation (e.g. matching both half-width and full-width katakana).
primary match: disregard accents, case and compatibility variation; also disregard differences between katakana and hiragana.

Here is an example of how the sample syntax could be modified to account for this. Note that these tags switch behavior on and off in the middle of a regular expression. The syntax for doing this kind of thing varies widely.

<item> := {PRIMARY}   // match primary only: set to disregard accents, case...
<item> := {SECONDARY} // match primary & secondary only: set to disregard case...
<item> := {TERTIARY}  // match primary, secondary, tertiary.
<item> := {EXACT}     // match all levels (subject to language tailoring)

Examples:

[{SECONDARY}a-m] Match a-m, plus case variants A-M, plus compatibility variants

Basic information for these equivalence classes can be derived from the data tables referenced by UTR #10: Unicode Collation Algorithm.

§3.3. Locale Character Ranges

Some of Unicode character categories, such as punctuation, are not normative and may vary from language to language or from country to country. For example, whether a curly quotation mark is opening or closing punctuation may vary. For those cases, the mapping of the categories to sets of characters will need to be locale dependent.

Locale-dependent character ranges will include the characters that would collate between the upper and lower bounds, according to the current locale collation conventions. This broadens the set of graphemes — in traditional Spanish, for example, [b-d] would match against "ch". Similarly, V and W are considered equivalent in Swedish collations, and so [V] will match W in Swedish, even with an exact match. This requires the regular expression engine to either draw upon the platform's collation, or incorporate its own.

Note: this is another reason why a category for all characters {ALL} is needed—it is possible for a locale's collation to not have [\u0000-\v10FFFF] encompass all characters.

The Equivalence Classes mentioned above may further affect this match, by setting the requested strength for the collation. Languages may also vary whether they consider lowercase below uppercase or the reverse. This can have some surprising results: [a-Z] may not match anything if Z < a in that locale!

The expression [{SECONDARY}a-b] will have the effect of setting the Collation strength set to disregard case and other tertiary differences). That will end up matching both "a" and "A", no matter how the locale orders them. See the UTR #10: Unicode Collation Algorithm for more information.

Collation Character Specification

This provokes one further addition to the syntax. The issue is where we want to specify a set of graphemes where one of the graphemes can only be expressed as multiple Unicode characters. Here is an example of a syntax extension to support this:

<item> := GRAPHEME_START <character> GRAPHEME_END

GRAPHEME_START := "<"
GRAPHEME_END := ">"

This item will only match if it is a collation character for the current locale. Otherwise it will be ignored.

Examples:

[c,<ch>,l,<ll>] Match Spanish characters c, ch, l, and ll. If not in traditional Spanish, only match c and l.

Note: If commas or other list separators were required in listing characters instead of just optional, this syntax would not be necessary: [c,ch,l,ll] would be sufficient.

Annex A. Character Blocks

The Block property from the Unicode Character Database (as described in blocks.txt) can be a useful property for quickly describing a set of Unicode characters. It assigns a name to segments of the Unicode codepoint space; for example, [\u0370-\u03FF] is the Greek block.

However, block names must be used with discretion; they are very easy to misuse since they only supply a very coarse view of the Unicode character allocation. For example:

Blocks are not at all exclusive. There are many mathematical operators that are not in the Mathematical Operators block; there are many currency symbols not in Currency Symbols, etc.
Blocks may include characters not assigned in the current version of Unicode. This can be both an advantage and disadvantage. Like the General Property, this allows an implementation to handle characters correctly that are not defined at the time the implementation is released. However, it also means that depending on the current properties of assigned characters in a block may fail. For example, all characters in a block may currently be letters, but this may not be true in the future.
Writing systems may use characters from multiple blocks: English uses characters from Basic Latin and General Punctuation, Syriac uses characters from both the Syriac and Arabic blocks, various languages use Cyrillic plus a few letters from Latin, etc.
Characters from a single writing system may be split across multiple blocks. See the table below. Moreover, presentation forms for a number of different scripts may be collected in blocks like Alphabetic Presentation Forms or Halfwidth and Fullwidth Forms.

Shorthand	Blocks
Latin	Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, Latin Extended Additional
Arabic	Arabic Presentation Forms-A, Arabic Presentation Forms-B
Hangul	Hangul Jamo, Hangul Compatibility Jamo, Hangul Syllables
Greek	Greek, Greek Extended
Diacritics	Combining Diacritical Marks, Combining Marks for Symbols, Combining Half Marks
CJK Compatibility	CJK Compatibility, CJK Compatibility Forms, Enclosed CJK Letters and Months, Small Form Variants
CJK	CJK Unified Ideographs, CJK Unified Ideographs Extension A, CJK Compatibility Ideographs
Yi	Yi Syllables, Yi Radicals
Bopomofo	Bopomofo, Bopomofo Extended
others	IPA Extensions, Spacing Modifier Letters, Cyrillic, Armenian, Hebrew, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Ethiopic, Cherokee, Unified Canadian Aboriginal Syllabics, Ogham, Runic, Khmer, Mongolian, CJK Radicals Supplement, Kangxi Radicals, Ideographic Description Characters, CJK Symbols and Punctuation, Hiragana, Katakana, Kanbun, Alphabetic Presentation Forms, Halfwidth and Fullwidth Forms, General Punctuation, Superscripts and Subscripts, Currency Symbols, Letterlike Symbols, Number Forms, Arrows, Mathematical Operators, Miscellaneous Technical, Control Pictures, Optical Character Recognition, Enclosed Alphanumerics, Box Drawing, Block Elements, Geometric Shapes, Miscellaneous Symbols, Dingbats, Braille Patterns, High Surrogates, High Private Use Surrogates, Low Surrogates, Private Use, Specials

Annex B: Sample Word Boundary Code

The following provides sample code for doing Level 2 word boundary detection. This code is meant to be illustrative, and has not been optimized. Although written in Java, it could be easily expressed in any programming language that allows access to the Unicode Character Database information.

/**
 * Wordbreaks are where are two different word types on each side of
 * a caret position.
 * This is typically used in CTRL-arrow movement, Search by Whole Word,
 * double-click, and \b matches in regular expressions.
 * The following routine shows an example of how this is extended to
 * Unicode characters in general.<p>
 * Complications:
 * <ul><li>Non-spacing marks are treated as the type of
 * their base character. This means that:
 * you never break before an nsm; 
 * if you are after an nsm, you scan backward.
 * <li>Control (most) and format characters are ignored,
 * which means you treat them just like non-spacing marks.
 * <li>Note that this is very much simpler than Linebreak, which is described
 * in <a href="http://www.unicode.org/unicode/reports/tr14/">
 * UTR #14: Line Breaking Properties</a>.
 * </ul>
 */
public static boolean isWordBreak(String s, int position) {
	int after = getWordType(s,position);
	int before = getWordType(s,--position);
	    
	// handle non-spacing marks and others like them
	    
	if (after == IGNORE) return false;
	while (before == IGNORE) {
	    before = getWordType(s,--position);
	}
	    
	// return true if different
	    
	return (before != after);
}
	
/**
 * Wordbreak type
 */
public static int
	IGNORE = 0, // special class for ignored items
	LETTER = 1,
	OTHER = 2;
	
/**
 * Get the word-type of a character.
 */
public static int getWordType(String s, int position) {
	    
	// for simplicity, treat end-of-string like a non-word character
	    
	if (position < 0 || position >= s.length()) {
	    return OTHER;
	}
	    
	// Map from General Category to desired type
	// if you don't want to break between numbers and non-numbers
	// then change the mapping here to be the same.
	    
	char c = s.charAt(position);
	switch (Character.getType(c)) {
	    default:
	    return OTHER;
        case Character.UPPERCASE_LETTER:
        case Character.LOWERCASE_LETTER:
        case Character.TITLECASE_LETTER:
        case Character.MODIFIER_LETTER:
        case Character.OTHER_LETTER:
        case Character.COMBINING_SPACING_MARK:
        case Character.LETTER_NUMBER:
        case Character.DECIMAL_DIGIT_NUMBER:
        case Character.OTHER_NUMBER:
	    return LETTER;
        case Character.FORMAT:
        case Character.NON_SPACING_MARK:
        case Character.ENCLOSING_MARK:
        return IGNORE;
        case Character.CONTROL:
	    
	    // special case controls. Unicode doesn't assign them
	    // types because theoretically they could change by platform.
    	    
	    switch (c) {
	        case '\t':
	        case '\n':
	        case '\u000B':
	        case '\u000C':
	        case '\r':
	        return OTHER;
	        default:
	        return IGNORE;
	    }
    }
}

Annex C: Sample Collation Character Code

The following provides sample code for doing Level 3 collation character detection. This code is meant to be illustrative, and has not been optimized. Although written in Java, it could be easily expressed in any programming language that allows access to the Unicode Collation Algorithm mappings.

/**
 * Return the end of a collation character.
 * @param s         the source string
 * @param start     the position in the string to search forward from
 * @param collator  the collator used to produce collation elements. This
 * can either be a custom-built one, or produced from the factory method
 * Collator.getInstance(someLocale).
 * @return          the end position of the collation character
 */

static int getLocaleCharacterEnd(String s, int start, RuleBasedCollator collator) {
    int lastPosition = start;
    CollationElementIterator it 
      = collator.getCollationElementIterator(s.substring(start,s.length()));
    it.next(); // discard first collation element
    int primary;
        
    // accumulate characters until we get to a non-zero primary
        
    do {
        lastPosition = it.getOffset();
        int ce = it.next();
        if (ce == CollationElementIterator.NULLORDER) break;
        primary = CollationElementIterator.primaryOrder(ce);
    } while (primary == 0);
    return lastPosition;
}

Copyright

Copyright © 1998-1998 Unicode, Inc.. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

Unicode Home Page: http://www.unicode.org
Unicode Technical Reports: http://www.unicode.org/unicode/techreports.html

Revision	4
Authors	Mark Davis (mark@unicode.org)
Date	1999-10-04
This Version	http://www.unicode.org/unicode/reports/tr18-4
Previous Version	http://www.unicode.org/unicode/reports/tr18-3
Latest Version	http://www.unicode.org/unicode/reports/tr18

Draft Unicode Technical Report #18 Unicode Regular Expression Guidelines

Summary

Status of this document

Contents

§1. Basic Unicode Support: Level 1

Collation Character Specification

Annex B: Sample Word Boundary Code

Copyright

Draft Unicode Technical Report #18
Unicode Regular Expression Guidelines