Analyzing Unicode Text:
Regular Expressions, Boundaries, Sets and More
Markus Scherer - IBM Corporation

Intended Audience: Software Engineers, Testers, Web Designers

Session Level: Beginner, Intermediate, Advanced

Regular expressions have been used for many years to analyze, parse, or extract information from text data. They appear in applications large and small, from simple search operations in word processors to scripting languages such as Perl to queries on large databases.

Traditional regular expressions cannot easily deal with a character set of the size and complexity of Unicode. To address this shortcoming, the Unicode Consortium has published Technical Report #18, a set of guidelines for extending regular expressions to handle Unicode data. Following these guidelines allows organizations to handle data in different languages and scripts correctly.

This paper will review the issues and techniques involved in writing regular expressions for Unicode data. It will cover the guidelines from TR 18, including Unicode encoding forms, character properties and classes, text boundaries, case sensitivity, and normalization, and the implications of all of these for handling different languages in regular expressions. The paper will also survey the capabilities and limitations of those regular expression implementations known to provide significant support for Unicode.
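
As a small illustration of why normalization and case sensitivity matter for matching, the following sketch uses Python's standard `re` and `unicodedata` modules. It is not taken from the paper or from TR 18; the helper name `nfc_search` and the sample strings are assumptions chosen for the example. It shows that the same visible word can be encoded in two ways, and that a plain search finds a match only after both pattern and text are brought to the same normalization form.

```python
import re
import unicodedata

# The same visible word, encoded two different ways.
composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)


def nfc_search(pattern, text, flags=0):
    """Hypothetical helper: normalize pattern and text to NFC, then search.

    This only suits literal patterns like the ones used here; a general
    solution would need normalization support inside the regex engine.
    """
    return re.search(
        unicodedata.normalize("NFC", pattern),
        unicodedata.normalize("NFC", text),
        flags,
    )


# Without normalization, the code point sequences differ, so nothing matches.
naive = re.search(composed, decomposed)

# After normalizing both sides to NFC, the match succeeds.
normalized = nfc_search(composed, decomposed)

# Case-insensitive matching also has to work beyond ASCII letters.
caseless = nfc_search("CAF\u00c9", decomposed, re.IGNORECASE)

# For str patterns, \w covers letters from non-Latin scripts as well.
greek = re.fullmatch(r"\w+", "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac")
```

Normalizing before matching is only one possible approach, and it changes the text being searched, so match offsets refer to the normalized string rather than the original.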

The presentation is intended primarily for users of regular expressions rather than implementers of regular expression engines.