Unicode IUC24
Abstract

Pattern Matching with Multilingual Regular Expressions

Weiran Zhang - Oracle Corporation

Intended Audience: Managers, Software Engineers, Systems Analysts, Marketers, Technical Writers, Testers, Web Administrators, Designers
Session Level: Intermediate, Advanced

Regular expressions have long been popular in most computing environments as a powerful tool for pattern matching and manipulation of text and data. They offer a tremendous amount of processing power to a broad range of applications through a versatile and concise syntax that can be used to solve large and small problems alike. However, regular expression implementations have traditionally been designed to support Western European data only; as a result, certain match concepts are not well-defined when extended to support multiple languages. It is therefore highly desirable to have a universal regular expression model that works with all languages, whatever their linguistic characteristics, and that performs pattern matching in a locale-sensitive manner. The Unicode Regular Expression Guidelines (UTR#18) documents the general guidelines for adapting regular expression engines to support Unicode and describes the levels of support possible.
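As a small illustration of why match concepts become ambiguous beyond Western European data, the sketch below compares an ASCII-only range with a Unicode-aware match on two canonically equivalent spellings of the same word. It uses only Python's standard `re` and `unicodedata` modules; the helper names `ascii_word` and `unicode_word` are hypothetical and are not taken from the paper or from any of the engines it surveys.

```python
import re
import unicodedata

# Two canonically equivalent spellings of the same word.
PRECOMPOSED = "caf\u00e9"   # "é" as the single code point U+00E9
DECOMPOSED = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def ascii_word(text: str) -> bool:
    """Match with an ASCII-only range, as a traditional engine might."""
    return re.fullmatch(r"[a-z]+", text) is not None


def unicode_word(text: str) -> bool:
    """Normalize to NFC, then match letters using Unicode-aware classes.

    [^\\W\\d_] means "a word character that is neither a digit nor an
    underscore", which approximates "any letter" in Python's re module.
    """
    normalized = unicodedata.normalize("NFC", text)
    return re.fullmatch(r"[^\W\d_]+", normalized) is not None


if __name__ == "__main__":
    for label, word in (("precomposed", PRECOMPOSED), ("decomposed", DECOMPOSED)):
        print(label, ascii_word(word), unicode_word(word))
```

Normalizing before matching is only one possible design; an engine could instead treat canonically equivalent sequences as equal during matching itself. The sketch does not attempt locale-sensitive behavior such as linguistic ranges or collation elements.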

This paper explores the design and development of a multilingual regular expression engine capable of handling an arbitrary number of languages and character sets. We will cover support for locale-sensitive features such as Unicode character support, character properties, linguistic ranges, special collation elements, and equivalence classes, along with common optimization techniques and performance considerations. We will survey the multilingual capabilities of the major existing regular expression packages and utilities, including Perl 5, Java, GNU, and XML. Finally, we will illustrate the ideas discussed by introducing the new multilingual regular expression features in the upcoming Oracle release, which brings the power of complete multilingual regular expression search to the Oracle database through native support in the SQL and PL/SQL environments.


International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

30 May 2003, Webmaster