L2/03-107

Re: Regex Changes
From: Mark Davis
Date: 2003-03-05

Here is a document for discussion at the UTC meeting, with (draft) proposed changes to the Regex TR (#18), for:

[93-A14] Action Item for Mark Davis: Look at Unicode Technical Report #18 Unicode Regular Expression Guidelines and propose changes that will make it more referenceable, possibly as a UTS.  

1. Post as a proposed UTS and solicit public feedback, adding a conformance section containing the following.

C1. An implementation claiming conformance to Level 1 of this specification shall meet the requirements described in the following sections:

C2. An implementation claiming conformance to Level 2 of this specification shall satisfy C1, and meet the requirements described in the following sections:

C3. An implementation claiming conformance to Level 3 of this specification shall satisfy C1 and C2, and meet the requirements described in the following sections:

C4. An implementation claiming partial conformance to a Level of this specification shall clearly describe all of the criteria for that Level that it does not satisfy.

2. Make the following changes:

A. List a minimal set of character properties for Level 1. (For comparison, see table below.) The following is a suggested list in the proposed update; in the public review we should solicit comments on this area in particular.

B. List the recommended Unicode equivalents for the common character class names used in older, non-Unicode regular expressions. See Table 5-9 in Programming Perl, 3rd Edition (O'Reilly). These are only guidelines.

C. Section 2.7 Surrogates requires supplementary character support, but it should be better integrated into the text. In particular, its notation should be unified with that of 2.1 Hex Notation. The language in a few places also needs to be adjusted to the latest glossary usage, especially the use of "supplementary characters".
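
To illustrate the concern in item C, here is a minimal sketch in Python (chosen only for illustration; the TR is language-neutral). It assumes an engine that works on code points, and uses U+10450 as an arbitrary sample code point above U+FFFF. The `\U` hex escape is Python's own notation, not the notation proposed for the TR.

```python
import re

# U+10450 is an arbitrary supplementary code point (above U+FFFF), used only as a sample.
ch = "\U00010450"

# In UTF-16 it occupies two code units (a surrogate pair) ...
utf16_units = len(ch.encode("utf-16-be")) // 2

# ... but a code-point-based engine should match it as a single character.
single = re.fullmatch(r".", ch) is not None

# Hex notation for a class covering every supplementary code point.
supplementary = re.compile(r"[\U00010000-\U0010FFFF]")
found = supplementary.findall("a" + ch + "b")

print(utf16_units, single, found == [ch])
```

The point of the sketch is that one hex notation can cover both BMP and supplementary code points, so pattern authors never need to write surrogate pairs by hand.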


Background

General:
    Name, Block, Age, General_Category, Script, White_Space, Alphabetic,
    Hangul_Syllable_Type, Noncharacter_Code_Point,
    Default_Ignorable_Code_Point, Deprecated, Logical_Order_Exception

Case:
    Uppercase, Lowercase, Lowercase_Mapping, Titlecase_Mapping,
    Uppercase_Mapping, Case_Folding, Simple_Lowercase_Mapping,
    Simple_Titlecase_Mapping, Simple_Uppercase_Mapping, Simple_Case_Folding,
    Special_Case_Condition, Soft_Dotted

Identifiers:
    ID_Continue, ID_Start, XID_Continue, XID_Start

Decomposition and Normalization:
    Canonical_Combining_Class, Decomposition_Mapping, Composition_Exclusion,
    Full_Composition_Exclusion, Decomposition_Type, FC_NFKC_Closure,
    NFC_Quick_Check, NFKC_Quick_Check, NFD_Quick_Check, NFKD_Quick_Check,
    Expands_On_NFC, Expands_On_NFD, Expands_On_NFKC, Expands_On_NFKD

Shaping and Rendering:
    Join_Control, Joining_Group, Joining_Type, Line_Break, East_Asian_Width

Bidi:
    Bidi_Control, Bidi_Mirrored, Bidi_Class, Bidi_Mirroring_Glyph

Numeric:
    Numeric_Value, Numeric_Type, Hex_Digit, ASCII_Hex_Digit

CJK:
    Ideographic, Unified_Ideograph, Radical, IDS_Binary_Operator,
    IDS_Trinary_Operator, ID_Start_Exceptions

Misc:
    Math, Quotation_Mark, Dash, Hyphen, Terminal_Punctuation, Diacritic,
    Extender, Grapheme_Base, Grapheme_Extend, Grapheme_Link, Unicode_1_Name,
    ISO_Comment

Contributory Properties:
    Other_Alphabetic, Other_Default_Ignorable_Code_Point,
    Other_Grapheme_Extend, Other_Lowercase, Other_Math, Other_Uppercase
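
As a rough indication of what supporting a few of the properties listed above involves, the following sketch looks up some of them with Python's standard `unicodedata` module. The language choice and the helper names `props` and `matches_gc` are assumptions made for this example only. The module exposes only a subset of the properties, and the values reflect whichever Unicode version the Python build ships with.

```python
import unicodedata

def props(ch):
    """Return a few UCD property values for a single character."""
    return {
        "Name": unicodedata.name(ch, ""),
        "General_Category": unicodedata.category(ch),
        "Canonical_Combining_Class": unicodedata.combining(ch),
        "Bidi_Class": unicodedata.bidirectional(ch),
        "Bidi_Mirrored": bool(unicodedata.mirrored(ch)),
        "East_Asian_Width": unicodedata.east_asian_width(ch),
        "Numeric_Value": unicodedata.numeric(ch, None),
    }

def matches_gc(ch, value):
    """Hypothetical helper: test a character against a General_Category
    value ('Nd') or a one-letter group ('L' for all letters)."""
    return unicodedata.category(ch).startswith(value)

print(props("A"))
print(matches_gc("\u0663", "Nd"))  # ARABIC-INDIC DIGIT THREE
```

A regular expression engine would expose such tests through property syntax in the pattern rather than through function calls; the sketch only shows the underlying lookups.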