Regex Changes

L2/03-107

Re:	Regex Changes
From:	Mark Davis
Date:	2003-03-05

Here is a document for discussion at the UTC meeting, with (draft) proposed changes to the Regular Expression TR (#18) for:

[93-A14] Action Item for Mark Davis: Look at Unicode Technical Report #18 Unicode Regular Expression Guidelines and propose changes that will make it more referenceable, possibly as a UTS.

1. Post as a proposed UTS and solicit public feedback, adding a conformance section containing the following.

C1. An implementation claiming conformance to Level 1 of this specification shall meet the requirements described in the following sections:

2 Basic Unicode Support: Level 1

C2. An implementation claiming conformance to Level 2 of this specification shall satisfy C1, and meet the requirements described in the following sections:

3 Extended Unicode Support: Level 2

C3. An implementation claiming conformance to Level 3 of this specification shall satisfy C1 and C2, and meet the requirements described in the following sections:

4 Tailored Support: Level 3

C4. An implementation claiming partial conformance to a Level of this specification shall clearly describe all of the criteria for that Level that it does not satisfy.
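
The cumulative structure of these clauses can be sketched in code. The Python below is only an illustration under stated assumptions: the names `LEVEL_SECTIONS`, `required_sections`, `ConformanceClaim`, and `is_well_formed` are hypothetical and not part of the proposal; only the section titles come from the clauses above.

```python
from dataclasses import dataclass, field
from typing import List

# Section that each level adds, as named in the conformance clauses.
LEVEL_SECTIONS = {
    1: "2 Basic Unicode Support: Level 1",
    2: "3 Extended Unicode Support: Level 2",
    3: "4 Tailored Support: Level 3",
}


def required_sections(level: int) -> List[str]:
    """Return every section a claim at `level` must meet (levels are cumulative)."""
    if level not in LEVEL_SECTIONS:
        raise ValueError(f"unknown level: {level}")
    return [LEVEL_SECTIONS[n] for n in range(1, level + 1)]


@dataclass
class ConformanceClaim:
    """Hypothetical record of what an implementation claims."""
    level: int
    partial: bool = False
    # Criteria of the claimed level that the implementation does not satisfy.
    unmet_criteria: List[str] = field(default_factory=list)

    def is_well_formed(self) -> bool:
        """A partial claim must describe what is missing; a full claim lists nothing."""
        if self.partial:
            return len(self.unmet_criteria) > 0
        return len(self.unmet_criteria) == 0
```

For example, `required_sections(3)` returns all three section titles, and a partial claim with an empty `unmet_criteria` list is rejected by `is_well_formed()` because it does not describe what is missing.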

2. Make the following changes:

A. List a minimal set of character properties for Level 1. (For comparison, see table below.) The following is a suggested list in the proposed update; in the public review we should solicit comments on this area in particular.

General_Category
Script
White_Space
Alphabetic
Uppercase
Lowercase
Numeric_Value
Noncharacter_Code_Point
Default_Ignorable_Code_Point
Deprecated
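
To make the list concrete, here is a minimal sketch of how three of these properties might be looked up, using Python's standard `unicodedata` module. The helper names are hypothetical. Python's built-in `re` module has no `\p{...}` property syntax, so this shows only the property lookup, not regex integration, and the results reflect whichever Unicode version the interpreter bundles.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Two-letter General_Category value, e.g. 'Lu' for an uppercase letter."""
    return unicodedata.category(ch)


def numeric_value(ch: str):
    """Numeric_Value as a float, or None when the character has none."""
    return unicodedata.numeric(ch, None)


def is_noncharacter(ch: str) -> bool:
    """Noncharacter_Code_Point: U+FDD0..U+FDEF plus the last two code points of each plane."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```

The remaining properties in the list (Script, White_Space, Alphabetic, and so on) are not exposed by `unicodedata`; an implementation would need its own tables derived from the Unicode Character Database.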

B. List the recommended Unicode equivalents for the common character class names used in older, non-Unicode regular expressions. See Table 5-9 in Programming Perl, 3rd Edition (O'Reilly). These are only guidelines.
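
As an illustration of what such an equivalence table could look like, the sketch below maps three POSIX-style class names to General_Category tests. The mapping is an assumption made for this example, not the list this item asks for and not a reproduction of the Perl table; `POSIX_TO_CATEGORY` and `in_posix_class` are hypothetical names.

```python
import unicodedata

# Hypothetical mapping from POSIX-style class names to General_Category tests.
# These choices are illustrative assumptions, not the recommended list.
POSIX_TO_CATEGORY = {
    "digit": lambda gc: gc == "Nd",          # Decimal_Number
    "punct": lambda gc: gc.startswith("P"),  # any Punctuation subcategory
    "cntrl": lambda gc: gc == "Cc",          # Control
}


def in_posix_class(name: str, ch: str) -> bool:
    """Test one character against the assumed Unicode equivalent of a POSIX class."""
    test = POSIX_TO_CATEGORY[name]
    return test(unicodedata.category(ch))
```

Under this mapping `in_posix_class("digit", "\u0663")` is true (ARABIC-INDIC DIGIT THREE), while `in_posix_class("punct", "$")` is false because the dollar sign has General_Category Sc (Currency_Symbol). Mismatches like this with legacy ASCII behaviour are one reason such equivalents can only be guidelines.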

C. Section 2.7 Surrogates requires supplementary character support, but it should be better integrated into the text. In particular, its notation should be unified with 2.1 Hex Notation. The language in a few places also needs to be adjusted to the latest glossary usage, in particular the use of "supplementary characters".
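
For readers less familiar with the issue, the following sketch shows why supplementary characters need explicit attention: in UTF-16 each one is stored as a pair of surrogate code units, yet a regular expression engine should match it as a single character. The function names are hypothetical. Python strings are sequences of code points, so its `re` module already behaves this way; the example makes no claim about the notation the TR should adopt.

```python
import re


def to_surrogate_pair(cp: int):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits select the low surrogate
    return high, low


def matches_as_single_char(cp: int) -> bool:
    """True if '.' matches the whole character, i.e. matching works on code points."""
    return re.fullmatch(".", chr(cp)) is not None
```

For U+10400, `to_surrogate_pair(0x10400)` returns `(0xD801, 0xDC00)`, and `matches_as_single_char(0x10400)` is true in Python 3.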

Background

Character properties, grouped by category (for comparison with the suggested Level 1 list in 2.A):

General: Name, Block, Age, General_Category, Script, White_Space, Alphabetic, Hangul_Syllable_Type, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Deprecated, Logical_Order_Exception

Case: Uppercase, Lowercase, Lowercase_Mapping, Titlecase_Mapping, Uppercase_Mapping, Case_Folding, Simple_Lowercase_Mapping, Simple_Titlecase_Mapping, Simple_Uppercase_Mapping, Simple_Case_Folding, Special_Case_Condition, Soft_Dotted

Identifiers: ID_Continue, ID_Start, XID_Continue, XID_Start

Decomposition and Normalization: Canonical_Combining_Class, Decomposition_Mapping, Composition_Exclusion, Full_Composition_Exclusion, Decomposition_Type, FC_NFKC_Closure, NFC_Quick_Check, NFKC_Quick_Check, NFD_Quick_Check, NFKD_Quick_Check, Expands_On_NFC, Expands_On_NFD, Expands_On_NFKC, Expands_On_NFKD

Shaping and Rendering: Join_Control, Joining_Group, Joining_Type, Line_Break, East_Asian_Width

Bidi: Bidi_Control, Bidi_Mirrored, Bidi_Class, Bidi_Mirroring_Glyph

Numeric: Numeric_Value, Numeric_Type, Hex_Digit, ASCII_Hex_Digit

CJK: Ideographic, Unified_Ideograph, Radical, IDS_Binary_Operator, IDS_Trinary_Operator, ID_Start_Exceptions

Misc: Math, Quotation_Mark, Dash, Hyphen, Terminal_Punctuation, Diacritic, Extender, Grapheme_Base, Grapheme_Extend, Grapheme_Link, Unicode_1_Name, ISO_Comment

Contributory Properties: Other_Alphabetic, Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend, Other_Lowercase, Other_Math, Other_Uppercase