The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]


 Post subject: Regex caseless matching: properties
Posted: Sun Jan 23, 2011 4:57 pm
Forum Admin
On the property issue (considering just single-character case folding for now).

I see two choices: Functional (aka black box) approach vs not.

A. With a functional approach, if I have a regex pattern R, then if R matches "Ab", then R caselessly matches "aB". It doesn't matter what the internals are, it is the function that matters.

B. With a non-functional (whitebox) approach, I have to do a deep analysis of R in order to see what a caseless match would be, because properties don't change their values (I'm guessing this is what Asmus is advocating).

Here is the difference in practice. Suppose we have a complicated regex pattern with deeply embedded instances, of the form

R = ...[\p{ASCII}&\p{Lu}]...\N{LATIN CAPITAL LETTER A}...

Suppose that I replace these expressions by what they denote in the current version of Unicode.

R' = ...[A-Z]...[A]...

With a functional approach, R and R' behave the same, whether used with a caseless match or not. With a non-functional approach, this will not be the case.
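A minimal Python sketch of the functional view, for illustration only. It assumes a pattern has already been resolved to a plain set of characters, restricts itself to ASCII, and approximates single-character case folding with `str.casefold()`. The helper names `simple_fold` and `caseless` are invented here and are not any engine's API.

```python
import string
import unicodedata

ASCII = [chr(cp) for cp in range(128)]

def simple_fold(ch):
    # Approximation (assumption): accept the full case fold only when it
    # is a single character; otherwise leave the character unchanged.
    folded = ch.casefold()
    return folded if len(folded) == 1 else ch

def caseless(chars):
    # Functional view: a character matches caselessly if it folds to the
    # same thing as some character the resolved set already matches.
    targets = {simple_fold(c) for c in chars}
    return {c for c in ASCII if simple_fold(c) in targets}

# R: the property expression [\p{ASCII}&\p{Lu}], resolved via the character database.
r_resolved = {c for c in ASCII if unicodedata.category(c) == "Lu"}
# R': the literal range [A-Z] that the expression denotes.
r_prime = set(string.ascii_uppercase)

same_exact = r_resolved == r_prime
same_caseless = caseless(r_resolved) == caseless(r_prime)
```

Because `caseless` only ever sees the resolved set, it cannot tell R from R'; that is the sense in which the two behave the same.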

There are two issues to consider: (a) what would be the regex user's expectations, and (b) what impact the choice would make on implementations.

I tend to think that having R and R' behave the same has clearer semantics, and is probably easier to implement (or at least no harder). For the latter part, I know that ICU's implementation (and probably others) resolves a complex Unicode set like:

[[[\p{whitespace}\p{Alphabetic}]&\p{script=Latn}]-[[\p{ASCII}&\p{Lu}]\N{LATIN CAPITAL LETTER A}]]

down to a data structure that provides for fast lookup, the equivalent of the following list of ranges:

[a-zªºÀ-ÖØ-öø-ʸˠ-ˤᴀ-ᴥᴬ-ᵜᵢ-ᵥᵫ-ᵷᵹ-ᶾḀ-ỿⁱⁿₐ-ₜKÅℲⅎⅠ-ↈⱠ-ⱿꜢ-ꞇꞋ-ꞎꞐꞑꞠ-ꞩꟺ-ꟿﬀ-ﬆＡ-Ｚａ-ｚ]

It is easy to match against such a structure; it is also easy to transform it into a structure that contains the case-closure, whether or not the original pattern is retained.

For a whitebox implementation, you'd need to keep the original pattern around, for you can't just convert the end result; you have to consider for each final node whether to treat it as case-insensitive or not, and build up the entire expression with the case-closures of the ones you consider to be case-insensitive.
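The contrast between closing the end result and closing each node can be sketched as follows. This is a toy model under stated assumptions: the universe is ASCII only, `Leaf`, `Diff`, and the per-leaf `caseless` flag are hypothetical names, and no real engine is being described.

```python
from dataclasses import dataclass
import string

ASCII = [chr(cp) for cp in range(128)]

def closure(chars):
    # Case closure within ASCII: every character that lowercases to the
    # same thing as some member of the set.
    targets = {c.lower() for c in chars}
    return frozenset(c for c in ASCII if c.lower() in targets)

@dataclass(frozen=True)
class Leaf:
    chars: frozenset
    caseless: bool = True  # hypothetical per-node switch

@dataclass(frozen=True)
class Diff:
    left: object
    right: object

def resolve(node):
    # Plain (case-sensitive) evaluation down to a flat set.
    if isinstance(node, Leaf):
        return node.chars
    return resolve(node.left) - resolve(node.right)

def blackbox(node):
    # Resolve first, then close the end result; the tree can be discarded.
    return closure(resolve(node))

def whitebox(node):
    # Close each leaf that opts in, then rebuild the expression.
    if isinstance(node, Leaf):
        return closure(node.chars) if node.caseless else node.chars
    return whitebox(node.left) - whitebox(node.right)

letters = frozenset(string.ascii_letters)
upper = frozenset(string.ascii_uppercase)

pattern = Diff(Leaf(letters), Leaf(upper))                    # [[a-zA-Z]-[A-Z]]
protected = Diff(Leaf(letters), Leaf(upper, caseless=False))  # right leaf left alone

blackbox_result = blackbox(pattern)
whitebox_result = whitebox(pattern)
protected_result = whitebox(protected)
```

In this toy case the black-box result is all 52 ASCII letters, while closing both leaves before subtracting yields the empty set, so the per-node variant really does need the original tree and a decision for every leaf.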


Post subject: Re: Regex caseless matching: properties
Posted: Fri Jan 28, 2011 3:30 pm
Unicode Guru
Here's a simple question:

When should this regex match characters that are not in 4.0?

/\p{age=4.0}/

how about this regex:

/(?i)\p{age=4.0}/

(Substitute any version number where a case pair happens to have one partner in a more recent version, i.e. encoded later.)

I think the answer to the first question is "never". And the answer to the second question would intuitively also seem to be "never".
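A small Python sketch of the situation in question, using one case pair whose partners were encoded in different versions. The `AGE` table is hand-entered illustrative data rather than something read from the Unicode Character Database, `age_at_most_4_0` is only a stand-in for the age property test, and single-character folding is approximated with `str.casefold()`. All function names are invented for this example.

```python
# Hand-entered Age values for one case pair (illustrative; not read from the UCD).
AGE = {
    "\u0250": (1, 1),  # LATIN SMALL LETTER TURNED A
    "\u2c6f": (5, 1),  # LATIN CAPITAL LETTER TURNED A
}

def age_at_most_4_0(ch):
    # Stand-in for the age=4.0 property test: assigned in 4.0 or earlier.
    return AGE[ch] <= (4, 0)

def simple_fold(ch):
    # Approximation (assumption): use the full fold only if it is one character.
    folded = ch.casefold()
    return folded if len(folded) == 1 else ch

def functional_caseless(pred, ch):
    # Closure reading: match if any case partner satisfies the predicate.
    return any(pred(other) for other in AGE
               if simple_fold(other) == simple_fold(ch))

def property_preserving(pred, ch):
    # The property is left untouched by (?i).
    return pred(ch)

newer = "\u2c6f"
functional_hit = functional_caseless(age_at_most_4_0, newer)
preserving_hit = property_preserving(age_at_most_4_0, newer)
```

Under the closure reading the newer capital letter is admitted because its lowercase partner is old enough; under the property-preserving reading it is rejected, which is the "never" answer above.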

Now another simple question:

How many Unicode characters are matched by this regex:

/\p{hex_digit}/

And how many by this regex:

/(?i)\p{hex_digit}/

Because hex_digit is defined irrespective of case, the answer should be the same: 44 characters (the 22 ASCII hex digits plus their 22 fullwidth equivalents).

If caseless matching is defined so that it includes matches on a substring of multicharacter expansions during case folding, then the number jumps considerably, because you would suddenly match on the "f" of the "fi", "fl", "ffi" etc. ligatures, the "a" of "a + right half ring", and who knows how many others.
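Both claims can be checked roughly in Python. The sketch below builds the hex digit set by hand (ASCII hex digits plus their fullwidth forms), since the standard library does not expose that property directly. `simple_fold` is an approximation of single-character folding, and `hits_via_expansion` is a hypothetical helper modelling the substring-of-expansion behaviour described above.

```python
import string

FULLWIDTH_OFFSET = 0xFEE0  # distance from an ASCII character to its fullwidth form

ascii_hex = set(string.hexdigits)  # 0-9, a-f, A-F
fullwidth_hex = {chr(ord(c) + FULLWIDTH_OFFSET) for c in ascii_hex}
hex_digit = ascii_hex | fullwidth_hex

def simple_fold(ch):
    # Approximation (assumption): use the full fold only if it is one character.
    folded = ch.casefold()
    return folded if len(folded) == 1 else ch

# 1:1 folding: close the set over every code point and see whether it grows.
targets = {simple_fold(c) for c in hex_digit}
one_to_one = {chr(cp) for cp in range(0x110000)
              if simple_fold(chr(cp)) in targets}

def hits_via_expansion(ch):
    # True if the multi-character fold of ch contains a hex digit.
    folded = ch.casefold()
    return len(folded) > 1 and any(c in hex_digit for c in folded)

# fi, fl, ffi ligatures and a-with-right-half-ring
candidates = ["\ufb01", "\ufb02", "\ufb03", "\u1e9a"]
expansion_hits = [c for c in candidates if hits_via_expansion(c)]
```

With 1:1 folding the set stays at 44 characters; all four candidates are picked up once substrings of expansions are allowed to match.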

My point is that in both examples, you've violated the constraint expressed by the choice of property in an essentially unpredictable way. Just because of caseless matching, the first example no longer properly filters for version repertoire and the second example is useless for parsing hex digits.

Now you might say the user should override them manually, but why? There seems to be no reasonable scenario in which anyone would want the matches generated by applying these two "caselessly".


Post subject: Re: Regex caseless matching: properties
Posted: Fri Jan 28, 2011 4:54 pm
Forum Admin
When should regular expression R match characters that are not matched by R?

Well, never. I'd agree with you there.

When should regular expression R when closed over transform T match characters that are not matched by R?

When T(character) is matched by R.


Post subject: Re: Regex caseless matching: properties
Posted: Fri Jan 28, 2011 6:54 pm
Unicode Guru
mark wrote:
When should regular expression R when closed over transform T match characters that are not matched by R?

When T(character) is matched by R.


And when does this match the (practical, not merely formal) expectations of regex users?

When T is "well-behaved".

Casefolding that includes expansion is not well-behaved, because, as I have shown, you can easily create scenarios with absurd or highly unexpected results.

Casefolding that includes only single character mapping is not well-behaved in the context of some properties, such as "age".

(And, don't forget, casefolding in the 1:1 sense is a no-op for many properties, since they either contain only uncased characters (e.g. \p{gc=Mc}) or are defined to include all case variants (hex_digit).)


Post subject: Re: Regex caseless matching: properties
Posted: Fri Jan 28, 2011 7:13 pm
Unicode Guru
mark wrote:
Suppose we have a complicated regex pattern with deeply embedded instances, of the form
Code:
R = ...[\p{ASCII}&\p{Lu}]...\N{LATIN CAPITAL LETTER A}...

Suppose that I replace these expressions by what they denote in the current version of Unicode.

Code:
R' = ...[A-Z]...[A]...

With a functional approach, R and R' behave the same, whether used with a caseless match or not. With a non-functional approach, this will not be the case.


Rewriting the regex R internally into a series of literal sets as in R' is a fine implementation technique. For some regexes, the properties are indeed just shorthands for longer lists of characters.

But, the problem that vexed Karl, and that I see as well, is that some of the Unicode properties have semantics that are not readily obvious if you reduce them to sets.

The examples I used, age and hex_digit, were chosen because of that.

Applying any transform that isn't closed under these properties is likely to lead to unanticipated results, because the most obvious use of these properties would be to filter input (do not admit any unsupported character, i.e. age>4.0, or non-hex digit). Having the transform bring in characters that don't satisfy the filter set up by the property presents the kind of gotcha that is similar to many security risks.

Now, in Perl, you can already retain case sensitivity for part of a pattern by confining the (?i) operator to a group. The question boils down to this: do you require everybody to manually protect such properties from the case transform, or do you turn things around so that most properties are excluded from automatic case folding unless explicitly enabled?
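For readers who want to try the manual-protection option, here is the same idea sketched with Python's `re` module rather than Perl (a deliberate substitution; Python 3.6 or later is assumed for scoped inline flags). The `id=` prefix and the pattern as a whole are made up for illustration.

```python
import re

# (?i:...) confines case-insensitivity to the prefix; the hex payload
# after it stays case-sensitive.
pattern = re.compile(r"(?i:id=)[0-9A-F]+")

upper_ok = pattern.fullmatch("ID=1F") is not None
prefix_folded = pattern.fullmatch("id=1F") is not None
payload_not_folded = pattern.fullmatch("id=1f") is None
```

The prefix matches in either case, but a lowercase payload is rejected, so the character class acts as a filter that the caseless flag cannot widen.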

For regex implementations that need to support partially transformed regexes, neither option would represent a fundamental difference to the implementation. However, and that's the part that interests me, one or the other choice could lead to more predictable outcomes.

By predictable I don't mean a deterministic outcome (that is a given), but one that is correctly anticipated, even in the edge cases, by any programmer.


Post subject: Re: Regex caseless matching: properties
Posted: Wed Feb 02, 2011 2:59 pm
Forum Admin
If anything is to be done about regex in the upcoming UTC, then a proposal needs to be submitted as soon as possible. (Such a proposal could set out the options if it is too early to really have a preferred course of action.)

