The Unicode Consortium Discussion Forum

All times are UTC - 6 hours [ DST ]


Forum rules


Use this forum for technical discussion of UAXes 11, 14, 15, 24, 29, 31, 34, 42, and 44. Technical discussion of UTSes 6, 10, 18, 22, 39, and 46. Technical discussion of UTRs 16, 17, 20, 23, 25, 26, 33, and 36, as well as the related properties and files in the Unicode Character Database.



 Post subject: Re: Caseless matching for literals and literal set members
Posted: Sun Jan 02, 2011 9:44 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
By request of the thread-starter for "Draft proposal for UTS18 caseless matching on properties", replies that address case-insensitive matching for characters, rather than for properties, have been split off into this new topic. ForumAdmin

Quote:
Chapter 3 of the standard defines the Default Caseless Matching algorithm, but it applies only to two strings, and extending it to apply to patterns is not trivial, and is totally unspecified, as far as I have seen.


One point on which I agree with Karl is that the question of whether case-insensitive matching should support only the simple and common (1:1 mapping) cases, or also the "full" mappings, is not adequately addressed.

In particular, in connection with operators such as ? and *, it's not a trivial question to decide.


 Post subject: Re: Draft proposal for UTS18 caseless matching on properti
Posted: Wed Jan 12, 2011 5:01 pm
Forum Admin

Joined: Fri Dec 04, 2009 9:13 pm
Posts: 32
I didn't understand the comment.

I agree with the original posting that an efficient, standard guideline for implementing caseless regex matching (e.g. (?i)) needs to be there.

All of the significant regex syntaxes support caseless matching, and it is more than a convenience. For example, if you have a complicated expression:

(final|static).*UnicodeSet\(\)

they provide a way to match caselessly without having the extremely clumsy and error-prone reformulation as:

([fF][iI][nN][aA]...[tT]\(\)

So there is a real need for the ability to do this. The only issue really is what would be the correct recommendation. The chief problem is that any equivalence relationship that matches two strings with different numbers of codepoints is difficult to implement with a regex engine, at least efficiently.
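To make the convenience argument concrete, here is a minimal sketch using Python's `re` module as a stand-in engine (my choice for illustration, not something the thread specifies). The helper name `mentions_unicodeset` is hypothetical; `re.IGNORECASE` plays the role of (?i).

```python
import re

# The pattern from the post above, compiled once with the caseless flag
# instead of being rewritten by hand as [fF][iI][nN]...
PATTERN = re.compile(r"(final|static).*UnicodeSet\(\)", re.IGNORECASE)


def mentions_unicodeset(line: str) -> bool:
    """Return True if the line matches the pattern, ignoring case."""
    return PATTERN.search(line) is not None
```

Note that this only shows the flag in use; what the flag should mean for multi-character foldings is exactly the open question.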

Take the regex expression /as\p{L}\p{N}/

That is, with a 1:1 case folding relationship, that'd turn into:

[aA][sSſ]\p{L}\p{N}

(you can use http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{toCaseFold%3Da} to see the single matches)

But then it gets a bit complicated when you include the characters that case fold to something that contains 'a' or 's', which you can see with http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{toCaseFold%3D/a/} (changing the arguments in /.../ appropriately).

The items that casefold to something that contains 'a' are: [aA ẚ], and those for 's' are [sSſ ßẞ ﬆﬅ], so you get funny overlaps. Since a caseless match of ﬅ = st, does that take up 2 slots (thus matching "s\p{L}")? And so on.
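The overlap sets can be reproduced without the online utility. This is a rough sketch that assumes Python's `str.casefold()` (which applies full case folding) is an acceptable stand-in for toCaseFold; the function name is hypothetical, and the exact result depends on the Unicode version bundled with the interpreter.

```python
import sys


def folds_to_something_containing(letter: str) -> set[str]:
    """Collect every code point whose full case folding contains `letter`.

    A brute-force scan of the whole code space; fine for exploration,
    not meant for production use.
    """
    return {
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if letter in chr(cp).casefold()
    }
```

Calling it with "s" should return the plain and long s, both sharp s forms, and the two st ligatures, which is where the "how many slots does it take up" question comes from.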

We define a simple case mapping (1:1), but not a simple case folding. One could derive it from our mappings, and that might be the best recommendation we could make, even though it will miss some cases.

(Note: canonically-equivalent regex matching has similar (but worse) issues in implementation...)


 Post subject: Caseless matching for literals and literal set members
Posted: Wed Jan 12, 2011 6:59 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
I think we all agree on the desirability (requirement) for having a shortcut for case pairs and triples of literals and literal set members.

These can be handled two ways. One is by (virtually) rewriting the regex to replace each instance of a literal with a set covering its equivalence class. (For example a -> [Aa], or [a] -> [Aa]).

The other is to case-insensitively match the set of strings defined by the regex with the target string.

These two approaches would yield different results in some of the edge cases you mention.

I believe that in the former interpretation it's easier to handle the restriction to a 1:1 mapping, and easier to manually override that restriction by expressly adding 1:n mappings as alternatives (using OR) in the regex.
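The first approach (rewriting each literal into a set covering its equivalence class) might be sketched as follows. This is only an approximation under stated assumptions: Python has no simple-case-folding API, so the sketch groups code points by their full folding and keeps only those whose folding is a single character. The names `build_single_char_classes`, `CLASSES` and `rewrite_literal` are hypothetical.

```python
import re
import sys
from collections import defaultdict


def build_single_char_classes() -> dict[str, set[str]]:
    """Group code points whose case folding is exactly one character."""
    classes: dict[str, set[str]] = defaultdict(set)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        folded = ch.casefold()
        if len(folded) == 1:  # skip 1:n foldings such as ß -> ss
            classes[folded].add(ch)
    return classes


CLASSES = build_single_char_classes()


def rewrite_literal(literal: str) -> str:
    """Replace each character of a literal by a bracketed set of its 1:1 equivalents."""
    parts = []
    for ch in literal:
        # Characters with a multi-character folding fall back to themselves.
        members = CLASSES.get(ch.casefold(), set()) | {ch}
        parts.append("[" + "".join(re.escape(m) for m in sorted(members)) + "]")
    return "".join(parts)
```

The rewritten pattern is then matched case-sensitively against the unmodified target, and 1:n cases can be added by hand as alternatives, e.g. (ß|ss).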


 Post subject: Re: Draft proposal for UTS18 caseless matching on properti
Posted: Wed Jan 12, 2011 7:59 pm
Forum Admin

Joined: Fri Dec 04, 2009 9:13 pm
Posts: 32
It depends on what the criteria are. One possible criterion for case-insensitivity is:

  • A regex pattern P is case-insensitive when: P matches S if and only if P matches toCaseFold(S).

That criterion is easy to understand for users. And given it, you'd define (?i) such that:

  • Let P' be /(?i)P/, then
  • P' is case-insensitive
  • If P matches S then P' matches toCaseFold(S).

(assume that P doesn't turn off (?i) internally.)

However, (and it is a big however), while this is easy to understand, and could be worth striving for, I think it would be exceedingly difficult to implement in any kind of reasonably efficient manner. So we'd probably need to back off to something more achievable.
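The criterion can at least be spot-checked on sample strings. A small sketch, again using Python's `re` and `str.casefold()` as stand-ins (assumptions of mine, not part of the proposal); `satisfies_criterion` is a hypothetical helper and only tests the samples it is given, so it can refute the property but never prove it.

```python
import re


def satisfies_criterion(pattern: str, samples: list[str], flags: int = 0) -> bool:
    """Check 'P matches S if and only if P matches toCaseFold(S)' on the samples."""
    compiled = re.compile(pattern, flags)
    return all(
        bool(compiled.search(s)) == bool(compiled.search(s.casefold()))
        for s in samples
    )
```

With Python's own caseless flag, /s/ passes on "Masse" and "MASSE" but fails on "Maße", because the engine does not match "s" against "ß" while the folded target "masse" does contain "s". That illustrates how far an existing 1:1 implementation is from the criterion.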


 Post subject: Re: Draft proposal for UTS18 caseless matching on properti
Posted: Fri Jan 14, 2011 3:51 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
Quote:
One possible criterion for case-insensitivity is:
  • A regex pattern P is case-insensitive when: P matches S if and only if P matches toCaseFold(S).
That criterion is easy to understand for users.


I think this criterion is not as easy to understand in all cases, especially if you allow full case folding (changing the length).

For example, take German Maße (measures) and Masse (mass). The simple case folding for ß is itself, but its full uppercase mapping is "SS", so its full case folding is "ss".

Searching for /(?i)ß/ therefore matches both Maße (-> masse) and Masse (-> masse). This is indeed as expected and useful.

However, searching for /(?i)s/ will now match (twice) on Maße. That is not as expected, and not useful.

In the former example, the pattern /(?i)ß/ is equivalent to a pattern that explicitly includes all members of the equivalence class of ß under casefolding. In other words, substituting /(ß|ss|SS|Ss|sS)/ but not casefolding the target would yield the same matches.

In the latter example, the pattern /(?i)s/ is no longer equivalent to a pattern containing all members of the equivalence class of s under casefolding. In other words substituting /(s|S)/ while not casefolding the target would not yield the same matches. Further, there's no way to write an explicit pattern to get the same match: The pattern /(s|S|ß)/ would match only once, not twice.

In my view, the pattern /(s|S)/ in an otherwise case-sensitive match is much more understandable in its effects than the pattern /(?i)s/ under the criterion you gave.
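The Maße/Masse counts above can be checked with a few lines. This sketch assumes Python's `str.casefold()` as toCaseFold and emulates the "fold the target" criterion by folding the string before a case-sensitive search; both helper names are hypothetical.

```python
import re


def count_folded_matches(pattern: str, target: str) -> int:
    """Count matches of a (already folded) pattern against the case-folded target."""
    return len(re.findall(pattern, target.casefold()))


def count_plain_matches(pattern: str, target: str) -> int:
    """Count case-sensitive matches against the unmodified target."""
    return len(re.findall(pattern, target))
```

Under folding, /s/ finds two matches in "Maße", whereas the explicit patterns /(s|S)/ and /(s|S|ß)/ against the original string find zero and one respectively, so no explicit pattern reproduces the folded behavior.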


 Post subject: Re: Draft proposal for UTS caseless matching on properties
Posted: Fri Jan 14, 2011 4:13 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
An alternative criterion would say:
  • A pattern is case insensitive if and only if it matches the same strings as a pattern in which all literals and set members x are replaced by the full set of characters and sequences y for which toCaseFold(y) = toCaseFold(x)

The matching with such a re-written pattern would then be against the original string S.

This criterion has the advantage of eliminating certain gotchas. For example, under the prior definition, the pattern /(?i)a/ inexplicably matches not only "a" and "A" but also a single one of several dozen characters consisting of a with a diacritic. That character is U+1E9A ( ẚ ) LATIN SMALL LETTER A WITH RIGHT HALF RING. Other characters such as precomposed "ä" or "â" are not matched.

While it might be desirable to treat 1E9A as equivalent to its compatibility decomposition 0061 02BE for some searches, to have this behavior turned on during case insensitive matches, while at the same time not decomposing (canonically) "ä" or "â", is inconsistent and unexpected.
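The U+1E9A gotcha is easy to reproduce. A minimal sketch, assuming Python's `str.casefold()` as toCaseFold and emulating the prior definition by folding both the literal and the target; `folded_target_matches` is a hypothetical name.

```python
import re


def folded_target_matches(literal: str, target: str) -> bool:
    """Emulate the prior definition: fold both sides, then search case-sensitively."""
    return re.search(re.escape(literal.casefold()), target.casefold()) is not None
```

Because U+1E9A folds to the two-character sequence 0061 02BE, a search for "a" hits it, while precomposed U+00E4 (ä) folds to itself and is not matched.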

Under the new criterion, the behavior would be regular. Only strings actually containing an a or an A would be matched. If one wanted to match A WITH HALF RING one would include it in the expression explicitly /(?i)(a| ẚ )/, at which point all its case equivalents would match as well.

Now, to implement this criterion as stated is really wasteful. Therefore, it needs to be optimized. As long as all casefoldings are 1:1 (direct character substitutions), casefolding the target string and the pattern is equivalent to the criterion proposed here.

The difference shows up only when the casefoldings are not 1:1. In that case, one could either rewrite the pattern as suggested in the formal definition, or one could still casefold the target string and pattern as before, but require that the entire right hand side of the case folding be matched atomically. E.g., if casefolding a string produces 'ss' from an 'ß', then either the pattern matches the 'ss' or it doesn't match. In other words, disallow matches on a partial case folding (individual 's' in this example).
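The "match the right-hand side atomically" idea could look roughly like this. It is a sketch for literal patterns only, under the assumption that Python's `str.casefold()` stands in for toCaseFold; the helper names are hypothetical. The target is folded character by character while recording where each original character's folding begins and ends, and a match is kept only if it starts and ends on such a boundary.

```python
import re


def fold_with_boundaries(target: str) -> tuple[str, set[int]]:
    """Case-fold target and record offsets where an original character starts or ends."""
    pieces = []
    boundaries = {0}
    pos = 0
    for ch in target:
        folded = ch.casefold()
        pieces.append(folded)
        pos += len(folded)
        boundaries.add(pos)
    return "".join(pieces), boundaries


def atomic_caseless_findall(literal: str, target: str) -> list[str]:
    """Find caseless matches of a literal, rejecting matches that split a 1:n folding."""
    folded_target, boundaries = fold_with_boundaries(target)
    pattern = re.escape(literal.casefold())
    return [
        m.group()
        for m in re.finditer(pattern, folded_target)
        if m.start() in boundaries and m.end() in boundaries
    ]
```

With this filter, "s" no longer matches inside "Maße", while "ß", "ss" and "SS" still match both "Maße" and "Masse". Filtering after the search is a simplification; a real engine would have to enforce the boundary condition during matching so that alternatives are not lost.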


 Post subject: Re: Draft proposal for UTS18 caseless matching on properti
Posted: Sun Jan 16, 2011 1:25 pm

Joined: Sun Jan 02, 2011 1:10 pm
Posts: 5
As a native English speaker, I would never expect caseless matching to match different words, just the same word, differently capitalized. I have a problem, therefore, thinking it all right that German Maße (measures) and Masse (mass) should match caselessly. Could someone try to explain some circumstances where this is what would be desired?

Another consideration should be what to do about regular expression languages in which caseless matching can apply to just parts of a regular expression. That is, the (?i) doesn't have to come at the beginning, as it does in the examples used so far in this thread, but can appear in something like /Abc(?i)def(?-i)ghi/, where (?-i) turns off case-insensitive matching. In that regular expression only def would match something of a different case.
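For anyone who wants to try partial caseless matching, here is a small sketch in Python, whose `re` module spells the same idea as a scoped group (?i:...) rather than the (?i)...(?-i) toggles of the Perl-style example. The choice of engine and the name `matches_scoped` are assumptions for illustration.

```python
import re

# Only the "def" portion is matched caselessly; "Abc" and "ghi" stay case-sensitive.
SCOPED = re.compile(r"Abc(?i:def)ghi")


def matches_scoped(text: str) -> bool:
    """Return True if the whole text matches the partially caseless pattern."""
    return SCOPED.fullmatch(text) is not None
```

Any definition of caseless matching based on folding the whole target string has to account for this, since only part of the target may be compared caselessly.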


 Post subject: Re: Draft proposal for UTS18 caseless matching on properti
Posted: Sun Jan 16, 2011 2:41 pm

Joined: Tue Jan 11, 2011 11:51 pm
Posts: 5
khw wrote:
As a native English speaker, I would never expect caseless matching to match different words, just the same word, differently capitalized.


Really? I would certainly expect caseless matching to match polish and Polish, mosaic and Mosaic, which are distinct words differing only by capitalization. This property is not desirable, but it is unavoidable; caseless matching just isn't that clever.


 Post subject: Re: Draft proposal for UTS18 caseless matching on properti
Posted: Sun Jan 16, 2011 2:44 pm

Joined: Tue Jan 11, 2011 9:37 pm
Posts: 1
The official rules for German orthography stipulate the following:

“§ 25
...
E2: Steht der Buchstabe ß nicht zur Verfügung, so schreibt man ss. In der
Schweiz kann man immer ss schreiben. Beispiel: Straße – Strasse
[If the letter ß is not available one writes ss. In Switzerland one can always write ss. Example: Straße – Strasse]

E3: Bei Schreibung mit Großbuchstaben schreibt man SS, zum Beispiel:
Straße – STRASSE
[When writing in capital letters one writes SS, for example: Straße – STRASSE]”

(Cited after the Institut für Deutsche Sprache at http://www.ids-mannheim.de/ ; the document’s exact URL is: http://www.ids-mannheim.de/service/reform/regelwerk.pdf.)

Now in German there are three words that are all MASSE when written in capital letters but different from one another in normal non-Swiss orthography:

Masse (‘mass’)
Maße (‘measures, dimensions’; plural form of “Maß”)
maße (occurring in “ich maße mir an” ‘I arrogate’; a form of “sich anmaßen”)

Hope this helps,

Charlie


khw wrote:
As a native English speaker, I would never expect caseless matching to match different words, just the same word, differently capitalized. I have a problem, therefore, thinking it all right that German Maße (measures) and Masse (mass) should match caselessly. Could someone try to explain some circumstances where this is what would be desired?

Another consideration should be what to do about regular expression languages in which caseless matching can apply to just parts of a regular expression. That is the (?i) doesn't have to come at the beginning, as it does in the examples used so far in this thread, but something like /Abc(?i)def(?-i)ghi/ where (?-i) turns off case insensitive matching. In that regular expression only def would match something of a different case.


 Post subject: Re: Caseless matching for literals and literal set members
Posted: Sun Jan 16, 2011 3:21 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
khw wrote:
As a native English speaker, I would never expect caseless matching to match different words, just the same word, differently capitalized.


I think you are on to something, but I would put it differently. In my view, users would never expect caseless matching in a situation where you couldn't match the target by changing the case of the regex. (I'll write down more formally what that means in another post).

khw wrote:
I have a problem, therefore, thinking it all right that German Maße (measures) and Masse (mass) should match caselessly. Could someone try to explain some circumstances where this is what would be desired?


Charlie answered that nicely, but with perhaps too much detail. It boils down to this:
Quote:
there are three words that are all [spelled] MASSE when written in capital letters but different from one another ...:

Masse (‘mass’)
Maße (‘measures, dimensions’; plural form of “Maß”)
maße ([a verb form])


So when you are matching on SS you would also match the all-caps version of words that contain ß in their lower case. So, under any reasonable definition of case-insensitive matching, you should match both forms with ss and ß. The same logic applies to a regex containing ß: it should match ss and SS case-insensitively.

The question is, what should a regex match that just contains "s" or "S" ? Should it match ß? The answer to that question is not obvious, and my opinion on the best way to handle this case differs for now from that held by others in this issue.


 Post subject: Re: Caseless matching for literals and literal set members
Posted: Sun Jan 16, 2011 4:10 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
We are discussing what the definition of case-insensitive matching should be (given Unicode's definition of CaseFolding as a background).

I wrote:
Quote:
Users would never expect caseless matching in a situation where you couldn't match the target by changing the case of the regex.


In the example we just had, a regex /ss/ would match "masse". You could rewrite the regex by giving it uppercase, or title case /SS/ or /Ss/ or mixed case /sS/ and then you would match MASSE or any mixed case forms.

But ß is a lower case equivalent of SS (even though it's not the lower case), so "changing the case of the regex" would encompass rewriting it as /ß/. With that, you match "maße" and "Maße" etc.

SS is also the upper case for a pair of long s characters or mixed pairs of long s and s. So you could reasonably even rewrite the regex to /ſſ/ or /ſs/ etc. and match "maſſe" and "Maſse", etc.

Finally, with the new capital sharp s, you have a choice to make. The character ẞ is not really a cased form of ss or SS, only properly an upper case of ß. Strictly speaking you could have an argument that a regex /ss/ or /SS/ has no business matching ẞ, even under case insensitive matching.

However, you could also make the argument that it is useful to include all possible case equivalents, even such seemingly "second degree" ones, so you get closure. In other words, because /SS/ and /ß/ are case equivalents, they should match the same set of target strings if matched case-insensitively.
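The closure argument can be explored by enumerating short strings over the s-like characters and keeping those that fold to the same thing. A sketch under the assumption that Python's `str.casefold()` stands in for toCaseFold; `S_LIKE` and `case_equivalents` are hypothetical names, and the search is deliberately limited to five characters and a maximum length.

```python
from itertools import product

# s, S, long s, sharp s, capital sharp s
S_LIKE = ["s", "S", "\u017f", "\u00df", "\u1e9e"]


def case_equivalents(text: str, max_len: int = 2) -> set[str]:
    """Strings over S_LIKE (up to max_len characters) with the same case folding as text."""
    target = text.casefold()
    return {
        "".join(combo)
        for n in range(1, max_len + 1)
        for combo in product(S_LIKE, repeat=n)
        if "".join(combo).casefold() == target
    }
```

Starting from "SS" or from "ß" yields the same class, which includes the long-s pairs and also ẞ; that is the "second degree" closure described above. A single "s" is not in it.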

In contrast to the above, if you started with a regex that contained only /s/, then the only case equivalents of a single "s" are s, S and ſ. I disagree with the approach that includes all case foldings that merely happen to include some s.

Mark gave this set as [sSſ ßẞ ﬆﬅ] and admitted that one would get "funny overlaps". This set not only includes the forms of ß, but also various st ligatures.

Now, if my regex doesn't match an st ligature when it's case-sensitive, I don't expect that it should match such a ligature merely because I turn on case-insensitive matching. If I want ligatures to be matched by their constituents, I would write the regex or the matching rules appropriately, so all such ligatures are matched independent of case folding.

I very clearly also never want to match capital sharp s "ẞ" on /s/, because, unlike for lower case ß, no case transform of ẞ leads directly to any form with 's'.
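The asymmetry between ß and ẞ can be seen from the plain lowercase and uppercase mappings. A tiny sketch, assuming Python's `str.lower()` and `str.upper()` reflect the Unicode case mappings bundled with the interpreter; both function names are hypothetical.

```python
def case_forms(ch: str) -> dict[str, str]:
    """Lowercase and uppercase forms of a character, per Python's str methods."""
    return {"lower": ch.lower(), "upper": ch.upper()}


def reaches_plain_s(ch: str) -> bool:
    """True if lowercasing or uppercasing ch yields text containing 's' or 'S'."""
    return any(c in form for form in case_forms(ch).values() for c in "sS")
```

Lower case ß uppercases to "SS", so it reaches a form with s in one step; ẞ only lowercases to ß and uppercases to itself. It is only the case folding of ẞ that produces "ss".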

Equally, if I really want to match on a lower case ß, then requiring that my regex include either ß (or two "s" characters plus case insensitive match) seems appropriate.

I've tried to capture this in the following definition that I posted earlier:

Quote:
A pattern is case insensitive if and only if it matches the same strings as a pattern in which all literals and set members x are replaced by the full set of characters or sequences y for which toCaseFold(y) = toCaseFold(x)

The matching with such a re-written pattern would then be against the original string S (not the case folded S).


A few minutes' consideration will lead one to realize that an actual implementation would not have to rewrite the regex, but could implement case-insensitive matching in the usual manner (casefolding both regex and target) with some additional restrictions that prevent the "funny overlaps" from triggering spurious matches.


 Post subject: Re: Caseless matching for literals and literal set members
Posted: Sun Jan 23, 2011 4:40 pm
Forum Admin

Joined: Fri Dec 04, 2009 9:13 pm
Posts: 32
There are really two different issues that are getting mingled.

1. how to treat properties

2. how to handle multicharacter case folds.

I'll break those into two different topics for clarity.

The thread most closely continuing the discussion here is "Regex caseless matching: multichars".
— ForumAdmin

