The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]


Forum rules


Use this forum for technical discussion of UAXes 11, 14, 15, 24, 29, 31, 34, 42, and 44. Technical discussion of UTSes 6, 10, 18, 22, 39, and 46. Technical discussion of UTRs 16, 17, 20, 23, 25, 26, 33, and 36, as well as the related properties and files in the Unicode Character Database.



 Post subject: Regex caseless matching: multichars
PostPosted: Sun Jan 23, 2011 5:06 pm 
Forum Admin

Joined: Fri Dec 04, 2009 9:13 pm
Posts: 32
On Multicharacter caseless matching.

Since this essentially continues the discussion from the thread "Caseless matching for literals..." I've closed that thread and provided a pointer here.
— ForumAdmin


There are 102 characters that map to a sequence when casefolded. The question for these is whether a caseless regex match can and should match them.

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{tocasefold!%3D/^.$/}-\p{whitespace}

That is:

[ẚ ﬀ ﬃ ﬄ ﬁ ﬂ ẖ İ ǰ ß ẞ ﬆ ﬅ ẗ-ẙ ŉ ᾄᾌᾂᾊᾆᾎᾀᾈᾅᾍ ᾃᾋᾇᾏᾁᾉᾴᾲᾶᾷᾳᾼ ᾔᾜᾒᾚᾖᾞᾐᾘ ᾕᾝᾓᾛᾗᾟᾑᾙῄῂῆῇῃῌ ῖΐΐῒῗ ῤ ὐὔὒὖῦΰΰῢῧ ᾤᾬᾢᾪᾦᾮᾠᾨᾥ ᾭᾣᾫᾧᾯᾡᾩῴῲῶῷῳῼ և ﬔ ﬕ ﬗ ﬓ ﬖ]

(or more compactly: [ßİŉǰΐΰևẖ-ẚẞὐὒὔὖᾀ-ᾯᾲ-ᾴ ᾶᾷᾼῂ-ῄῆῇῌῒῖῗῢῤῦῧῲ-ῴῶῷ ῼﬀ-ﬆﬓ-ﬗ])
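For readers who want to reproduce a list like this locally, here is a minimal Python sketch. It assumes that `str.casefold()` applies full case folding, as the Python documentation describes. The helper name `multichar_casefold_chars` is made up for illustration. The count it prints depends on the Unicode version bundled with the interpreter, so it need not equal the figure quoted in the post, and unlike the linked query it does not subtract whitespace.

```python
import sys


def multichar_casefold_chars():
    """Return every code point whose full case folding is longer than one character."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if len(chr(cp).casefold()) > 1
    ]


chars = multichar_casefold_chars()
# The total varies with the interpreter's Unicode data version.
print(len(chars))
# Show each character next to its folded sequence, e.g. the U+FB03 ligature.
for ch in chars[:10]:
    print(f"U+{ord(ch):04X} {ch!r} -> {ch.casefold()!r}")
```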

Most of these are 'defective' in that one of a case pair (or triple, counting titlecase) is or was missing from Unicode, which causes an expansion. The characters ß and İ are special in their own ways...

As I remarked earlier, I think it would take some work to put together a solid proposal on how to handle these in regular expressions, so that, say, any expression that matched:
* OFFICE // 6 chars
would also match
* oﬃce // 4 chars, including the "ﬃ" ligature (U+FB03)
and vice versa.
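To make the gap concrete, here is a small Python sketch. It is only an illustration of one engine's behavior, not a proposal: Python's `re` module with `re.IGNORECASE` compares character by character, so it cannot pair six pattern characters with four input characters, while `str.casefold()` (full case folding) maps both strings to the same sequence. The variable names are illustrative.

```python
import re

upper = "OFFICE"        # 6 characters
ligature = "o\ufb03ce"  # 4 characters, U+FB03 is the "ffi" ligature

# Full case folding expands the ligature, so the folded strings agree.
folded_equal = upper.casefold() == ligature.casefold()

# A caseless regex match in Python's re module does not expand it.
regex_match = re.fullmatch(upper, ligature, re.IGNORECASE)

# Workaround that only works for literal patterns: fold both sides first.
literal_match = re.fullmatch(re.escape(upper.casefold()), ligature.casefold())

print(folded_equal)               # True
print(regex_match is None)        # True
print(literal_match is not None)  # True
```

Folding both sides up front is not a general answer. Once the pattern contains character classes or quantifiers, an expansion such as "ffi" changes how many characters each pattern element has to consume, which is the part that would need a worked-out proposal.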


 Post subject: Re: Regex caseless matching: multichars
PostPosted: Fri Jan 28, 2011 3:02 pm 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 186
mark wrote:
There are 102 characters that map to a sequence when casefolded. The question for these is whether a caseless regex match can and should match them.


I would rephrase that. The question is "Under what circumstances should a regex match them?" or "Which regex should match them?".


mark wrote:
As I remarked earlier, I think it would take some work to put together a solid proposal on how to handle these in regular expressions, so that, say, any expression that matched:
* OFFICE // 6 chars
would also match
* oﬃce // 4 chars, including the "ﬃ" ligature (U+FB03)
and vice versa.


And I would reply that this is an entirely different proposal.

There are two different levels of equivalency here.

OFFICE and office (same number of characters) are case equivalent
office and oﬃce (6 and 4 characters) are ligature equivalent

Just because ligatures do not have standard case pairs (there's no FFI ligature) should not mean that caseless matching also becomes ligature-blind matching.

In other words, if a regex such as /office/ doesn't match 'oﬃce' (written with the ﬃ ligature), then making the search caseless should not necessarily include it in the match.

Note, I'm not saying you shouldn't be able to easily express a search mode that ignores ligatures, but it should not by default be caseless matching.

For user-friendly Unicode regex you may need a mode that ignores several different aspects of how a character can be represented in Unicode all at once - this gets back to the discussion of selective foldings.
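One way to picture selective foldings is as independent switches applied before comparison. The Python sketch below is only an illustration under stated assumptions, not a proposed design. The names `match_key`, `approx_simple_fold` and `expand_ligatures` are hypothetical. Simple (one-to-one) case folding is approximated by keeping a character unchanged whenever its full folding would expand. Ligature folding is approximated by NFKD-decomposing characters that have "LIGATURE" in their name and a `<compat>` decomposition.

```python
import unicodedata


def approx_simple_fold(text):
    """Approximate one-to-one case folding: skip any folding that expands."""
    out = []
    for ch in text:
        folded = ch.casefold()
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


def expand_ligatures(text):
    """Expand compatibility ligatures such as U+FB03 into their letters."""
    out = []
    for ch in text:
        is_ligature = (
            "LIGATURE" in unicodedata.name(ch, "")
            and unicodedata.decomposition(ch).startswith("<compat>")
        )
        out.append(unicodedata.normalize("NFKD", ch) if is_ligature else ch)
    return "".join(out)


def match_key(text, ignore_case=False, ignore_ligatures=False):
    """Build a comparison key with each folding switched on separately."""
    if ignore_ligatures:
        text = expand_ligatures(text)
    if ignore_case:
        text = approx_simple_fold(text)
    return text


lig = "o\ufb03ce"
# Caseless only: OFFICE equals office, but not the ligature form.
print(match_key("OFFICE", ignore_case=True) == match_key("office", ignore_case=True))
print(match_key(lig, ignore_case=True) == match_key("office", ignore_case=True))
# Both foldings on: all three forms compare equal.
print(match_key("OFFICE", True, True) == match_key(lig, True, True))
```

With the switches separate, a caseless search stays ligature-sensitive unless the user also asks for ligature folding, which is the distinction drawn above.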

