mark wrote:
There are 102 characters that map to a sequence when casefolded. The question for these is whether a caseless regex match can and should match them.
I would rephrase that. The question is "Under what circumstances should a regex match them?" or "Which regex should match them?".
mark wrote:
As I remarked earlier, I think it would take some work to put together a solid proposal on how to handle these in regex expressions, so that, say any expression that matched:
* OFFICE // 6 chars
would also match
* oﬃce // 4 chars, including "ffi" ligature
and vice versa.
And I would reply that this is an entirely different proposal.
There are two different levels of equivalency here.
OFFICE and office (same number of characters) are case equivalent.
office and oﬃce (6 and 4 characters) are ligature equivalent.
The fact that ligatures do not have standard case pairs (there's no uppercase FFI ligature) should not mean that caseless matching also becomes ligature-blind matching.
In other words, if a regex such as /office/ doesn't match 'oﬃce' (with the ligature), then making the search caseless should not necessarily add it to the matches.
Note, I'm not saying you shouldn't be able to easily express a search mode that ignores ligatures, but caseless matching should not be that mode by default.
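To make the two levels concrete, here is a small Python sketch. It is only an illustration, not a proposal. It relies on the fact that Python's re.IGNORECASE compares character by character, while str.casefold() applies full case folding, which expands U+FB03 to "ffi". The variable names are mine.

```python
import re
import unicodedata

upper = "OFFICE"      # 6 characters
plain = "office"      # 6 characters
lig = "o\uFB03ce"     # 4 characters, U+FB03 LATIN SMALL LIGATURE FFI

# Caseless regex matching, one character against one character.
pattern = re.compile("office", re.IGNORECASE)
case_match = pattern.fullmatch(upper) is not None  # case equivalent
lig_match = pattern.fullmatch(lig) is not None     # ligature is not matched

# Full case folding expands the ligature, so it conflates both levels.
full_fold_equal = lig.casefold() == upper.casefold()

# Compatibility normalization (NFKC) removes the ligature but keeps case.
compat_equal = unicodedata.normalize("NFKC", lig) == plain
compat_ignores_case = unicodedata.normalize("NFKC", lig) == upper
```

Here case_match is true and lig_match is false, which is the behaviour argued for above: the caseless flag alone does not make the ligature match. Comparing full case foldings, on the other hand, treats all three strings as equal.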
For user-friendly Unicode regex you may need a mode that ignores several different aspects of how a character can be represented in Unicode all at once - this gets back to the discussion of selective foldings.
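One way to picture selective foldings is a helper where each folding is switched on separately. This is a rough sketch under my own assumptions: fold() and equivalent() are hypothetical names, NFKC is used as a stand-in for "ignore ligatures" even though it folds more than ligatures, and the case folding is deliberately restricted to one-to-one mappings so that it cannot expand a ligature by itself.

```python
import unicodedata

def fold(text, *, case=False, compat=False):
    """Hypothetical helper: apply only the foldings that were requested."""
    if compat:
        # Stand-in for ligature folding; NFKC also folds other
        # compatibility forms (full-width letters, for example).
        text = unicodedata.normalize("NFKC", text)
    if case:
        # Per-character lowercasing that never changes the length,
        # so case folding alone never turns a ligature into letters.
        text = "".join(c.lower() if len(c.lower()) == 1 else c for c in text)
    return text

def equivalent(a, b, **foldings):
    """Compare two strings under the selected foldings only."""
    return fold(a, **foldings) == fold(b, **foldings)

lig = "o\uFB03ce"  # 4 characters, with the ffi ligature

case_only = equivalent("OFFICE", "office", case=True)        # case level
case_vs_lig = equivalent("office", lig, case=True)           # not matched
lig_only = equivalent("office", lig, compat=True)            # ligature level
both = equivalent("OFFICE", lig, case=True, compat=True)     # both at once
```

With this split, a caseless comparison leaves the ligature alone, a ligature-blind comparison leaves case alone, and a user-friendly mode can ask for both together.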