L2/12-187

Date/Time: Mon May 7 23:14:32 CDT 2012
Contact: tchrist@perl.com
Name: Tom Christiansen
Report Type: Public Review Issue
Opt Subject: feedback on pri182


This is a piece of mail I sent to Mark Davis this morning.
It has two different concerns. The first has to do with various
forms of case-insensitive matching, as discussed in pri182.
The second has to do with errors in the current definitions
of \R and \X in tr18. I have made very minor corrections to
the mail compared to the form in which it was originally sent.
Please also consult Russ Cox’s Karl Williamson’s followup
message which include further and better examples than my own.

--tom

Tom Christiansen
Boulder, Colorado
======================================

From:          Tom Christiansen <tchrist@chthon>
To:            <mark@macchiato.com>
CC:            karl williamson <public@khwilliamson.com>,
               Russ Cox <rsc@google.com>
Date:          Mon, 07 May 2012 07:57:08 MDT
Subject:       Re: Preview of working draft for #18
In-Reply-To:   <CAJ2xs_GQN_majaKM6vszFr45S=JyBZfbrxpeWj687XcuDj7e6A@mail.gmail.com>


[Russ cc'd to make sure I'm not talking about him behind his back :)
 and so he can correct me if I've unwittingly misstated something 
 relevant to his work.]

Mark Davis <mark@macchiato.com> wrote on Sun, 06 May 2012 21:16:26 -0700:

> I hadn't heard anything from either of you on the note I sent you back in
> February about the proposed update to #18:
> http://www.unicode.org/review/pri182/.

> I want to make sure you realize that during the meeting this week, the UTC
> may approve that draft. So if you do have any feedback, it should be sent
> in in as soon as possible. It does make some conformance changes, so it
> would be very helpful if you could review.
 
Mark, 

Thank you for the note.  I've looked at pri182, and agree with most of the
changes without any comment.

I've also read over Karl's feedback, and think he addresses the issues of 
case-insensitive matching pretty well.  It's a real conundrum.

In recent discussions I've had with Russ Cox, a couple of other things have 
come up that might stand consideration.  The first has bearing on Karl's
feedback regarding case-insensitive matching of properties, so I am sending
it to you directly.

The issue involves transitivity.  It's most easily illustrated by considering
what dot — or if you would, \p{Any} — should match under case-insensitive mode.

Since dot can match any code point, it could for example match U+00DF.  But
U+00DF could match such things as "ss", "SS", paired long s's, etc.  So now
suddenly you have \p{Any} matching more than any, which is darned weird.

This is the problem that Karl discovered with something like [\x80-\xFF]
under case-insensitive mode.  Notice that 0xDF is in that range, which means
that "SS" et alios are also in that range.  It is very bizarre, or at least
counterintuitive for traditional regexes, to have a bracketed character
class match multiple code points, since just as with dot, a bracketed
character class "smells" like it "should" match a single code point.

This becomes even worse with negated classes.  Consider that [^\x80-\xFF]
matches something that cannot be matched by something in that range.
Since 0xDF is in that range, under full casefolding trying to match the
string "guess" against a pattern like /(?i)^[^\x80-\xFF]+$/ would 
necessarily fail, since "guess" contains within it an instance of 
something that one of its range elements *could* match, since one
of its range elements is 0xDF.  In Perl case-insensitive notation,

    "guess" =~ /\xDF/i

is true (and the length of the match is 2).  Therefore its logical
complement

    "guess" =~ /^[^\x80-\xFF]+$/i

is necessarily false.  It is shocking to say that a purely ASCII string
can *fail* to match something that asserts it has no Block=Latin1 characters
in it.  

This paradox held up our release last year trying to figure out what to
do about this.  In the end, we rescinded full casefolding on inverted
bracketed character classes alone, but left it intact elsewhere.

I don't know whether that was the best solution.  Actually, I'm pretty
sure it isn't.  I do not at all like things working differently under
different circumstances, because that will again lead to paradoxes,
and paradoxes to surprises, and surprises to incorrect code.

And above all, we don't want incorrect code.

If you're looking for prior art, I know of only three pattern matching
languages that support full casefolding.  These are Perl, Ruby, and 
Matthew Barnett's "regexp" library for Python.

Russ Cox reminds me that casefolding rules should apply to everything,
and that this leads to paradoxes.  Would you expect this to fail?

    "guess" =~ /^\P{Block=Latin1}+$/i

Does not "guess" consist of all non-Latin1 character?  Well, sure, but
the case-insensitive mode throws it open to the paradox.  In the same
way, he holds that 

    "guess" =~ /^.+$/i

or

    "guess" =~ /^\p{Any}+$/i

should also fail due to transitivity.  

I see his point, but I don't like where all this leads.

EDITED TO ADD: Here is a portion of Russ's mail:

> However, it does seem to me that since /ß/i matches "ss" and . or
> \p{Any} are logically a large alternation of individual characters,
> that implies that /./i matches "ss", in the same way that /(A-Z)+/i
> matches "abc".  So I think that there are certain unexpected
> _positive_ matches implied by full case folding.  I would hesitate to
> suggest that there are unexpected negative matches, since case folding
> typically only adds alternatives.  (The exception would be expressions
> like /^[^class]+$/ in which the negation of an unexpectedly enlarged
> class turns out to be unexpectedly small.)
> 
> As an illustration of the positive match problem, I find it very hard
> to draw a dividing line anywhere in this sequence:

>   "guess" =~ /^guess$/i
>   "guess" =~ /^gueß$/i
>   "guess" =~ /^gue[ß]$/i
>   "guess" =~ /^gue[ß]$/i
>   "guess" =~ /^gue[\xC0-\xFF]$/i
>   "guess" =~ /^gue[\x00-\x{10FFFF}]$/i
>   "guess" =~ /^gue.$/i
>   "guess" =~ /^....$/i
>   "guess" =~ /^.{4}$/i
>   
>  And yet I think we all agree that the first two are necessarily true
>  and that the last is somewhat ridiculous.  At what point, though, does
>    it go wrong?

As you know, Mark, I believe the only sane way to do this is to rewrite 

    full_casefold(STRING) =~ /PATTERN/

That is, apply the mapping on the string, then write the pattern with that  
in mind, and have no (?i) mode operant.  

I do not mean to dissuade the committee from recommending this, but
be aware that there remain at least three problems with this approach:

    1. performance: you need auxiliary space to hold the casefold of
       each string you ever wish to match case-insensitively

    2. correctness: you now force people to understand the casefold of
       arbitrary strings, which is really asking a great deal.  But they
       have to know that in order to write that pattern correctly.

    3. atomicity: it is common to wish to apply case-insensitivity to 
       only particular subportions of a match, but not to others.  For example:

        STRING =~ /PAT1(?i:PAT2)PAT3/

       This is normal, where only PAT2 but not PAT1 nor PAT3 are matched
       irrespective of case.   This cannot be accomplished using the
       strategy of mapping the string to be matched against to its casefold;
       that is, there is no way to write this correctly:
        full_casefold(STRING) =~ /PAT1(?i:PAT2)PAT3/

       Because the remapping has already been already done on the whole
       string.  It is an indivisible mapping, and that is too much atomicity
       for customary use.

All these same issues arise equally with any notion of "canonical matching", or 
"matching as though the string were in NFD," or any other normalization form.
That is, this approach:

    nfd(STRING) =~ /PATTERN/

is one that I believe is necessary for sanity, but it has all the same three
problems/issues/concerns as I have just raised regarding matching done in 
case-insensitive mode: it takes both time and space, it requires sophistication
(=expertise) and care (=diligence) on the pattern's author, and it is an
indivisible all-or-nothing step.

EDITED TO ADD:
   Even if the committee follows up with pri182's recommendation to relax the requirement of full casefolding to allow for matching to be done using simple casemapping alone, and to require a mapping as I have shown above for full casefolding to avoid the more egregious of the paradoxes, all three of my issues (performance, correctness, atomicity) still apply to the rewrite scheme. It might be worth mentioning these neccesarily attendant issues for those who decide to pursue using the rewrite approach.

 Furthermore, some paradoxes remain even under simple casefolding. This is because the simple casefolds of LATIN SMALL LETTER LONG S and KELVIN SIGN both have simple casefolds in the ASCII range.  

Therefore this:

        "guess" =~ /^\P{Block=Latin Extended-A}+$/i

which says much contain only code points that are *not* from that block, would *fail* even though it indeed contains only code points that are not in that block. That's because 
LATIN SMALL LETTER LONG S simple casefolds to plain s, which is 
present in the string (twice).  Therefore it would fail, and really shock people (I feel).

Nobody is expecting this paradox.  Notice how getting rid of full casefolding does not banish all paradoxes!

 Karl further reports:

> The proposed draft does explicitly use the Phonetic_Extensions block as 
> an example of one that isn't closed under /i.  I submit that the above 
> text, which appears to have come from Russ, suggests that he and Tom, 
> like me, would be surprised that blocks match differently under /i vs 
> not.  And that reinforces my point that doing so will lead to bugs and 
> programmer time wasted trying to figure out what's going on.  Asmus last 
> year was of the opinion that all properties should be immune to /i 
> matching.  It's clear from Perl user feedback that they do expect the 
> casing properties to match differently.  But, I believe they would be 
> surprised, and unhappy, to find that /\p{ASCII}/i matches the KELVIN 
> SIGN and LATIN SMALL LETTER LONG S.

END ADDITION

=========================

The second issue is one that has been staring at us for a long time in tr18.
It does not relate to the proposed changes, however, and so I have not
submitted it against the proposed update.  It is still a real issue, though,
and needs to be clarified in some future draft, because otherwise implementers
cannot know to do it consistently (or arguably, correctly).

Karl and I, and Russ and I alike, have discussed how there is a genuine
problem with tr18's definition of \R (a linebreak grapheme) and \X (any
extended grapheme cluster).  The definitions given in tr18 for those 
**ARE NOT SUFFICIENT** to guarantee correct behavior.  This must be fixed.

Again using Perl notation,

    "\r\n" =~ /^\R$/

or more explicitly

    "\r\n" =~ /\A\R\z/

works correctly given that \R must mean

    (?:\r\n|\p{VertSpace})

or, in (?x) mode for legibility / breathing room:

    (?: \r\n | \p{VertSpace} )
However, that simplistic definition causes these to match:

    "\r\n" =~ /^\R{2}$/
    "\r\n" =~ /^\R\p{Any}$/

and that is wrong: you can't back up and reconsider an alternate
match for the linebreak grapheme.

That's why Perl implements \R as a "possessive" match.

    (?>\r\n|\p{VertSpace})

That means that once it's decided on \r\n, it can never backup
to retry the second branch in the alternative.

You can do it that way in a backtracking matcher, but not in traditional, non-
backtracking finite automaton.  There you must write it with a lookahead
negation, such as having \R actually match

    (?: \r\n 
      | \p{VertSpace} 
    )
    (?! \p{VertSpace} )

That is, it forces the longest possible sense of each \R linebreak grapheme.

EDITED TO ADD:
  That of course is not correct.  Russ Cox mentions in his own mail the following:

> Tom's suggested definition for \R is not quite right, I don't think,
> as it prohibits \p{VertSpace} before another \p{VertSpace}.  A
> possibly better definition might be
>
>     (?:\r\n|(?!\r\n)\p{VertSpace})
>
> Something similar should be possible for \X.

END ADDITION

Again, this is not implementable in a matcher that forbids lookaheads.
However, I believe (and Russ will please correct me if I am wrong)
that that restriction *could* be suspended for a one-codepoint lookahead
in a way that does not compromise the computability requirements of the
matcher.  This I believe because such matchers can already implement 
boundaries like \b and \B, so I think they can also implement this sort
of thing, were they so inclined.

EDITED TO ADD: Russ confirms that this can be indeed still be done in a DFA/NFA environment. See his mail for details. END ADDITION

The same issue arises with \X.  Consider

    nfd("café") =~ /^caf\X$/            # this must pass
    nfd("café") =~ /^caf\X.$/           # this must FAIL

You cannot ever interpret \X as matching the "e" alone to leave
room for dot to then match COMBINING ACUTE ACCENT, because that
would cause your \X to break at the wrong kind of boundary.

Perl addresses this by implementing \X as a possessive match.

All this is far simpler to explain if we pretend that \X matches (?:\PM\pM*).  
It *doesn't*, of course, because an extended grapheme cluster is quite a bit
more complex than that.  However, that simplistic toy definition suffices to
illustrate the problem.

Suppose that \X really meant

    (?:\PM\pM*)

then the backtracking would mess up the second case.    
This would match, but it needs to fail:

    nfd("café") =~ /^caf\X.$/           # this must FAIL

That's why it has to be one of 

    (?>\PM\pM*)

or

    (?:\PM\pM*+)

depending on your notational preferences.

Again, this poses a problem for NFA/DFA implementations, since backtracking 
is meaningless there.  You have to rewrite it all with a lookahead conformant
with the grapheme boundary conditions spelt out in tr29 on Unicode Text
Segmentation.

EDITED TO ADD:
     I believe that a non-backtracking engine (thus without "possessive" matches) could implement this (admittedly toy) definition of a grapheme cluster by changing it to 

    (?:\PM\pM*)(?!\pM)

     That should stop it from breaking up a mark that should be connected.

END ADDITION
      
Off the top of my head, I believe the actual definition for \X must be
more like this (in /x mode of course):

    (?>  \R
      |  \p{Grapheme_Base} \p{Grapheme_Extend}*+
      |  \p{Grapheme_Extend}++
      |  \p{Any}
    )

and where \R is also the possessive version described previously.

Again, that only works if your matcher can do possessive matching. Otherwise
you have to do something much more complicated with special-purpose lookahead
negations.   Which assumes then that they can at least do that.  I think a
single code-point lookahead should again suffice here with \X as it did with
\R, but I have not thought the matter through; tr29 is rather sticky.

As with \R, for \X I do not believe that any theoretical concerns regarding
computability are insurmountable, but I know of no NFA/DFA implementation of
\X.  I again hope that Russ will correct me if I am wrong about any of that.

Mark, please feel free to repost, redistribute, or use this letter in
part or in whole in any way you might see fit.

Also please do not hesitate to request that I split up the issues and
submit them as separate feedbacks, although if you do that, I could perhaps
stand some guidance.  

I've put it all in this long note because of the urgent timing.  If you
have any questions or comments about any of this, I will try to respond
with all possible alacrity, in case it may have bearing on your meeting.

Hope this helps,

--tom

    Tom Christiansen
    Boulder, Colorado