Re: [icu-support] Semantic issues with case-insensitive regex matching

From: Mark Davis ☕ (mark@macchiato.com)
Date: Sun Nov 07 2010 - 18:07:22 CST

    UTS#18 has a discussion of matching with normalization and case-folding. In
    practice, however, it turns out to be difficult to implement efficiently.
    The matching itself is not that difficult to define:

    Given a pattern P and a text transform T, *P is T-insensitive* IFF

       - for any two strings S and Z, T(S) = T(Z) if and only if P(S) = P(Z)

    So, for example, a case-insensitive pattern like (?i)(d)z should match in:

    DZUR
    Dzur
    ǲur
    dzur

    (The third string uses
    U+01F2<http://unicode.org/cldr/utility/character.jsp?a=01F2> ( ǲ )
    LATIN CAPITAL LETTER D WITH SMALL LETTER Z)
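
    As a rough illustration of the definition above, here is a minimal Python
    sketch. It assumes T is "NFKC-normalize, then case-fold", which is only
    one possible choice of transform, and the helper names (transform,
    t_insensitive_search) are made up for this example. It is not how ICU
    implements matching.

```python
import re
import unicodedata


def transform(text: str) -> str:
    """Assumed T: compatibility-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", text).casefold()


def t_insensitive_search(pattern: str, text: str):
    """Search the transformed text; the pattern is assumed to be
    written in already-folded form (here, lowercase ASCII)."""
    return re.search(pattern, transform(text))


# The four sample strings; the third starts with U+01F2.
samples = ["DZUR", "Dzur", "\u01F2ur", "dzur"]
results = [bool(t_insensitive_search("(d)z", s)) for s in samples]
```

    Under this assumed T all four strings transform to the same text, so the
    pattern either matches all of them or none of them. Transforming the whole
    text up front is simple but loses the link back to the original string.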

    But what is tricky is defining what the capture groups capture (when there
    is reordering/growth/shrinkage) -- like the "(d)" in the pattern above --
    and doing the matching in an efficient way. If you have some suggestions for
    defining such operations so that they can be implemented efficiently,
    it would be useful to start the discussion.
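
    To make the capture-group problem concrete, here is a hedged Python
    sketch that folds each code point separately and records which original
    index produced each folded character. The helper names
    (fold_with_offsets, original_span) and the choice of NFKC plus case
    folding are assumptions for this example. Folding one code point at a
    time also ignores any reordering that normalization can cause across
    characters, so this covers growth only.

```python
import re
import unicodedata


def fold_with_offsets(text: str):
    """Fold each code point on its own and remember its origin.

    Returns (folded_text, origin), where origin[i] is the index in
    `text` of the character that produced folded_text[i].
    """
    folded = []
    origin = []
    for index, char in enumerate(text):
        piece = unicodedata.normalize("NFKC", char).casefold()
        folded.append(piece)
        origin.extend([index] * len(piece))
    return "".join(folded), origin


def original_span(origin, start, end):
    """Map a non-empty [start, end) span of the folded text to the
    smallest span of whole original characters that covers it."""
    return origin[start], origin[end - 1] + 1


text = "\u01F2ur"  # starts with U+01F2
folded, origin = fold_with_offsets(text)
match = re.search("(d)z", folded)
group_span = original_span(origin, *match.span(1))
whole_span = original_span(origin, *match.span(0))
```

    Here the folded text is "dzur", and both the "(d)" group and the whole
    match map back to the same single original character, index 0 to 1. The
    group cannot report just the "d" part, because no such substring exists
    in the original text.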

    Mark

    *— Il meglio è l’inimico del bene —*

    On Sun, Nov 7, 2010 at 13:57, karl williamson <public@khwilliamson.com> wrote:

    > I submitted the text below to the unicore mailing list, and got no good
    > answer, except recently, to try this list instead. I couldn't find in
    > the ICU documentation how the issues that this message raises are dealt
    > with. I'm hopeful someone here will respond.
    > ----
    >
    > It would be good if TR18 were enhanced with more discussion of case
    > insensitive matching. Chapter 3 of the standard defines the Default
    > Caseless Matching algorithm, but it applies only to two strings, and
    > extending it to apply to patterns is not trivial, and is totally
    > unspecified, as far as I have seen.
    >
    > In particular, the use of a property in a regular expression pattern
    > with caseless matching introduces a number of issues that I don't
    > believe are addressed anywhere in the standard.
    >
    > For example, should 'N' =~ /\p{Gc=Lowercase_Letter}/i match?
    > Should 'n' =~ /\p{Gc=Uppercase_Letter}/i match?
    >
    > I thought the answer to both of these was yes, but then, what about
    > "\N{MICRO SIGN}" =~ /\p{Block=Greek}/i
    > "\N{MICRO SIGN}" =~ /\p{Script=Greek}/i
    >
    > because the fold of MICRO SIGN is in the Greek block and script? It
    > doesn't seem right to me that a character should match a different
    > script than the one it's in under caseless matching. Similarly, there
    > are a number of characters whose fold has a different Age, Soft_Dotted,
    > East_Asian_Width, Math, Decomposition_Type, Line_Break, or
    > Full_Composition_Exclusion property value, besides the ones I would
    > expect, like Changes_When_Case_Folded, and General_Category. The
    > YPOGEGRAMMENI, as always, introduces even more.
    >
    > So perhaps caseless matching shouldn't apply to some properties? If so,
    > which ones should be spelled out. Certainly, some properties should
    > have caseless matching rules. For example, I believe,
    >
    > "A" =~ /\p{Name=Latin Small Letter A}/i
    >
    > should match. Here's another example where allowing the property to
    > match any case can lead to problems.
    >
    > "\N{LATIN SMALL LIGATURE FF}" =~
    > /\p{ASCII_Hex_Digit=Y}\p{ASCII_Hex_Digit=Y}/i
    >
    > The pattern seems to indicate that only ASCII digits are desired; yet it
    > could match something non-ASCII, potentially leading to a spoofing attack.
    >
    > TR18 is also silent on another issue I've brought up before, and gotten
    > no response to. A number of languages, including ICU I believe, allow
    > for regular expression capture buffers. These allow for saving some
    > portion(s) of the original string that matched some sub-part of the
    > pattern. But when you convert the string into something else for
    > matching, such as normalizing it, and then match against that, and you
    > have capture buffers, those buffers should return not some portion of
    > the converted string, but the corresponding portion of the original,
    > which you may not be able to get back to. This can happen even without
    > normalization if the string folds to more than one character:
    >
    > "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i
    >
    > should match, as should
    >
    > "\N{LATIN SMALL LIGATURE FI}" =~ /[f][i]/i
    >
    > Hence, so should
    >
    > "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i
    >
    > But, the parentheses mean capture buffers, and there is no 1-to-1
    > correspondence between either of these buffers and any atomic part of
    > the string. I don't know what should happen here, and I think TR18
    > should address this.
    >
    > So how do I go about getting someone or someones thinking about these
    > issues to add to TR18?


