Re: [icu-support] Semantic issues with case-insensitive regex matching

From: Mark Davis ☕ (mark@macchiato.com)
Date: Sun Nov 07 2010 - 18:07:22 CST

Next message: Martin J. Dürst: "Fwd: RFC 6082 on Deprecating Unicode Language Tag Characters: RFC 2482 is Historic"

Previous message: Doug Ewell: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

UTS#18 has a discussion of matching with normalization and case-folding. In
practice, however, it turns out to be difficult to implement efficiently.
The matching itself is not that difficult to define:

Given a pattern P and a text transform T, *P is T-insensitive* IFF

- for any two strings S and Z, if T(S) = T(Z) then P(S) = P(Z)

So, for example, a case-insensitive pattern like (?i)(d)z should match in:

DZUR
Dzur
ǲur
dzur

(The third string uses U+01F2 ( ǲ ) LATIN CAPITAL LETTER D WITH SMALL
LETTER Z; see http://unicode.org/cldr/utility/character.jsp?a=01F2)
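
To make the example concrete, here is a minimal Python sketch. The function
name transform is hypothetical, and the choice of T (NFKC normalization
combined with case folding, approximating NFKC_Casefold) is an assumption,
not something UTS#18 or ICU prescribes. Case folding alone maps U+01F2 to
U+01F3 ( ǳ ), a single character, so a compatibility normalization step is
needed before "dz" can match it.

```python
import re
import unicodedata


def transform(text):
    """Hypothetical T: compatibility-normalize, case-fold, then renormalize."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


samples = ["DZUR", "Dzur", "\u01F2ur", "dzur"]  # the third uses U+01F2
pattern = re.compile("(d)z")

# Apply T to every sample, then run the plain pattern on the result.
transformed = [transform(sample) for sample in samples]
matches = [pattern.match(text) is not None for text in transformed]

for sample, text, matched in zip(samples, transformed, matches):
    print(repr(sample), "->", repr(text), "matched:", matched)
```

All four strings have the same transform, so a pattern that is insensitive
under this T has to treat them alike. Python's own re.IGNORECASE does not
match "(d)z" against the third string, because it compares character by
character.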

But what is tricky is defining what the capture groups capture (when there
is reordering/growth/shrinkage), like the "(d)" in the pattern above, and
doing the matching in an efficient way. If you have suggestions for defining
such operations so that they can be implemented efficiently, it would be
useful to start the discussion.

Mark

*— Il meglio è l’inimico del bene —*

On Sun, Nov 7, 2010 at 13:57, karl williamson <public@khwilliamson.com> wrote:

> I submitted the text below to the unicore mailing list, and got no good
> answer, except recently, to try this list instead. I couldn't find in
> the ICU documentation how the issues that this message raises are dealt
> with. I'm hopeful someone here will respond.
> ----
>
> It would be good if TR18 were enhanced with more discussion of case
> insensitive matching. Chapter 3 of the standard defines the Default
> Caseless Matching algorithm, but it applies only to two strings, and
> extending it to apply to patterns is not trivial, and is totally
> unspecified, as far as I have seen.
>
> In particular, the use of a property in a regular expression pattern
> with caseless matching introduces a number of issues that I don't
> believe are addressed anywhere in the standard.
>
> For example, should these match?
> 'N' =~ /\p{Gc=Lowercase_Letter}/i
> 'n' =~ /\p{Gc=Uppercase_Letter}/i
>
> I thought the answer was true for both of these, but then, what about
> "\N{MICRO SIGN}" =~ /\p{Block=Greek}/i
> "\N{MICRO SIGN}" =~ /\p{Script=Greek}/i
>
> because the fold of MICRO SIGN is in the Greek block and script? It
> doesn't seem right to me that a character should match a different
> script than the one it's in under caseless matching. Similarly, there
> are a number of characters whose fold has a different Age, Soft_Dotted,
> East_Asian_Width, Math, Decomposition_Type, Line_Break, or
> Full_Composition_Exclusion property value, besides the ones I would
> expect, like Changes_When_Case_Folded, and General_Category. The
> YPOGEGRAMMENI, as always, introduces even more.
>
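
The MICRO SIGN case can be checked with a short Python sketch. It uses only
the standard unicodedata module, which exposes names, general categories and
decompositions but not Script or Block, so the character name stands in for
the script here. The describe helper is hypothetical.

```python
import unicodedata

micro = "\N{MICRO SIGN}"
folded = micro.casefold()  # full case folding of U+00B5


def describe(char):
    """Return code point, name, general category and decomposition."""
    return (
        f"U+{ord(char):04X}",
        unicodedata.name(char),
        unicodedata.category(char),
        unicodedata.decomposition(char),
    )


print(describe(micro))
print(describe(folded))
```

The fold is U+03BC GREEK SMALL LETTER MU. Both characters are lowercase
letters, but only MICRO SIGN has a compatibility decomposition, so the two
differ in Decomposition_Type as well as in script and block.
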
> So perhaps caseless matching shouldn't apply to some properties? If so,
> which ones should be spelled out. Certainly, some properties should
> have caseless matching rules. For example, I believe,
>
> "A" =~ /\p{Name=Latin Small Letter A}/i
>
> should match. Here's another example where allowing the property to
> match any case can lead to problems.
>
> "\N{LATIN SMALL LIGATURE FF}" =~
> /\p{ASCII_Hex_Digit=Y}\p{ASCII_Hex_Digit=Y}/i
>
> The pattern seems to indicate that only ASCII digits are desired; yet it
> could match something non-ASCII, potentially leading to a spoofing attack.
>
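
A small Python sketch of the ligature case follows. The is_ascii_hex_digit
helper is a hypothetical stand-in for \p{ASCII_Hex_Digit=Y}, and applying the
property test to the folded text is only one possible reading of caseless
property matching, assumed here for illustration.

```python
import string

ligature = "\N{LATIN SMALL LIGATURE FF}"
folded = ligature.casefold()  # full case folding expands the ligature


def is_ascii_hex_digit(char):
    """Hypothetical stand-in for \\p{ASCII_Hex_Digit=Y} on one character."""
    return len(char) == 1 and char in string.hexdigits


original_passes = is_ascii_hex_digit(ligature)
folded_passes = all(is_ascii_hex_digit(char) for char in folded)
print(repr(folded), original_passes, folded_passes)
```

The ligature itself is not an ASCII hex digit, yet its fold is two of them.
A matcher that tests the property on folded text would therefore accept a
non-ASCII character.
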
> TR18 is also silent on another issue I've brought up before, and gotten
> no response to. A number of languages, including ICU I believe, allow
> for regular expression capture buffers. These allow for saving some
> portion(s) of the original string that matched some sub-part of the
> pattern. But when you convert the string into something else for
> matching, such as normalizing it, and then match against that, and you
> have capture buffers, those buffers should return not some portion of
> the converted string, but the corresponding portion of the original,
> which you may not be able to get back to. This can happen even without
> normalization if the string folds to more than one character:
>
> "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i
>
> should match, as should
>
> "\N{LATIN SMALL LIGATURE FI}" =~ /[f][i]/i
>
> Hence, so should
>
> "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i
>
> But, the parentheses mean capture buffers, and there is no 1-to-1
> correspondence between either of these buffers and any atomic part of
> the string. I don't know what should happen here, and I think TR18
> should address this.
>
> So how do I go about getting someone or someones thinking about these
> issues to add to TR18?


This archive was generated by hypermail 2.1.5 : Sun Nov 07 2010 - 18:13:53 CST