detecting case context

From: Theo Veenker ([email protected])
Date: Thu Mar 24 2005 - 08:32:09 CST

Next message: Patrick Andries: "Re: Security Issues"

Previous message: Chris Jacobs: "Re: Security Issues"
Next in thread: Markus Scherer: "Re: detecting case context"
Reply: Markus Scherer: "Re: detecting case context"

Could someone who has implemented the functions for detecting the
case context of a character in a string please look at the code
below? This is for 4.1.0.

The descriptions for Final_Sigma and Before_Dot are clear to me. For
After_Soft_Dotted, More_Above and After_I, I don't see how the
descriptions and the regexps represent *exactly* the same thing: I don't
see the \p{cc=0} parts reflected in the descriptions. Also, isn't the
After_I regexp missing a "*"?

The functions below represent what I make of the descriptions and the
regexps. Are they correct?

TIA,
Theo

Final_Sigma

C is preceded by a sequence consisting of a cased letter and a
case-ignorable sequence, and C is not followed by a sequence
consisting of an ignorable sequence and then a cased letter.

Regexp Before C: \p{cased} (\p{case-ignorable})*
Regexp After C: ! ( (\p{case-ignorable})* \p{cased} )

bool inFinalSigmaContext(const wchar* s, int i)
{
    bool after_cased = false;

    int j = i;
    while (--j >= 0) {
        if (isCased(s[j])) {
            after_cased = true;
            break;
        }
        if (!isCaseIgnorable(s[j])) break;
    }
    if (!after_cased) return false;

    bool before_cased = false;

    while (s[++i]) {
        if (isCased(s[i])) {
            before_cased = true;
            break;
        }
        if (!isCaseIgnorable(s[i])) break;
    }

    return !before_cased;
}
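
To try the Final_Sigma check on concrete input, here is a minimal
self-contained sketch. It is not the original code: it uses char32_t
instead of wchar, and isCased / isCaseIgnorable are hypothetical stand-ins
that only know basic Latin and Greek letters, the apostrophe, the soft
hyphen and the combining marks U+0300..U+036F. A real implementation
would look these properties up in the Unicode data files.

```cpp
// Hypothetical stand-ins for real Unicode property lookups. They cover
// only the few characters needed for this illustration.
static bool isCased(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') ||
           (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2) ||  // Greek capitals
           (c >= 0x03B1 && c <= 0x03C9);                   // Greek small letters
}

static bool isCaseIgnorable(char32_t c) {
    return c == U'\'' || c == 0x00AD ||      // apostrophe, soft hyphen
           (c >= 0x0300 && c <= 0x036F);     // combining diacritical marks
}

// s is a NUL-terminated UTF-32 string; i is the index of C within s.
bool inFinalSigmaContext(const char32_t* s, int i) {
    // Before C: a cased letter, then any number of case-ignorable characters.
    bool after_cased = false;
    for (int j = i - 1; j >= 0; --j) {
        if (isCased(s[j])) { after_cased = true; break; }
        if (!isCaseIgnorable(s[j])) break;
    }
    if (!after_cased) return false;

    // After C: there must NOT be case-ignorable characters followed by a
    // cased letter.
    for (int j = i + 1; s[j] != 0; ++j) {
        if (isCased(s[j])) return false;
        if (!isCaseIgnorable(s[j])) break;
    }
    return true;
}
```

With these stand-ins, the sigma in U"\u039F\u03A3" (index 1) is in
Final_Sigma context, while the one in U"\u039F\u03A3\u039F" is not,
because a cased letter follows it.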

After_Soft_Dotted

Character C is in After_Soft_Dotted context if the last preceding
character with a combining class of zero before C was Soft_Dotted,
and there is no intervening combining character class 230 (ABOVE).

Regexp Before C: [[\p{Soft_Dotted}] ([^\p{cc=230} \p{cc=0}])*
(I assume the leading "[[" is a typo.)

bool inAfterSoftDottedContext(const wchar* s, int i)
{
    bool after_SD = false;

    while (--i >= 0) {
        if (isSoftDotted(s[i])) {
            after_SD = true;
            break;
        }
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE || cc == 0) break;
    }

    return after_SD;
}
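
One possible reading of the \p{cc=0} part, offered as an assumption and
not as a confirmed interpretation: "the last preceding character with a
combining class of zero" means that no other class-0 character may sit
between the Soft_Dotted character and C, which is the same as excluding
\p{cc=0} from the intervening characters. The sketch below checks that
reading on a few strings. The property functions are hypothetical
miniature tables covering only the listed code points, and char32_t
replaces wchar.

```cpp
const int CCC_ABOVE = 230;

// Hypothetical miniature tables, for illustration only.
static int getCanonicalCombiningClass(char32_t c) {
    switch (c) {
        case 0x0300: case 0x0301: case 0x0303: case 0x0307:
            return 230;                 // grave, acute, tilde, dot above
        case 0x0323: case 0x0331:
            return 220;                 // dot below, macron below
        case 0x0328:
            return 202;                 // ogonek
        default:
            return 0;                   // everything else counts as a base
    }
}

static bool isSoftDotted(char32_t c) {
    return c == U'i' || c == U'j' || c == 0x012F;  // i, j, i with ogonek
}

// s is a UTF-32 string; i is the index of C within s.
bool inAfterSoftDottedContext(const char32_t* s, int i) {
    while (--i >= 0) {
        if (isSoftDotted(s[i])) return true;
        int cc = getCanonicalCombiningClass(s[i]);
        // A mark above or another base character ends the context.
        if (cc == CCC_ABOVE || cc == 0) return false;
    }
    return false;
}
```

Under these assumptions a dot above after "i" plus dot below
(U"i\u0323\u0307", index 2) is in context, while in U"ia\u0307" the
intervening "a" has class 0 and stops the scan.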

More_Above

C is followed by one or more characters of combining class 230 (ABOVE)
in the combining character sequence.

Regexp After C: [^\p{cc=0}]* [\p{cc=230}]

bool inMoreAboveContext(const wchar* s, int i)
{
    bool more_above = false;

    while (s[++i]) {
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE) {
            more_above = true;
            break;
        }
        if (cc == 0) break;
    }

    return more_above;
}
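
For More_Above, "in the combining character sequence" can be read as
"before the next character of combining class 0", which would account
for the [^\p{cc=0}]* in the regexp. That reading is an assumption. A
small sketch to exercise it, with a hypothetical miniature combining
class table and char32_t in place of wchar:

```cpp
const int CCC_ABOVE = 230;

// Hypothetical miniature table, for illustration only.
static int getCanonicalCombiningClass(char32_t c) {
    switch (c) {
        case 0x0300: case 0x0301: case 0x0303: case 0x0307:
            return 230;                 // marks above
        case 0x0323: case 0x0331:
            return 220;                 // marks below
        default:
            return 0;                   // everything else counts as a base
    }
}

// s is a NUL-terminated UTF-32 string; i is the index of C within s.
bool inMoreAboveContext(const char32_t* s, int i) {
    for (int j = i + 1; s[j] != 0; ++j) {
        int cc = getCanonicalCombiningClass(s[j]);
        if (cc == CCC_ABOVE) return true;   // found a mark above
        if (cc == 0) break;                 // next base: sequence has ended
    }
    return false;
}
```

Here U"I\u0323\u0300" puts the "I" in More_Above context (the dot below
is skipped, the grave is found), whereas U"Ia\u0300" does not, since the
grave belongs to the "a".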

Before_Dot

Character C is in Before_Dot context if C is followed by combining
dot above (U+0307). Any sequence of characters with a combining class
that is neither 0 nor 230 may intervene between the current character
and the combining dot above.

Regexp After C: ([^\p{cc=230} \p{cc=0}])* [\u0307]

bool inBeforeDotContext(const wchar* s, int i)
{
    bool before_dot = false;

    while (s[++i]) {
        if (s[i] == 0x0307) {
            before_dot = true;
            break;
        }
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE || cc == 0) break;
    }

    return before_dot;
}
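
A few concrete inputs for the Before_Dot rule, as a self-contained
sketch. The combining class table is a hypothetical stand-in limited to
the listed marks, and char32_t replaces wchar. Note that U+0307 itself
has class 230, so it has to be tested before the class check.

```cpp
const int CCC_ABOVE = 230;

// Hypothetical miniature table, for illustration only.
static int getCanonicalCombiningClass(char32_t c) {
    switch (c) {
        case 0x0300: case 0x0301: case 0x0303: case 0x0307:
            return 230;                 // marks above
        case 0x0323: case 0x0331:
            return 220;                 // marks below
        default:
            return 0;                   // everything else counts as a base
    }
}

// s is a NUL-terminated UTF-32 string; i is the index of C within s.
bool inBeforeDotContext(const char32_t* s, int i) {
    for (int j = i + 1; s[j] != 0; ++j) {
        if (s[j] == 0x0307) return true;    // combining dot above
        int cc = getCanonicalCombiningClass(s[j]);
        // Another mark above or a base character blocks the dot.
        if (cc == CCC_ABOVE || cc == 0) break;
    }
    return false;
}
```

With this table, "I" is in Before_Dot context in U"I\u0323\u0307" but not
in U"I\u0301\u0307", where the acute (class 230) intervenes.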

After_I

Character C is in After_I context if the last preceding base character
was an uppercase I, and there is no intervening combining character
class 230 (ABOVE).

Regexp Before C: [I] ([^\p{cc=230} \p{cc=0}])
(Shouldn't this be: [I] ([^\p{cc=230} \p{cc=0}])* ?)

bool inAfterIContext(const wchar* s, int i)
{
    bool after_I = false;

    while (--i >= 0) {
        if (s[i] == 'I') {
            after_I = true;
            break;
        }
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE || cc == 0) break;
    }

    return after_I;
}

// Or following the regexp:

bool inAfterIContext(const wchar* s, int i)
{
    bool after_I = false;

    if (i >= 2) {
        int cc = getCanonicalCombiningClass(s[i-1]);
        if (s[i-2] == 'I' && !(cc == CCC_ABOVE || cc == 0))
            after_I = true;
    }

    return after_I;
}
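
The two After_I readings disagree on real input, which may help decide
whether the "*" is missing. The sketch below places both side by side
under the hypothetical names inAfterIContextLoop (following the
description) and inAfterIContextLiteral (following the regexp as
printed). The combining class table is again a hypothetical stand-in,
and char32_t replaces wchar.

```cpp
const int CCC_ABOVE = 230;

// Hypothetical miniature table, for illustration only.
static int getCanonicalCombiningClass(char32_t c) {
    switch (c) {
        case 0x0300: case 0x0301: case 0x0303: case 0x0307:
            return 230;                 // marks above
        case 0x0323: case 0x0331:
            return 220;                 // marks below
        default:
            return 0;                   // everything else counts as a base
    }
}

// Following the description: any number of intervening marks that are
// neither class 230 nor class 0.
bool inAfterIContextLoop(const char32_t* s, int i) {
    while (--i >= 0) {
        if (s[i] == U'I') return true;
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE || cc == 0) return false;
    }
    return false;
}

// Following the regexp as printed (no "*"): exactly one intervening mark.
bool inAfterIContextLiteral(const char32_t* s, int i) {
    if (i < 2) return false;
    int cc = getCanonicalCombiningClass(s[i - 1]);
    return s[i - 2] == U'I' && cc != CCC_ABOVE && cc != 0;
}
```

For a dot above directly after the letter (U"I\u0307", index 1) the loop
version returns true and the literal version returns false. The same
split occurs with two intervening marks (U"I\u0323\u0331\u0307",
index 3). Both agree only when exactly one such mark intervenes.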


This archive was generated by hypermail 2.1.5 : Thu Mar 24 2005 - 08:25:17 CST