detecting case context

From: Theo Veenker (Theo.Veenker@let.uu.nl)
Date: Thu Mar 24 2005 - 08:32:09 CST

    Could someone who has implemented the functions for detecting the
    case context of a character in a string please look at the code
    below? This is for Unicode 4.1.0.

    The descriptions for Final_Sigma and Before_Dot are clear to me. For
    After_Soft_Dotted, More_Above and After_I, I don't see how the
    descriptions and the regexps represent *exactly* the same thing. For
    these I don't see the \p{cc=0} parts reflected in the descriptions.
    Also, isn't the After_I regexp missing a "*"?

    The functions below represent what I make of the descriptions and the
    regexps. Are they correct?

    TIA,
    Theo

    Final_Sigma

    C is preceded by a sequence consisting of a cased letter and a
    case-ignorable sequence, and C is not followed by a sequence
    consisting of a case-ignorable sequence and then a cased letter.

    Regexp Before C: \p{cased} (\p{case-ignorable})*
    Regexp After C: ! ( (\p{case-ignorable})* \p{cased} )

    bool inFinalSigmaContext(const wchar* s, int i)
    {
        // Scan backwards: skip case-ignorables, look for a cased letter.
        bool after_cased = false;

        int j = i;
        while (--j >= 0) {
            if (isCased(s[j])) {
                after_cased = true;
                break;
            }
            if (!isCaseIgnorable(s[j])) break;
        }
        if (!after_cased) return false;

        // Scan forwards: C must not be followed by case-ignorables
        // and then a cased letter.
        bool before_cased = false;

        while (s[++i]) {
            if (isCased(s[i])) {
                before_cased = true;
                break;
            }
            if (!isCaseIgnorable(s[i])) break;
        }

        return !before_cased;
    }
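
    To try the check on concrete strings, a minimal self-contained
    sketch follows. It assumes std::u32string in place of const wchar*,
    and its isCased/isCaseIgnorable are toy stand-ins that cover only a
    handful of code points, not the real property data. The names
    finalSigmaContext and checkFinalSigma are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy stand-ins for the real property lookups (assumptions for this
// sketch only): ASCII and basic Greek letters count as cased; apostrophe,
// soft hyphen, U+2019 and U+0300..U+036F count as case-ignorable.
static bool isCased(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') ||
           (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2) ||
           (c >= 0x03B1 && c <= 0x03C9);
}

static bool isCaseIgnorable(char32_t c) {
    return c == 0x0027 || c == 0x00AD || c == 0x2019 ||
           (c >= 0x0300 && c <= 0x036F);
}

// Same logic as inFinalSigmaContext, but bounded by the string length.
bool finalSigmaContext(const std::u32string& s, std::size_t i) {
    // Before C: \p{cased} (\p{case-ignorable})*
    bool afterCased = false;
    for (std::size_t j = i; j-- > 0;) {
        if (isCased(s[j])) { afterCased = true; break; }
        if (!isCaseIgnorable(s[j])) break;
    }
    if (!afterCased) return false;

    // After C: ! ( (\p{case-ignorable})* \p{cased} )
    for (std::size_t j = i + 1; j < s.size(); ++j) {
        if (isCased(s[j])) return false;
        if (!isCaseIgnorable(s[j])) break;
    }
    return true;
}

// A few hand-checked cases.
void checkFinalSigma() {
    const std::u32string odos = U"\u039F\u0394\u039F\u03A3";  // four Greek capitals
    assert(finalSigmaContext(odos, 3));    // sigma at the end of the word
    const std::u32string sas = U"\u03A3\u0391\u03A3";         // sigma, alpha, sigma
    assert(!finalSigmaContext(sas, 0));    // nothing cased before it
    assert(finalSigmaContext(sas, 2));     // cased before, nothing after
    const std::u32string mid = U"\u039F\u03A3'\u0391";        // omicron, sigma, ', alpha
    assert(!finalSigmaContext(mid, 1));    // cased letter follows the apostrophe
}
```

    Calling checkFinalSigma() from main() runs the cases; with real
    property tables the results for other inputs may of course differ.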

    After_Soft_Dotted

    Character C is in After_Soft_Dotted context if the last preceding
    character with a combining class of zero before C was Soft_Dotted,
    and there is no intervening combining character class 230 (ABOVE).

    Regexp Before C: [[\p{Soft_Dotted}] ([^\p{cc=230} \p{cc=0}])*
    (I assume the leading "[[" is a typo.)

    bool inAfterSoftDottedContext(const wchar* s, int i)
    {
        bool after_SD = false;

        // Scan backwards over combining marks until a Soft_Dotted
        // character, a class 230 mark, or a class 0 character is found.
        while (--i >= 0) {
            if (isSoftDotted(s[i])) {
                after_SD = true;
                break;
            }
            int cc = getCanonicalCombiningClass(s[i]);
            if (cc == CCC_ABOVE || cc == 0) break;
        }

        return after_SD;
    }
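
    A small sketch for checking this reading against a few sequences.
    The combining-class table and the Soft_Dotted set below are toy
    assumptions limited to a few code points, std::u32string stands in
    for const wchar*, and afterSoftDotted/checkAfterSoftDotted are
    hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy property tables (assumptions for this sketch, not full Unicode data).
static int combiningClass(char32_t c) {
    switch (c) {
        case 0x0300: case 0x0301: case 0x0303: case 0x0307:
            return 230;                       // marks above
        case 0x0323:
            return 220;                       // dot below
        case 0x0327: case 0x0328:
            return 202;                       // cedilla, ogonek
        default:
            return 0;
    }
}

static bool isSoftDotted(char32_t c) {
    return c == U'i' || c == U'j' || c == 0x012F;   // small subset only
}

// Before C: \p{Soft_Dotted} ([^\p{cc=230} \p{cc=0}])*
bool afterSoftDotted(const std::u32string& s, std::size_t i) {
    for (std::size_t j = i; j-- > 0;) {
        if (isSoftDotted(s[j])) return true;
        int cc = combiningClass(s[j]);
        if (cc == 230 || cc == 0) return false;
    }
    return false;
}

void checkAfterSoftDotted() {
    assert(afterSoftDotted(U"i\u0307", 1));         // dot directly after i
    assert(afterSoftDotted(U"i\u0323\u0307", 2));   // class 220 mark intervenes
    assert(!afterSoftDotted(U"i\u0301\u0307", 2));  // class 230 mark intervenes
    assert(!afterSoftDotted(U"a\u0307", 1));        // base is not Soft_Dotted
}
```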

    More_Above

    C is followed by one or more characters of combining class 230 (ABOVE)
    in the combining character sequence.

    Regexp After C: [^\p{cc=0}]* [\p{cc=230}]

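    bool inMoreAboveContext(const wchar* s, int i)
    {
        bool more_above = false;

        // Scan forwards within the combining character sequence:
        // stop at the first class 230 mark or at the next class 0 character.
        while (s[++i]) {
            int cc = getCanonicalCombiningClass(s[i]);
            if (cc == CCC_ABOVE) {
                more_above = true;
                break;
            }
            if (cc == 0) break;
        }

        return more_above;
    }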
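
    The same kind of sketch for More_Above, again with a toy
    combining-class table (an assumption, only a few code points),
    std::u32string instead of const wchar*, and made-up names
    moreAbove/checkMoreAbove.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy combining-class table (assumption for this sketch only).
static int combiningClass(char32_t c) {
    switch (c) {
        case 0x0300: case 0x0301: case 0x0303: case 0x0307:
            return 230;                       // marks above
        case 0x0323:
            return 220;                       // dot below
        case 0x0327: case 0x0328:
            return 202;                       // cedilla, ogonek
        default:
            return 0;
    }
}

// After C: [^\p{cc=0}]* [\p{cc=230}]
bool moreAbove(const std::u32string& s, std::size_t i) {
    for (std::size_t j = i + 1; j < s.size(); ++j) {
        int cc = combiningClass(s[j]);
        if (cc == 230) return true;    // found a mark above
        if (cc == 0) return false;     // left the combining sequence
    }
    return false;
}

void checkMoreAbove() {
    assert(moreAbove(U"I\u0301", 0));         // acute directly after I
    assert(moreAbove(U"I\u0323\u0301", 0));   // dot below, then acute
    assert(!moreAbove(U"I\u0323", 0));        // only a mark below
    assert(!moreAbove(U"Ia\u0301", 0));       // acute belongs to the next base
}
```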

    Before_Dot

    Character C is in Before_Dot context if C is followed by combining
    dot above (U+0307). Any sequence of characters with a combining class
    that is neither 0 nor 230 may intervene between the current character
    and the combining dot above.

    Regexp After C: ([^\p{cc=230} \p{cc=0}])* [\u0307]

    bool inBeforeDotContext(const wchar* s, int i)
    {
        bool before_dot = false;

        // Scan forwards: U+0307 may only be separated from C by marks
        // whose combining class is neither 0 nor 230.
        while (s[++i]) {
            if (s[i] == 0x0307) {
                before_dot = true;
                break;
            }
            int cc = getCanonicalCombiningClass(s[i]);
            if (cc == CCC_ABOVE || cc == 0) break;
        }

        return before_dot;
    }

    After_I

    Character C is in After_I context if the last preceding base character
    was an uppercase I, and there is no intervening combining character
    class 230 (ABOVE).

    Regexp Before C: [I] ([^\p{cc=230} \p{cc=0}])
    (Shouldn't this be: [I] ([^\p{cc=230} \p{cc=0}])* ?)

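    bool inAfterIContext(const wchar* s, int i)
    {
        bool after_I = false;

        // Following the description: scan backwards over any number of
        // marks whose combining class is neither 0 nor 230.
        while (--i >= 0) {
            if (s[i] == 'I') {
                after_I = true;
                break;
            }
            int cc = getCanonicalCombiningClass(s[i]);
            if (cc == CCC_ABOVE || cc == 0) break;
        }

        return after_I;
    }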

    // Or following the regexp:

    bool inAfterIContext(const wchar* s, int i)
    {
        bool after_I = false;

        // Regexp as printed (no "*"): exactly one intervening mark.
        if (i >= 2) {
            int cc = getCanonicalCombiningClass(s[i-1]);
            if (s[i-2] == 'I' && !(cc == CCC_ABOVE || cc == 0))
                after_I = true;
        }

        return after_I;
    }
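
    To see where the two readings of After_I diverge, here is a hedged
    sketch that runs both on the same inputs. The combining-class table
    is a toy assumption, std::u32string replaces const wchar*, and the
    names afterI_star, afterI_literal and compareAfterI are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy combining-class table (assumption for this sketch only).
static int combiningClass(char32_t c) {
    switch (c) {
        case 0x0300: case 0x0301: case 0x0303: case 0x0307:
            return 230;                       // marks above
        case 0x0323:
            return 220;                       // dot below
        case 0x0327: case 0x0328:
            return 202;                       // cedilla, ogonek
        default:
            return 0;
    }
}

// Reading 1: [I] ([^\p{cc=230} \p{cc=0}])*  (zero or more marks)
bool afterI_star(const std::u32string& s, std::size_t i) {
    for (std::size_t j = i; j-- > 0;) {
        if (s[j] == U'I') return true;
        int cc = combiningClass(s[j]);
        if (cc == 230 || cc == 0) return false;
    }
    return false;
}

// Reading 2: [I] ([^\p{cc=230} \p{cc=0}])  (exactly one mark, as printed)
bool afterI_literal(const std::u32string& s, std::size_t i) {
    if (i < 2 || i > s.size()) return false;
    int cc = combiningClass(s[i - 1]);
    return s[i - 2] == U'I' && cc != 230 && cc != 0;
}

void compareAfterI() {
    // U+0307 directly after I: only the "*" reading matches.
    assert(afterI_star(U"I\u0307", 1));
    assert(!afterI_literal(U"I\u0307", 1));
    // One intervening class 220 mark: both readings match.
    assert(afterI_star(U"I\u0323\u0307", 2));
    assert(afterI_literal(U"I\u0323\u0307", 2));
    // Two intervening marks: only the "*" reading matches.
    assert(afterI_star(U"I\u0323\u0327\u0307", 3));
    assert(!afterI_literal(U"I\u0323\u0327\u0307", 3));
    // Intervening class 230 mark: neither reading matches.
    assert(!afterI_star(U"I\u0301\u0307", 2));
    assert(!afterI_literal(U"I\u0301\u0307", 2));
}
```

    Under the regexp as printed, a COMBINING DOT ABOVE directly after I
    would not count as After_I, which suggests the "*" may be intended.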


