From: Theo Veenker (Theo.Veenker@let.uu.nl)
Date: Thu Mar 24 2005 - 08:32:09 CST
Could someone who has implemented the functions for detecting the
case context of a character in a string please look at the code
below? This is for 4.1.0.
The descriptions for Final_Sigma and Before_Dot are clear to me. For
After_Soft_Dotted, More_Above and After_I, I don't see how the descriptions
and the regexps represent *exactly* the same thing: I don't
see the \p{cc=0} parts reflected in the descriptions. Also, isn't the
After_I regexp missing a "*"?
The functions below represent what I make of the descriptions and the
regexps. Are they correct?
TIA,
Theo
Final_Sigma
C is preceded by a sequence consisting of a cased letter and a
case-ignorable sequence, and C is not followed by a sequence
consisting of an ignorable sequence and then a cased letter.
Regexp Before C: \p{cased} (\p{case-ignorable})*
Regexp After C: ! ( (\p{case-ignorable})* \p{cased} )
bool inFinalSigmaContext(const wchar* s, int i)
{
    bool after_cased = false;
    int j = i;
    while (--j >= 0) {
        if (isCased(s[j])) {
            after_cased = true;
            break;
        }
        if (!isCaseIgnorable(s[j])) break;
    }
    if (!after_cased) return false;
    bool before_cased = false;
    while (s[++i]) {
        if (isCased(s[i])) {
            before_cased = true;
            break;
        }
        if (!isCaseIgnorable(s[i])) break;
    }
    return !before_cased;
}
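To try the Final_Sigma check on sample input, a minimal self-contained sketch could look like the following. The bodies of isCased and isCaseIgnorable are toy stand-ins invented for this illustration (an ASCII and rough Greek range, and apostrophe plus U+0300..U+036F); they are not the real Unicode property data. char32_t is used here in place of the wchar type above, which is also an assumption.

```cpp
// Toy property lookups (hypothetical, for illustration only). A real
// implementation would consult the Unicode Character Database.
static bool isCased(char32_t c)
{
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') ||
           (c >= 0x0391 && c <= 0x03C9);  // rough Greek letter range
}

static bool isCaseIgnorable(char32_t c)
{
    return c == U'\'' || (c >= 0x0300 && c <= 0x036F);
}

// Same logic as the function above, on a NUL-terminated char32_t string.
bool inFinalSigmaContext(const char32_t* s, int i)
{
    // Scan backwards: skip case-ignorables, require a cased letter.
    bool after_cased = false;
    for (int j = i - 1; j >= 0; --j) {
        if (isCased(s[j])) { after_cased = true; break; }
        if (!isCaseIgnorable(s[j])) break;
    }
    if (!after_cased) return false;

    // Scan forwards: skip case-ignorables, reject if a cased letter follows.
    while (s[++i]) {
        if (isCased(s[i])) return false;
        if (!isCaseIgnorable(s[i])) break;
    }
    return true;
}
```

With these toy tables, inFinalSigmaContext(U"\u0391\u03A3", 1) is true (sigma at the end of a word), while inFinalSigmaContext(U"\u0391\u03A3\u0391", 1) is false because a cased letter follows.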
After_Soft_Dotted
Character C is in After_Soft_Dotted context if the last preceding
character with a combining class of zero before C was Soft_Dotted,
and there is no intervening combining character class 230 (ABOVE).
Regexp Before C: [[\p{Soft_Dotted}] ([^\p{cc=230} \p{cc=0}])*
(I assume the leading "[[" is a typo.)
bool inAfterSoftDottedContext(const wchar* s, int i)
{
    bool after_SD = false;
    while (--i >= 0) {
        if (isSoftDotted(s[i])) {
            after_SD = true;
            break;
        }
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE || cc == 0) break;
    }
    return after_SD;
}
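One possible reading of the \p{cc=0} exclusion in the regexp is that stopping the backward scan at the first class-0 character is what expresses "the last preceding character with a combining class of zero" in the description. The sketch below makes that visible on a few inputs. The property tables are toy assumptions for this example only (i, j and U+012F as Soft_Dotted; U+0300..U+0314 as class 230; U+0323 as class 220; everything else as class 0), and char32_t stands in for wchar.

```cpp
const int CCC_ABOVE = 230;

// Toy property lookups (hypothetical, for illustration only).
static bool isSoftDotted(char32_t c)
{
    return c == U'i' || c == U'j' || c == 0x012F;
}

static int getCanonicalCombiningClass(char32_t c)
{
    if (c >= 0x0300 && c <= 0x0314) return 230;  // marks above
    if (c == 0x0323) return 220;                 // combining dot below
    return 0;
}

bool inAfterSoftDottedContext(const char32_t* s, int i)
{
    while (--i >= 0) {
        // Soft_Dotted characters are tested first, before the cc == 0 stop.
        if (isSoftDotted(s[i])) return true;
        int cc = getCanonicalCombiningClass(s[i]);
        // A mark above, or any other base character, ends the context.
        if (cc == CCC_ABOVE || cc == 0) return false;
    }
    return false;
}
```

With these tables, the dot above in U"i\u0323\u0307" (index 2) is in context, the one in U"i\u0301\u0307" is not because a class-230 mark intervenes, and the one in U"ia\u0307" is not because 'a', not 'i', is the last preceding class-0 character.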
More_Above
C is followed by one or more characters of combining class 230 (ABOVE)
in the combining character sequence.
Regexp After C: [^\p{cc=0}]* [\p{cc=230}]
bool inMoreAboveContext(const wchar* s, int i)
{
    bool more_above = false;
    while (s[++i]) {
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE) {
            more_above = true;
            break;
        }
        if (cc == 0) break;
    }
    return more_above;
}
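For More_Above, the regexp lets any character with a non-zero combining class sit between C and the class-230 mark, and the cc == 0 test is what keeps the scan inside the combining character sequence. A small sketch for checking that behaviour follows. The combining class table is a toy assumption for this example (U+0300..U+0314 as 230, U+0323 as 220, everything else 0), and char32_t stands in for wchar.

```cpp
const int CCC_ABOVE = 230;

// Toy combining class lookup (hypothetical, for illustration only).
static int getCanonicalCombiningClass(char32_t c)
{
    if (c >= 0x0300 && c <= 0x0314) return 230;  // marks above
    if (c == 0x0323) return 220;                 // combining dot below
    return 0;
}

bool inMoreAboveContext(const char32_t* s, int i)
{
    while (s[++i]) {
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE) return true;  // found a mark above
        if (cc == 0) return false;         // left the combining sequence
    }
    return false;
}
```

With this table, index 0 of U"I\u0323\u0300" is in context (the dot below is skipped), while index 0 of U"Ia\u0300" is not, since the grave belongs to 'a'.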
Before_Dot
Character C is in Before_Dot context if C is followed by combining
dot above (U+0307). Any sequence of characters with a combining class
that is neither 0 nor 230 may intervene between the current character
and the combining dot above.
Regexp After C: ([^\p{cc=230} \p{cc=0}])* [\u0307]
bool inBeforeDotContext(const wchar* s, int i)
{
    bool before_dot = false;
    while (s[++i]) {
        if (s[i] == 0x0307) {
            before_dot = true;
            break;
        }
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE || cc == 0) break;
    }
    return before_dot;
}
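Unlike the More_Above regexp, which lets any non-zero class intervene, the Before_Dot regexp excludes class 230 as well as class 0 between C and U+0307. The sketch below is one way to check that difference on sample strings. The combining class table is a toy assumption for this example (U+0300..U+0314 as 230, U+0323 as 220, everything else 0), and char32_t stands in for wchar.

```cpp
const int CCC_ABOVE = 230;

// Toy combining class lookup (hypothetical, for illustration only).
static int getCanonicalCombiningClass(char32_t c)
{
    if (c >= 0x0300 && c <= 0x0314) return 230;  // marks above
    if (c == 0x0323) return 220;                 // combining dot below
    return 0;
}

bool inBeforeDotContext(const char32_t* s, int i)
{
    while (s[++i]) {
        // U+0307 itself has class 230, so it is tested before the class stop.
        if (s[i] == 0x0307) return true;
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE || cc == 0) return false;
    }
    return false;
}
```

With this table, index 0 of U"I\u0323\u0307" is in context, but index 0 of U"I\u0301\u0307" is not, because the acute (class 230) comes between the I and the dot above.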
After_I
Character C is in After_I context if the last preceding base character
was an uppercase I, and there is no intervening combining character
class 230 (ABOVE).
Regexp Before C: [I] ([^\p{cc=230} \p{cc=0}])
(Shouldn't this be: [I] ([^\p{cc=230} \p{cc=0}])* ?)
bool inAfterIContext(const wchar* s, int i)
{
    bool after_I = false;
    while (--i >= 0) {
        if (s[i] == 'I') {
            after_I = true;
            break;
        }
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE || cc == 0) break;
    }
    return after_I;
}
// Or following the regexp:
bool inAfterIContext(const wchar* s, int i)
{
    bool after_I = false;
    if (i >= 2) {
        int cc = getCanonicalCombiningClass(s[i-1]);
        if (s[i-2] == 'I' && !(cc == CCC_ABOVE || cc == 0))
            after_I = true;
    }
    return after_I;
}
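To see where the two readings of the After_I regexp diverge, they can be put side by side under distinct names. This is only a sketch: afterIStarred and afterILiteral are names made up for this comparison, the combining class table is a toy assumption (U+0300..U+0314 as 230, U+0323 as 220, everything else 0), and char32_t stands in for wchar. It does not settle which reading the standard intends.

```cpp
const int CCC_ABOVE = 230;

// Toy combining class lookup (hypothetical, for illustration only).
static int getCanonicalCombiningClass(char32_t c)
{
    if (c >= 0x0300 && c <= 0x0314) return 230;  // marks above
    if (c == 0x0323) return 220;                 // combining dot below
    return 0;
}

// Reading with a trailing "*": zero or more intervening marks.
bool afterIStarred(const char32_t* s, int i)
{
    while (--i >= 0) {
        if (s[i] == U'I') return true;
        int cc = getCanonicalCombiningClass(s[i]);
        if (cc == CCC_ABOVE || cc == 0) return false;
    }
    return false;
}

// Literal reading of the regexp as printed: exactly one intervening mark.
bool afterILiteral(const char32_t* s, int i)
{
    if (i < 2) return false;
    int cc = getCanonicalCombiningClass(s[i - 1]);
    return s[i - 2] == U'I' && cc != CCC_ABOVE && cc != 0;
}
```

The two agree on U"I\u0323\u0307" (index 2, both true) and on U"I\u0301\u0307" (index 2, both false), but differ on the plain sequence U"I\u0307" (index 1): the starred reading accepts it, while the literal reading rejects it because there is no intervening character at all.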