RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 22 2007 - 15:16:15 CDT

  • Next message: Hans Aberg: "Re: FYI: Regex paper for UTC"

    Hans Aberg wrote:
    > On 13 Oct 2007, at 19:29, Mark Davis wrote:
    >
    > > The complement operation is only discussed in UTS #18 with regard
    > > to character classes, not as a general operation. If you feel the
    > > text is unclear on that point, perhaps you can look it over and
    > > suggest where it could be enhanced. You can file this via the
    > > reporting form.
    >
    > I had a look at this paper, and compared against theory (notation see
    > below):
    >
    > In brief, you can add the language intersection and complement
    > operators to the regular expression language; it is known that they
    > can be reduced to the ordinary regular expression operators (by
    > constructing suitable DFAs), and their restrictions to the character
    > set then agrees with the ordinary set operations on this character
    > set. This should then also solve the problems with Unicode "grapheme
    > clusters" (which I will here call "semantic characters"), because
    > essentially the same technique can be applied to them.
    >
    > Specifically, let C (see below) be the Unicode abstract character
    > set; these are atomic characters relative to the character
    > combination that can also be viewed as characters. If L is a language
    > (i.e., subset of C*), let [L] denote the subset of strings of length
    > 1 in L, i.e. L ∩ C.

    Note that L may contain strings containing strings like a base letter followed by a diacritic, which is canonically equivalent to its precomposed form. Would only the precomposed form would be allowed in [L] ? The definition of "length" is not precise enough. Forme the composed nas precomposed letters should behave identically, ans so their "length" should be 1 in both case. If so, then [L] will contain BOTH the precomposed letter and the sequence of a letter and a diacritic.

    > If S is a subset of C (a character class), then
    > C \ S = [∁S], where ∁S is the language complement of S, i.e. C* \
    > S; and similar for ∩. So writing normal character set expressions
    > using the language set operations and then applying "[...]" produces
    > the expected character class result.

    In fact it's more complex than that. We can make a distinction between a decomposed letter and a precomposed one, by using the \uNNNN notation. For me the decomposed cluster is recognized implicitly as equivalent to the precompoed ONLY if the letter is encoded in the regexp without using this notation; if we want to mark a cluster represented by \NNNN notations, we need a grouping operator around the cluster such as "\q{a\u0301}", otherwise "a\u0301" is still not recognized as a cluster and not equivalent (for regepxs) to "á"

    On the opposite the "á" could be decomposed or precomposed but would still designate the same character: "á" in a regexp is a class of equivalent strings with two members, but "a\u0301" is still the concatenation of two regexp elements and will NOT match "á" if it is not explicitly grouped. This preserves the compatibility with POSIX regexps and makes sense in expressions like [\u0000-\uFFFE] that will match "á" in precomposed form but NOT in decomposed form.

    If I write "\q{a\0301}", what I mean is that I want to match all equivalents (if you think that using "\q" for that is not the correction option, or is incompatible with your interpretation, define another operator to mean this; for now I accept it not only within "[...]" class definitions but also everywhere in the regexp, so that "[xy-z]" is always treated as equivalent to "(x|y|...|y)".

    The concept of "length" equal to 1 is defined only from the initial representation of the regexp, and not in terms of length of the strings matched by it (because a regexp element of length 1 is a class of strings that may be different on the set of languages instances it matches).

    I make a clear distinction between the length of the elements in a regexp and the length of the elements it matches in the target language, only to preserve the POSIX semantics, and allowing to perform searches about specific normalized or non-normalized forms. A search for [\u0300-\u03FF] for example will only match the decomposed accents in the target string, but not if they are encoded implicitly in some precomposed form: this is possible because I DON'T force the input string into NFD form before processing it (for me it's not reliable and it will return false matches, and will alter the input text if I want to use a regexp to perform automated substitutions).

    Suppose you have an input text "café" encoded in NFC form. It contains 4 characters, not 5. I perform an automated search and replace to remove all non-base characters, for example I want to substitute an non composed acute accent by a grave accent; this search isnot supposed to alter the lenth of the text, so the text MUST not be preprocessed into NFD form: if it is preprocessed, then "café" will be seen as containing 5 elements, the regexp will match the accent, and the substitution will apply to it. I will get finally a "café" followed by the grave accent. The encoded length of the text has been changed unexpectedly.

    What is important to note here: clusters are not recognized magically; the only clusters that I accept to recognize automatically are those coming from the canonical equivalence of regexp strings WITHOUT interpreting the operators it contains (so without interpreting the "\uNNNN" notation which is perceived as a sequence of 6 distinct units at this level). If I want to specify other clustering, I have to use an explicit grouping operator or that, because it is not implied by any Unicode normalization of the regexp string.

    In other words, in my language, [L] contains ALL Unicode codepoints, separately in singleton classes containing only them, plus the separate classes for the decomposable characters that have multiple representations.
    The singleton class matching only "á" in NFC is NOT represented by the regexp "á", but by the regexp "\u00E1". The regexp "á" is parsed on the opposite as the non singleton class containing all its canonical equivalents.

    If I look for matches by the regexp "[a-z]", I am not expecting to find matches within "á" UNLESS this is in decomposed form. But I will find a match with the regexp "[a-z]" when the input text "á" is in decomposed form.

    To make this consistant. I am then NOT performing any normalization of the input text, but I am effectively normalizing the regexp. The canonically equivalent forms of the input text will be matched ONLYif the regexp is using a form where it matches them, but the regexp containing "\uNNNN" will NOT recognize canonical equivalents by default.



    This archive was generated by hypermail 2.1.5 : Mon Oct 22 2007 - 15:19:44 CDT