Re: FYI: Regex paper for UTC

From: Hans Aberg (haberg@math.su.se)
Date: Mon Oct 22 2007 - 08:07:28 CDT

  • Next message: Philippe Verdy: "RE: FYI: Regex paper for UTC"

    On 13 Oct 2007, at 19:29, Mark Davis wrote:

    > The complement operation is only discussed in UTS #18 with regard
    > to character classes, not as a general operation. If you feel the
    > text is unclear on that point, perhaps you can look it over and
    > suggest where it could be enhanced. You can file this via the
    > reporting form.

    I had a look at this paper, and compared against theory (notation see
    below):

    In brief, you can add the language intersection and complement
    operators to the regular expression language; it is known that they
    can be reduced to the ordinary regular expression operators (by
    constructing suitable DFAs), and their restrictions to the character
    set then agrees with the ordinary set operations on this character
    set. This should then also solve the problems with Unicode "grapheme
    clusters" (which I will here call "semantic characters"), because
    essentially the same technique can be applied to them.

    Specifically, let C (see below) be the Unicode abstract character
    set; these are atomic characters relative to the character
    combination that can also be viewed as characters. If L is a language
    (i.e., subset of C*), let [L] denote the subset of strings of length
    1 in L, i.e. L ∩ C. If S is a subset of C (a character class), then
    C \ S = [∁S], where ∁S is the language complement of S, i.e. C* \
    S; and similar for ∩. So writing normal character set expressions
    using the language set operations and then applying "[...]" produces
    the expected character class result.

    Now do the same thing for the Unicode semantic characters, a subset U
    of C* together with an equivalence relation ~. A normal form is a
    choice of equivalence class representative(s). If L is a language,
    let {L} denote the set of strings of semantic length 1, i.e. L ∩ U.
    If T is a semantic character class, i.e. a subset of U, then U \ T =
    {∁T}. So again, writing normal character set expressions using the
    language set operations and then applying "{...}" produces the
    expected character class result.

    It might also be useful with operators "[...]_K" and "{...}_K", where
    K is a natural number or natural number set (range). In the latter
    case, one should define a semantic string length 𝓊(x) on strings x
    in C* which the index selects, i.e. {L}_K is the set {x ∈ C*| x ∈
    L ∧ 𝓊(x) ∈ K}.

       Hans Åberg

    ----
    Call the set C is the character set (or vocabulary).
    Let 𝓟(C) be the power set (set of subsets), called the set of  
    character classes. 𝓟(C) has the natural set operations: union ∪,  
    intersection ∩, difference ∖, complement ∁, symmetric  
    difference ∆; and has the members: the empty set ∅, and the full  
    character class C.
    Let C* denote the free monoid on C, i.e., the set of strings from C,  
    including the empty string ε. The multiplication ⋅ in C* is also  
    called string concatenation. The unit is ε
    Every x ∈ C* has a length 𝓵(x).
    A language L with vocabulary C defined to be a subset of C*; let 𝓛 
    (C) ≔ 𝓟(C*) denote the set of all languages with vocabulary C.  
    The empty language ∅ should not be confused with the language ε  
    whose only member is the empty string. If L is a language, and k a  
    natural integer, let L_k denote the set of strings of length k.  
    Unless otherwise said, all languages here have vocabulary C.
    If L, M ∈ 𝓛(C) are two languages, and S a subset of the set of  
    natural numbers, define the following set operations:
      (complement)    ∁L ≔ {x ∈ C*| x ∉ L}
      (union)         L ∪ M ≔ {x ∈ C*| x ∈ L ∨ x ∈ M}
      (intersection)  L ∩ M ≔ {x ∈ C*| x ∈ L ∧ x ∈ M}
      (difference)    L \ M ≔ {x ∈ C*| x ∈ L ∧ x ∉ M}
      (symmetric difference)  L ∆ M ≔ (L \ M)∪(M \ L)
    And the following monoid operations:
      (concatenation)  L ⋅ M ≔ {x⋅y|x ∈ L ∨ y ∈ M}
      (exponent)       L^k ≔ {x_1⋅...⋅x_k|x_i ∈ L}
      (monoid closure)   L* ≔ ∪_i=0^∞ L^i    — zero or more  
    concatenations of L
      (positive closure) L+ ≔ ∪_i=1^∞ L^i    — one or more  
    concatenations of L
      (S closure)        L^S ≔ ∪_i∈S L^i = {x| x ∈ L^i, for some  
    i ∈ S}
    The monoid closure of L is the submonoid generated by L, and also the  
    smallest submonoid containing L. The monoid closure is also called  
    the Kleene closure.
    A regular expression language contains symbols representing elements  
    from C, 𝓟(C), C*, the set and monoid operations above, plus  
    parenthesizes, typically a minimum: C, ε, ∪, ⋅, *, ( and ), with  
    the semantics as indicated above, with values the languages in 𝓛 
    (C), operations on such languages.
    ----
    


    This archive was generated by hypermail 2.1.5 : Mon Oct 22 2007 - 08:10:47 CDT