**From:** Hans Aberg (*haberg@math.su.se*)

**Date:** Mon Oct 22 2007 - 08:07:28 CDT

**Previous message:**Frank Ellermann: "purl.net/net/cp"**In reply to:**Mark Davis: "Re: FYI: Regex paper for UTC"**Next in thread:**Philippe Verdy: "RE: FYI: Regex paper for UTC"**Reply:**Philippe Verdy: "RE: FYI: Regex paper for UTC"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

On 13 Oct 2007, at 19:29, Mark Davis wrote:

*> The complement operation is only discussed in UTS #18 with regard
*

*> to character classes, not as a general operation. If you feel the
*

*> text is unclear on that point, perhaps you can look it over and
*

*> suggest where it could be enhanced. You can file this via the
*

*> reporting form.
*

I had a look at this paper, and compared against theory (notation see

below):

In brief, you can add the language intersection and complement

operators to the regular expression language; it is known that they

can be reduced to the ordinary regular expression operators (by

constructing suitable DFAs), and their restrictions to the character

set then agrees with the ordinary set operations on this character

set. This should then also solve the problems with Unicode "grapheme

clusters" (which I will here call "semantic characters"), because

essentially the same technique can be applied to them.

Specifically, let C (see below) be the Unicode abstract character

set; these are atomic characters relative to the character

combination that can also be viewed as characters. If L is a language

(i.e., subset of C*), let [L] denote the subset of strings of length

1 in L, i.e. L ∩ C. If S is a subset of C (a character class), then

C \ S = [∁S], where ∁S is the language complement of S, i.e. C* \

S; and similar for ∩. So writing normal character set expressions

using the language set operations and then applying "[...]" produces

the expected character class result.

Now do the same thing for the Unicode semantic characters, a subset U

of C* together with an equivalence relation ~. A normal form is a

choice of equivalence class representative(s). If L is a language,

let {L} denote the set of strings of semantic length 1, i.e. L ∩ U.

If T is a semantic character class, i.e. a subset of U, then U \ T =

{∁T}. So again, writing normal character set expressions using the

language set operations and then applying "{...}" produces the

expected character class result.

It might also be useful with operators "[...]_K" and "{...}_K", where

K is a natural number or natural number set (range). In the latter

case, one should define a semantic string length 𝓊(x) on strings x

in C* which the index selects, i.e. {L}_K is the set {x ∈ C*| x ∈

L ∧ 𝓊(x) ∈ K}.

Hans Åberg

---- Call the set C is the character set (or vocabulary). Let 𝓟(C) be the power set (set of subsets), called the set of character classes. 𝓟(C) has the natural set operations: union ∪, intersection ∩, difference ∖, complement ∁, symmetric difference ∆; and has the members: the empty set ∅, and the full character class C. Let C* denote the free monoid on C, i.e., the set of strings from C, including the empty string ε. The multiplication ⋅ in C* is also called string concatenation. The unit is ε Every x ∈ C* has a length 𝓵(x). A language L with vocabulary C defined to be a subset of C*; let 𝓛 (C) ≔ 𝓟(C*) denote the set of all languages with vocabulary C. The empty language ∅ should not be confused with the language ε whose only member is the empty string. If L is a language, and k a natural integer, let L_k denote the set of strings of length k. Unless otherwise said, all languages here have vocabulary C. If L, M ∈ 𝓛(C) are two languages, and S a subset of the set of natural numbers, define the following set operations: (complement) ∁L ≔ {x ∈ C*| x ∉ L} (union) L ∪ M ≔ {x ∈ C*| x ∈ L ∨ x ∈ M} (intersection) L ∩ M ≔ {x ∈ C*| x ∈ L ∧ x ∈ M} (difference) L \ M ≔ {x ∈ C*| x ∈ L ∧ x ∉ M} (symmetric difference) L ∆ M ≔ (L \ M)∪(M \ L) And the following monoid operations: (concatenation) L ⋅ M ≔ {x⋅y|x ∈ L ∨ y ∈ M} (exponent) L^k ≔ {x_1⋅...⋅x_k|x_i ∈ L} (monoid closure) L* ≔ ∪_i=0^∞ L^i — zero or more concatenations of L (positive closure) L+ ≔ ∪_i=1^∞ L^i — one or more concatenations of L (S closure) L^S ≔ ∪_i∈S L^i = {x| x ∈ L^i, for some i ∈ S} The monoid closure of L is the submonoid generated by L, and also the smallest submonoid containing L. The monoid closure is also called the Kleene closure. A regular expression language contains symbols representing elements from C, 𝓟(C), C*, the set and monoid operations above, plus parenthesizes, typically a minimum: C, ε, ∪, ⋅, *, ( and ), with the semantics as indicated above, with values the languages in 𝓛 (C), operations on such languages. ----

**Next message:**Philippe Verdy: "RE: FYI: Regex paper for UTC"**Previous message:**Frank Ellermann: "purl.net/net/cp"**In reply to:**Mark Davis: "Re: FYI: Regex paper for UTC"**Next in thread:**Philippe Verdy: "RE: FYI: Regex paper for UTC"**Reply:**Philippe Verdy: "RE: FYI: Regex paper for UTC"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

*
This archive was generated by hypermail 2.1.5
: Mon Oct 22 2007 - 08:10:47 CDT
*