# Re: FYI: Regex paper for UTC

From: Hans Aberg (haberg@math.su.se)
Date: Mon Oct 22 2007 - 08:07:28 CDT

• Next message: Philippe Verdy: "RE: FYI: Regex paper for UTC"

On 13 Oct 2007, at 19:29, Mark Davis wrote:

> The complement operation is only discussed in UTS #18 with regard
> to character classes, not as a general operation. If you feel the
> text is unclear on that point, perhaps you can look it over and
> suggest where it could be enhanced. You can file this via the
> reporting form.

I had a look at this paper, and compared against theory (notation see
below):

In brief, you can add the language intersection and complement
operators to the regular expression language; it is known that they
can be reduced to the ordinary regular expression operators (by
constructing suitable DFAs), and their restrictions to the character
set then agrees with the ordinary set operations on this character
set. This should then also solve the problems with Unicode "grapheme
clusters" (which I will here call "semantic characters"), because
essentially the same technique can be applied to them.

Specifically, let C (see below) be the Unicode abstract character
set; these are atomic characters relative to the character
combination that can also be viewed as characters. If L is a language
(i.e., subset of C*), let [L] denote the subset of strings of length
1 in L, i.e. L ∩ C. If S is a subset of C (a character class), then
C \ S = [∁S], where ∁S is the language complement of S, i.e. C* \
S; and similar for ∩. So writing normal character set expressions
using the language set operations and then applying "[...]" produces
the expected character class result.

Now do the same thing for the Unicode semantic characters, a subset U
of C* together with an equivalence relation ~. A normal form is a
choice of equivalence class representative(s). If L is a language,
let {L} denote the set of strings of semantic length 1, i.e. L ∩ U.
If T is a semantic character class, i.e. a subset of U, then U \ T =
{∁T}. So again, writing normal character set expressions using the
language set operations and then applying "{...}" produces the
expected character class result.

It might also be useful with operators "[...]_K" and "{...}_K", where
K is a natural number or natural number set (range). In the latter
case, one should define a semantic string length 𝓊(x) on strings x
in C* which the index selects, i.e. {L}_K is the set {x ∈ C*| x ∈
L ∧ 𝓊(x) ∈ K}.

Hans Åberg

```----
Call the set C is the character set (or vocabulary).
Let 𝓟(C) be the power set (set of subsets), called the set of
character classes. 𝓟(C) has the natural set operations: union ∪,
intersection ∩, difference ∖, complement ∁, symmetric
difference ∆; and has the members: the empty set ∅, and the full
character class C.
Let C* denote the free monoid on C, i.e., the set of strings from C,
including the empty string ε. The multiplication ⋅ in C* is also
called string concatenation. The unit is ε
Every x ∈ C* has a length 𝓵(x).
A language L with vocabulary C defined to be a subset of C*; let 𝓛
(C) ≔ 𝓟(C*) denote the set of all languages with vocabulary C.
The empty language ∅ should not be confused with the language ε
whose only member is the empty string. If L is a language, and k a
natural integer, let L_k denote the set of strings of length k.
Unless otherwise said, all languages here have vocabulary C.
If L, M ∈ 𝓛(C) are two languages, and S a subset of the set of
natural numbers, define the following set operations:
(complement)    ∁L ≔ {x ∈ C*| x ∉ L}
(union)         L ∪ M ≔ {x ∈ C*| x ∈ L ∨ x ∈ M}
(intersection)  L ∩ M ≔ {x ∈ C*| x ∈ L ∧ x ∈ M}
(difference)    L \ M ≔ {x ∈ C*| x ∈ L ∧ x ∉ M}
(symmetric difference)  L ∆ M ≔ (L \ M)∪(M \ L)
And the following monoid operations:
(concatenation)  L ⋅ M ≔ {x⋅y|x ∈ L ∨ y ∈ M}
(exponent)       L^k ≔ {x_1⋅...⋅x_k|x_i ∈ L}
(monoid closure)   L* ≔ ∪_i=0^∞ L^i    — zero or more
concatenations of L
(positive closure) L+ ≔ ∪_i=1^∞ L^i    — one or more
concatenations of L
(S closure)        L^S ≔ ∪_i∈S L^i = {x| x ∈ L^i, for some
i ∈ S}
The monoid closure of L is the submonoid generated by L, and also the
smallest submonoid containing L. The monoid closure is also called
the Kleene closure.
A regular expression language contains symbols representing elements
from C, 𝓟(C), C*, the set and monoid operations above, plus
parenthesizes, typically a minimum: C, ε, ∪, ⋅, *, ( and ), with
the semantics as indicated above, with values the languages in 𝓛
(C), operations on such languages.
----
```

This archive was generated by hypermail 2.1.5 : Mon Oct 22 2007 - 08:10:47 CDT