**From:** Philippe Verdy (*verdy_p@wanadoo.fr*)

**Date:** Sat Oct 06 2007 - 22:36:49 CDT

**Previous message:**N. Ganesan: "Blackberry"**In reply to:**Mike: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"**Next in thread:**Dominikus Scherkl: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"**Reply:**Dominikus Scherkl: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

Mike wrote:

*> Mark Davis wrote:
*

*> > If sets don't have an implied ordering, do we need to require a POSIX
*

*> > style longest match, which could be slow?
*

*>
*

*> In a set, I keep track of the strings by their length, so the longest
*

*> match is always found. I don't think you want to backtrack and try a
*

*> shorter string since the longer match is supposed to be treated as a
*

*> unit....
*

Mark, the question is not "if sets don't have an implied ordering". By

definition a set is a unordered thing. There does not exist any ordered set.

Two good questions to ask then are:

(1) Ordering does not affect sets, but the way they are *constructed* using

set constructors like ranges. The ranges that you can insert within a set

constructor are ordered.

So please, don't mix the regexp expressed syntaxically as "[abc]", which is

a set constructor (more exactly a compound constructor), with the set it

builds. In other words "[abc]" is NOT a set but an operation that builds a

set from several constants ('a', 'b', 'c') and operators ('[', ']' and the

side-by-side concatenation, which is here idempotent to a function

application for the right side, and the composition of functions on its left

side).

(1) What are the ur-elements in this set?

(for the definition of what is an ur-element, and how it differs from an

element, look for example in Wikipedia: an element can contain things, an

ur-lement can only be part of a set. A synonym of "ur-element" is "atom"

within some mathematical branches that don't make distinctions between

"complete" and "incomplete" domains, but all incomplete domains, where atoms

and ur-elements are considered distinct, can be extended as being part of a

complete domain, where they don't differ.)

To determine which are the ur-element, the first thing you must determine is

the domain, and you must also determine if the domain is complete (in that

case you must admit the existence of the undetermined element, generally

named "unknown" or "undefined" or "bottom", as a member of ALL sets in this

domain) or not (in that case you need to discriminate between strict and non

strict pattern matching rules).

This is justified mathematically (theorem of Gödel) according to any

self-consistent and complete logic system: to make the system complete, the

existence of the undetermined element within the domain is absolutely

necessary. Its main property is that it is undecidable (either true or

false), infinite (if you use finite inference rules), omnipotent (present in

every set, including all meta-sets or the set of functions acting on

elements of the domain), matchable (using pattern-matching rules),

productive (it transforms a unproductive system that never ends within an

incomplete domain, into a productive system that is guaranteed to terminate

in a finite number of steps when working in the complete domain).

One of the consequence of the Theorem Gödel is that EVERY incomplete system

can be viewed as a consistent part of a complete system. You just need to

add the "bottom" element as a finite and well-behaved ur-element of your

domain).

Another consequence is that a complete domain is necessarily structured (and

not flat) so that the domain can be fully ordered, and that complete domains

are necessarily fully ordered (i.e. in a complete domain, there exists a

binary relation R between any two elements a and b of the domain so that the

binary relation (a R b) always takes a unique value within the domain; if

the domain is complete, it contains the Boolean ur-elements FALSE and TRUE,

and the set of Booleans is {FALSE,TRUE} which also includes implicitly the

BOTTOM value which is also a valid finite member of the empty set ! (Forget

the initial assumption that an empty set cannot contain any finite

ur-element; this does not contradict the concept of set cardinality which is

just a function from any set to the set of integers, for which the empty set

is mapped to zero : the BOTTOM value which is a member of the empty set is

just not counted, something that is possible because the complete domain

allows pattern matching on the BOTTOM value).

Let's return to our discussion: we need a "complete" system that describes

texts, strings, characters, and operations like concatenation, and an

operation between two strings (a regexp and any other text) that maps them

to a boolean. In other words, we want to define predicates in a way that

will be both self-consistent, will not induce contradictions (where the

predicate will not take both the value FALSE and TRUE), and will reply in a

finite time. For this to work as intended you need a BOTTOM value. In

regexps, the BOTTOM value is NOT "." (this matches all finite ur-elements of

the domain, but NOT the BOTTOM value itself).

And we need a clear definition of ur-elements in the domain: to be

consistent, this system must be built so that all strings, all characters,

all operations on them will be definable in terms of ur-elements:

(a) either you decide that characters are all atomic (i.e. ur-elements), and

then strings are simple ordered vectors of atomic elements. So the "."

regexp is will match a single character and nothing else, but every possible

character except BOTTOM. This is the interpretation of strings in classic C

and POSIX locales.

(b) either you decide that characters are NOT atomic, but are in fact

operators working in a larger domain (for example the domain of collation

elements). In that case strings are viewed as vectors of collation elements,

and the concatenation operation used to build texts is inconsistent, and

should not be part of the domain (and this effectively happens in the real

world). To solve the inconsistency, you need to redefine the operation of

concatenation in other terms, and you also need to map the "string" datatype

into something more powerful than a simple element: any string is then not

simple a vector, but a function of your domain, and the operation of

concatenating two strings can be mapped as the operation of composing two

functions of that domain... In that case, the "string" datatype is just a

restriction of the more general "text" datatype to the subset of ur-elements

where there's no interaction between the collation elements that you have

kept. This is the interpretation of texts within humane languages.

This makes a serious difference, but one way to think about the various

complete systems you can build with the latter definition of "string" is

that they are all particular instances of the "text" instance within a

locale context that defines all the possible interactions between characters

and strings in such basic operation like concatenation.

My personal view about regexps needed for standardization is that it is

necessary to define it in a way that is self-consistent, and complete enough

to be as powerful as a Turing system.

This is possible, but this requires a very strict mathematical definition,

using mathematical language theories, and other important results like the

Gödel theorem for completeness criteria, which is also extremely important

for being able to decide about computationability in finite time (or

resources).

I have built my regexp matcher algorithm using such formalism, and one of

the marvellous effects of it, is that the addition of BOTTOM within the

rules of my regexp matcher (so that it becomes matchable) has considerably

simplified the implementation. And exactly the same algorithm implementation

can work on every model for texts, i.e. (a) or (b) above, by unifying the

two and saying that ur-elements are defined according to a locale-context.

Another benefit, is that I no longer need and backtracking, and the system

is perfectly decidable in a finite number of steps (this is guaranteed

because I have built my system to be complete, simply by adding the

possibility of matching BOTTOM in a finite single step of decision).

Suppose that your initial domain is containing the set of characters [ab],

and any character from this set, and any string resulting from the

concatenation of these characters (the operation of concatenating strings is

also part of the domain). This domain is not decidable without adding to it

the two Booleans FALSE and TRUE (that allows writing predicates like

membership), and the basic operation on sets (union, intersection and

negation). The system is also not complete without adding another element to

it, that you can note '?'

Then the basic pattern matching rule for characters need to include:

* BOTTOM `matches` BOTTOM = TRUE

* BOTTOM `matches` . = BOTTOM

* . `matches` BOTTOM = BOTTOM

* 'a' `match` 'a' = TRUE

* 'b' `match` 'b' = TRUE

* _ `match` _ = FALSE

(note that what is noted by _ matches every ur-element except BOTTOM, this

is consistant with the role of "." in regexps, but this is another question,

because this depends on the semantic you assign to "." in regexp, where here

we are creating rules operating on every pair of members of the complete

system, including texts, regexps, and operations on regexps!)

Note that in this formal domain, the top 3 rules are needed because a

self-consistent and complete formal system must be able to describe itself

using its own language, where every function of two elements is described as

a ordered suites of pattern matching clause, with an implicit final clause

that you don't need to specify and that matches all the rest and assigns it

the BOTTOM value.

There exists some computer languages that allow defining such complete

system: Haskell is one of these, however it does not enforce or require the

completeness of functions, so you have to add an explicit pattern matching

clause to your functions definition, or choose between two alternate

syntaxes that make the function complete (non strict) or incomplete

(strict).

The remarkable thing of a complete language (like Haskell) is that it can be

fully described in its own language (but then it does not say how to compile

it in a finite time into a finite object, but Haskell does not need to

compile a program completely so that it becomes productive, because it uses

something else, notably "lazy evaluation" without which it would not work;

lazy evaluation is guided by a limited set of strict pattern matching rules,

allowing the language to work as intended).

(Note that Haskell does not define any syntax for using BOTTOM explicitly

and directly, its "_" notation is not BOTTOM, its implicit "forall

quantifiers" are not BOTTOM, but synonyms for "_", except that they are

given specific named for the purpose of matching different instances of "_"

; Haskell just uses BOTTOM implicitly within the way its pattern matching

system works for type inference, something that is enough for allowing

programs written in Haskell to build an expression that is idempotent to

BOTTOM, so that the absence of enforcement of strictness in Haskell allows

the language to be complete according to Gödel's theorem).

I prefer the "complete" approach when building a regexp matcher: it is much

simpler and allows building a language that is as powerful as the Turing

machine; any incomplete approach would not solve all problems, and could

require algorithms that are undecidable within a finite time because they

lack some possibility, such as not permitting matches on BOTTOM.

Within my implementation, BOTTOM is just implicitly a member of all

"character" classes like "[abc]" (including negated classes like "[^abc]",

or the "any character" class like ".") I don't even need a syntax within

Regexps to match it explicitly (although I could do it using some \notation,

however it is not needed for matching any regexp with texts that don't

contain it; but it could be used to match a regexp with another regexp, and

build a finite algorithm that will decide if two languages are semantically

equivalent).

But BOTTOM appears as being necessary for the implementation (and yes, this

really simplifies things a lot, because now I know some theorems about the

completeness of the language, which is now definable in a finite way, with a

finite number of inference rules and a finite number of ur-elements!)

**Next message:**Dominikus Scherkl: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"**Previous message:**N. Ganesan: "Blackberry"**In reply to:**Mike: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"**Next in thread:**Dominikus Scherkl: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"**Reply:**Dominikus Scherkl: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

*
This archive was generated by hypermail 2.1.5
: Sat Oct 06 2007 - 22:40:27 CDT
*