RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 06 2007 - 22:36:49 CDT

  • Next message: Dominikus Scherkl: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"

    Mike wrote:
    > Mark Davis wrote:
    > > If sets don't have an implied ordering, do we need to require a POSIX
    > > style longest match, which could be slow?
    >
    > In a set, I keep track of the strings by their length, so the longest
    > match is always found. I don't think you want to backtrack and try a
    > shorter string since the longer match is supposed to be treated as a
    > unit....

    Mark, the question is not "if sets don't have an implied ordering". By
    definition a set is a unordered thing. There does not exist any ordered set.

    Two good questions to ask then are:

    (1) Ordering does not affect sets, but the way they are *constructed* using
    set constructors like ranges. The ranges that you can insert within a set
    constructor are ordered.

    So please, don't mix the regexp expressed syntaxically as "[abc]", which is
    a set constructor (more exactly a compound constructor), with the set it
    builds. In other words "[abc]" is NOT a set but an operation that builds a
    set from several constants ('a', 'b', 'c') and operators ('[', ']' and the
    side-by-side concatenation, which is here idempotent to a function
    application for the right side, and the composition of functions on its left
    side).
     
    (1) What are the ur-elements in this set?

    (for the definition of what is an ur-element, and how it differs from an
    element, look for example in Wikipedia: an element can contain things, an
    ur-lement can only be part of a set. A synonym of "ur-element" is "atom"
    within some mathematical branches that don't make distinctions between
    "complete" and "incomplete" domains, but all incomplete domains, where atoms
    and ur-elements are considered distinct, can be extended as being part of a
    complete domain, where they don't differ.)

    To determine which are the ur-element, the first thing you must determine is
    the domain, and you must also determine if the domain is complete (in that
    case you must admit the existence of the undetermined element, generally
    named "unknown" or "undefined" or "bottom", as a member of ALL sets in this
    domain) or not (in that case you need to discriminate between strict and non
    strict pattern matching rules).

    This is justified mathematically (theorem of Gödel) according to any
    self-consistent and complete logic system: to make the system complete, the
    existence of the undetermined element within the domain is absolutely
    necessary. Its main property is that it is undecidable (either true or
    false), infinite (if you use finite inference rules), omnipotent (present in
    every set, including all meta-sets or the set of functions acting on
    elements of the domain), matchable (using pattern-matching rules),
    productive (it transforms a unproductive system that never ends within an
    incomplete domain, into a productive system that is guaranteed to terminate
    in a finite number of steps when working in the complete domain).

    One of the consequence of the Theorem Gödel is that EVERY incomplete system
    can be viewed as a consistent part of a complete system. You just need to
    add the "bottom" element as a finite and well-behaved ur-element of your
    domain).

    Another consequence is that a complete domain is necessarily structured (and
    not flat) so that the domain can be fully ordered, and that complete domains
    are necessarily fully ordered (i.e. in a complete domain, there exists a
    binary relation R between any two elements a and b of the domain so that the
    binary relation (a R b) always takes a unique value within the domain; if
    the domain is complete, it contains the Boolean ur-elements FALSE and TRUE,
    and the set of Booleans is {FALSE,TRUE} which also includes implicitly the
    BOTTOM value which is also a valid finite member of the empty set ! (Forget
    the initial assumption that an empty set cannot contain any finite
    ur-element; this does not contradict the concept of set cardinality which is
    just a function from any set to the set of integers, for which the empty set
    is mapped to zero : the BOTTOM value which is a member of the empty set is
    just not counted, something that is possible because the complete domain
    allows pattern matching on the BOTTOM value).

    Let's return to our discussion: we need a "complete" system that describes
    texts, strings, characters, and operations like concatenation, and an
    operation between two strings (a regexp and any other text) that maps them
    to a boolean. In other words, we want to define predicates in a way that
    will be both self-consistent, will not induce contradictions (where the
    predicate will not take both the value FALSE and TRUE), and will reply in a
    finite time. For this to work as intended you need a BOTTOM value. In
    regexps, the BOTTOM value is NOT "." (this matches all finite ur-elements of
    the domain, but NOT the BOTTOM value itself).

    And we need a clear definition of ur-elements in the domain: to be
    consistent, this system must be built so that all strings, all characters,
    all operations on them will be definable in terms of ur-elements:

    (a) either you decide that characters are all atomic (i.e. ur-elements), and
    then strings are simple ordered vectors of atomic elements. So the "."
    regexp is will match a single character and nothing else, but every possible
    character except BOTTOM. This is the interpretation of strings in classic C
    and POSIX locales.

    (b) either you decide that characters are NOT atomic, but are in fact
    operators working in a larger domain (for example the domain of collation
    elements). In that case strings are viewed as vectors of collation elements,
    and the concatenation operation used to build texts is inconsistent, and
    should not be part of the domain (and this effectively happens in the real
    world). To solve the inconsistency, you need to redefine the operation of
    concatenation in other terms, and you also need to map the "string" datatype
    into something more powerful than a simple element: any string is then not
    simple a vector, but a function of your domain, and the operation of
    concatenating two strings can be mapped as the operation of composing two
    functions of that domain... In that case, the "string" datatype is just a
    restriction of the more general "text" datatype to the subset of ur-elements
    where there's no interaction between the collation elements that you have
    kept. This is the interpretation of texts within humane languages.

    This makes a serious difference, but one way to think about the various
    complete systems you can build with the latter definition of "string" is
    that they are all particular instances of the "text" instance within a
    locale context that defines all the possible interactions between characters
    and strings in such basic operation like concatenation.

    My personal view about regexps needed for standardization is that it is
    necessary to define it in a way that is self-consistent, and complete enough
    to be as powerful as a Turing system.

    This is possible, but this requires a very strict mathematical definition,
    using mathematical language theories, and other important results like the
    Gödel theorem for completeness criteria, which is also extremely important
    for being able to decide about computationability in finite time (or
    resources).

    I have built my regexp matcher algorithm using such formalism, and one of
    the marvellous effects of it, is that the addition of BOTTOM within the
    rules of my regexp matcher (so that it becomes matchable) has considerably
    simplified the implementation. And exactly the same algorithm implementation
    can work on every model for texts, i.e. (a) or (b) above, by unifying the
    two and saying that ur-elements are defined according to a locale-context.

    Another benefit, is that I no longer need and backtracking, and the system
    is perfectly decidable in a finite number of steps (this is guaranteed
    because I have built my system to be complete, simply by adding the
    possibility of matching BOTTOM in a finite single step of decision).

    Suppose that your initial domain is containing the set of characters [ab],
    and any character from this set, and any string resulting from the
    concatenation of these characters (the operation of concatenating strings is
    also part of the domain). This domain is not decidable without adding to it
    the two Booleans FALSE and TRUE (that allows writing predicates like
    membership), and the basic operation on sets (union, intersection and
    negation). The system is also not complete without adding another element to
    it, that you can note '?'

    Then the basic pattern matching rule for characters need to include:
    * BOTTOM `matches` BOTTOM = TRUE
    * BOTTOM `matches` . = BOTTOM
    * . `matches` BOTTOM = BOTTOM
    * 'a' `match` 'a' = TRUE
    * 'b' `match` 'b' = TRUE
    * _ `match` _ = FALSE

    (note that what is noted by _ matches every ur-element except BOTTOM, this
    is consistant with the role of "." in regexps, but this is another question,
    because this depends on the semantic you assign to "." in regexp, where here
    we are creating rules operating on every pair of members of the complete
    system, including texts, regexps, and operations on regexps!)

    Note that in this formal domain, the top 3 rules are needed because a
    self-consistent and complete formal system must be able to describe itself
    using its own language, where every function of two elements is described as
    a ordered suites of pattern matching clause, with an implicit final clause
    that you don't need to specify and that matches all the rest and assigns it
    the BOTTOM value.

    There exists some computer languages that allow defining such complete
    system: Haskell is one of these, however it does not enforce or require the
    completeness of functions, so you have to add an explicit pattern matching
    clause to your functions definition, or choose between two alternate
    syntaxes that make the function complete (non strict) or incomplete
    (strict).

    The remarkable thing of a complete language (like Haskell) is that it can be
    fully described in its own language (but then it does not say how to compile
    it in a finite time into a finite object, but Haskell does not need to
    compile a program completely so that it becomes productive, because it uses
    something else, notably "lazy evaluation" without which it would not work;
    lazy evaluation is guided by a limited set of strict pattern matching rules,
    allowing the language to work as intended).

     (Note that Haskell does not define any syntax for using BOTTOM explicitly
    and directly, its "_" notation is not BOTTOM, its implicit "forall
    quantifiers" are not BOTTOM, but synonyms for "_", except that they are
    given specific named for the purpose of matching different instances of "_"
    ; Haskell just uses BOTTOM implicitly within the way its pattern matching
    system works for type inference, something that is enough for allowing
    programs written in Haskell to build an expression that is idempotent to
    BOTTOM, so that the absence of enforcement of strictness in Haskell allows
    the language to be complete according to Gödel's theorem).

    I prefer the "complete" approach when building a regexp matcher: it is much
    simpler and allows building a language that is as powerful as the Turing
    machine; any incomplete approach would not solve all problems, and could
    require algorithms that are undecidable within a finite time because they
    lack some possibility, such as not permitting matches on BOTTOM.

    Within my implementation, BOTTOM is just implicitly a member of all
    "character" classes like "[abc]" (including negated classes like "[^abc]",
    or the "any character" class like ".") I don't even need a syntax within
    Regexps to match it explicitly (although I could do it using some \notation,
    however it is not needed for matching any regexp with texts that don't
    contain it; but it could be used to match a regexp with another regexp, and
    build a finite algorithm that will decide if two languages are semantically
    equivalent).

    But BOTTOM appears as being necessary for the implementation (and yes, this
    really simplifies things a lot, because now I know some theorems about the
    completeness of the language, which is now definable in a finite way, with a
    finite number of inference rules and a finite number of ur-elements!)



    This archive was generated by hypermail 2.1.5 : Sat Oct 06 2007 - 22:40:27 CDT