RE: FYI: Regex paper for UTC

From: Philippe Verdy (
Date: Sat Oct 13 2007 - 13:07:35 CDT

  • Next message: PCL Institute: "Use of interum PUA encodings for 85 letters"

    Hans Aberg [] wrote:
    > On 13 Oct 2007, at 17:38, Philippe Verdy wrote:
    > >>> Andy and I put together a paper of recommendations for the UTC at
    > >>> .
    > >>
    > >> I suspect you will run into problems when mixing complement with the
    > >> Kleene closure (zero or more concatenations) operator (and other
    > >> operators generating an infinite number of strings in the associated
    > >> language).
    > >>
    > >> So one safe way is to admit the usual set operations on character
    > >> classes only, and letting regular expressions having only the usual
    > >> operators acting on character classes.
    > >
    > > This is only true if you use the assumption that the current state
    > > of the
    > > generated DFA should be represented using a single integer type.
    > I am not sure what the DFAs have here to do.

    This is just to demonstrate that DFA are not strictly needed to get the
    "safe way" to support the "usual set of operations" on something else than
    just character classes.

    > If the RE r generates the language L(r), and ~r is the complement,
    > what is the definition of L(~r)? Specifically:
    > If L() is the empty string, what is L(~)?

    It's the set of non-empty texts. So given the alphabet A (which is not
    necessarily a single character, but could be any localized set of collation
    element) matched by the "." class, it generates L(".+"), in other words:

      RE "~" = RE ".* -- "
              = RE ".+"

    (note: in this definition, IF "." does not match newlines in your regexp
    implementation, then A does not contain any newlines, and the "~" operator
    will not make them part of the matched language; you could still match them
    separately using some syntax like "\R" but such thing has no complement, and
    newlines are not part of the languages recognized by your implementation,
    that will use them only as separators between distinct texts, in fact I like
    seeing "$" as if it was matching internally the characters that make
    newlines while also activating the end-of-text condition: although they will
    be matched, they won't be part of the range of characters returned in all
    matches, and they belong to a lower-level layer before the input; the regexp
    matcher engine will be reset to its initial state after them, as if we were
    processing a new text with multiple invocations on texts that are guaranteed
    in this case not to contain any newlines).

    > What is the definition of L(~(r*))?

    For me it's simply:
    L(".*") - L("r*") = L(".* -- r*")

    where L(".*") is the set of all texts (including the empty text) based on
    alphabet A, and where "--" is the difference operator in regexp syntax.

    This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 13:09:14 CDT