RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 13:07:35 CDT

Next message: PCL Institute: "Use of interum PUA encodings for 85 letters"

Previous message: Mark Davis: "Re: FYI: Regex paper for UTC"
In reply to: Hans Aberg: "Re: FYI: Regex paper for UTC"
Next in thread: Hans Aberg: "Re: FYI: Regex paper for UTC"
Reply: Hans Aberg: "Re: FYI: Regex paper for UTC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Hans Aberg [mailto:haberg@math.su.se] wrote:
> On 13 Oct 2007, at 17:38, Philippe Verdy wrote:
>
> >>> Andy and I put together a paper of recommendations for the UTC at
> >>> http://docs.google.com/Doc?id=dfqr8rd5_32kv97tx .
> >>
> >> I suspect you will run into problems when mixing complement with the
> >> Kleene closure (zero or more concatenations) operator (and other
> >> operators generating an infinite number of strings in the associated
> >> language).
> >>
> >> So one safe way is to admit the usual set operations on character
> >> classes only, and letting regular expressions having only the usual
> >> operators acting on character classes.
> >
> > This is only true if you use the assumption that the current state
> > of the
> > generated DFA should be represented using a single integer type.
>
> I am not sure what the DFAs have here to do.

This is just to demonstrate that DFA are not strictly needed to get the
"safe way" to support the "usual set of operations" on something else than
just character classes.

> If the RE r generates the language L(r), and ~r is the complement,
> what is the definition of L(~r)? Specifically:
>
> If L(ε) is the empty string, what is L(~ε)?

It's the set of non-empty texts. So given the alphabet A (which is not
necessarily a single character, but could be any localized set of collation
element) matched by the "." class, it generates L(".+"), in other words:

RE "~ε" = RE ".* -- ε"
= RE ".+"

(note: in this definition, IF "." does not match newlines in your regexp
implementation, then A does not contain any newlines, and the "~" operator
will not make them part of the matched language; you could still match them
separately using some syntax like "\R" but such thing has no complement, and
newlines are not part of the languages recognized by your implementation,
that will use them only as separators between distinct texts, in fact I like
seeing "$" as if it was matching internally the characters that make
newlines while also activating the end-of-text condition: although they will
be matched, they won't be part of the range of characters returned in all
matches, and they belong to a lower-level layer before the input; the regexp
matcher engine will be reset to its initial state after them, as if we were
processing a new text with multiple invocations on texts that are guaranteed
in this case not to contain any newlines).

> What is the definition of L(~(r*))?

For me it's simply:
L(".*") - L("r*") = L(".* -- r*")

where L(".*") is the set of all texts (including the empty text) based on
alphabet A, and where "--" is the difference operator in regexp syntax.

Next message: PCL Institute: "Use of interum PUA encodings for 85 letters"
Previous message: Mark Davis: "Re: FYI: Regex paper for UTC"
In reply to: Hans Aberg: "Re: FYI: Regex paper for UTC"
Next in thread: Hans Aberg: "Re: FYI: Regex paper for UTC"
Reply: Hans Aberg: "Re: FYI: Regex paper for UTC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 13:09:14 CDT