RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 22 2007 - 16:13:19 CDT

    Hans Aberg [mailto:haberg@math.su.se]
    > On 22 Oct 2007, at 22:16, Philippe Verdy wrote:
    >
    > > Note that L may contain strings containing sequences like a base
    > > letter followed by a diacritic, which is canonically equivalent to
    > > its precomposed form. Would only the precomposed form be allowed
    > > in [L]? The definition of "length" is not precise enough. For me,
    > > the composed and precomposed letters should behave identically,
    > > and so their "length" should be 1 in both cases. If so, then [L]
    > > will contain BOTH the precomposed letter and the sequence of a
    > > letter and a diacritic.
    >
    > Read all the stuff. There are different constructions.

    Try to reformulate your stuff, avoiding the confusion between regexps and
    the strings they match.

    For me, a regexp is not a string, but a function mapping any text to a set
    of matches. For simplicity, though, we first need another object: a
    function that returns just true or false instead of a set of matches, and
    returns true only if there is a full match.

    Let's define it so that:
    Match[r] : String -> Boolean, where r is a character (used as a regexp, but
    one that has no special meaning there)
    Match["a"] (x) = true, if x="a"
    Match["a"] (x) = false, otherwise
    (Here it is just a function that tests whether its argument is canonically
    equivalent to the character r)

    This definition is consistent with the Unicode process conformance rule for
    its argument. But it does not say anything about the syntax actually used
    for the regexp denoted between the brackets.

    Match["a\u0301"] ("a") = false
    Match["a\u0301"] ("\u00E1") = true
    Match["\u00E1"] ("a\u0301") = true
    (note above: there's no "\u" notation in the source, this is a way to refer
    to the actual character only for the definition)
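
    As a rough sketch of the Match definition above (not a reference
    implementation): it assumes that canonical equivalence can be tested by
    comparing NFD forms, and the name match_literal is only illustrative.

```python
import unicodedata


def match_literal(r: str):
    """Build Match[r]: true only for strings canonically equivalent to r.

    Assumption: two strings are canonically equivalent exactly when their
    NFD normalizations are identical.
    """
    target = unicodedata.normalize("NFD", r)

    def match(x: str) -> bool:
        return unicodedata.normalize("NFD", x) == target

    return match


# The three cases listed above.
print(match_literal("a\u0301")("a"))        # False
print(match_literal("a\u0301")("\u00e1"))   # True
print(match_literal("\u00e1")("a\u0301"))   # True
```

    Here the "\u" escapes belong to Python's string syntax only; the function
    itself just sees the actual characters.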

    The definition of a Regexp RE is that it returns a set of matches from its
    input text argument T, each returned match being defined by the
    association of the source text T and an interval of positions (a_i,b_i)
    within that text, such that the substring extracted from the source text
    with this interval satisfies:

    Match[RE]( T.substring(a_i,b_i) ) = true.
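
    A brute-force sketch of that definition, for illustration only: it
    enumerates every interval of code point positions and keeps those whose
    substring satisfies the predicate. The names match_literal and
    all_matches are hypothetical, and the NFD comparison is an assumed way to
    test canonical equivalence.

```python
import unicodedata
from typing import Callable, List, Tuple


def match_literal(r: str) -> Callable[[str], bool]:
    """Match[r]: true only for strings canonically equivalent to r (via NFD)."""
    target = unicodedata.normalize("NFD", r)
    return lambda x: unicodedata.normalize("NFD", x) == target


def all_matches(match: Callable[[str], bool], text: str) -> List[Tuple[int, int]]:
    """Return every interval (a, b) such that match(text[a:b]) is true."""
    n = len(text)
    return [
        (a, b)
        for a in range(n + 1)
        for b in range(a, n + 1)
        if match(text[a:b])
    ]


# T holds both the decomposed and the precomposed form of a-acute.
T = "la\u0301 \u00e1"
print(all_matches(match_literal("\u00e1"), T))  # [(1, 3), (4, 5)]
```

    Both spellings are returned, with intervals of different lengths. Note
    that this sketch lets an interval end between a base letter and its
    diacritic; whether such positions should be allowed is a separate
    question.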

    Then retry formulating your language. There's a clear separation between the
    language of regexps and the language of the strings they match, because
    they don't use the same symbols:
    - the language of strings is U*, where U is the Unicode character set, on
    which two equivalence relations are defined: = (strict equality) and ~
    (canonical equivalence);
    - the language of regexps is (U union R)*, where R is the set of regexp
    operators and U designates literals. This language has NO canonical
    equivalence, except where one is explicitly defined by an operator of R;
    - there's a third language, which results from a surjection of the previous
    language into U*, and this function is the syntax of regexps; this is
    the language that we use to specify regexps like "x.*" (where "." and "*"
    are not interpreted as literal characters, but as operators defining
    classes); there are tons of such languages, but here we don't care much
    about the syntax, we just choose one by convention, and any other language
    would do the same (the difference is only in the syntax, not in its
    interpretation).
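
    To make the separation between the written syntax and the symbols of
    (U union R)* concrete, here is a small sketch that reads one conventional
    syntax into a sequence of literals and operators. Everything in it is an
    assumption made for illustration: the names Lit, Op and read_syntax, the
    two operators, and the backslash escape are not part of any proposal
    discussed here.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Lit:
    char: str  # an element of U, taken literally


@dataclass(frozen=True)
class Op:
    name: str  # an element of R, a regexp operator


# Hypothetical syntax: "." and "*" denote operators, "\" escapes a literal.
OPERATORS = {".": "any", "*": "star"}


def read_syntax(source: str) -> List[Union[Lit, Op]]:
    """Map a written regexp (a plain string in U*) to symbols of (U union R)*."""
    symbols: List[Union[Lit, Op]] = []
    i = 0
    while i < len(source):
        c = source[i]
        if c == "\\" and i + 1 < len(source):
            symbols.append(Lit(source[i + 1]))  # escaped: always a literal
            i += 2
        elif c in OPERATORS:
            symbols.append(Op(OPERATORS[c]))
            i += 1
        else:
            symbols.append(Lit(c))
            i += 1
    return symbols


print(read_syntax("x.*"))    # literal x, operator any, operator star
print(read_syntax("x\\.*"))  # literal x, literal ".", operator star
```

    A different written syntax would only change read_syntax; the sequence of
    Lit and Op symbols, and so the interpretation, would stay the same.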


