RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 01 2007 - 00:50:00 CST

  • Next message: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"

    > > So a regexp /ch/ or /\q{ch}/ matches the same
    > > thing, the former being more efficient than the second one because it
    > does
    > > not alter the input universe.
    >
    > The \q{} syntax is only valid (or needed) within square brackets, so
    > /\q{ch}/ is not a valid regular expression. Again, merely using \q
    > should not have side effects.

    If I used // delimiters it's only to show where the regexp parts are within
    the message. But also to mean that it makes no difference if a "collation
    element" (not a "character class" as you call it) if present within []
    brackets or outside of it, provided that it matches as a single item with
    "."

    In that case, you can of course remove \q{} outside of [] because the
    complete regexp: /\q{something}/ will match exactly the same thing as the
    complete regexp: /something/.

    If I used \q{} in a such a way, it was just to show where the delimitation
    of units (that would be matched by a single ".") would occur, without
    introducing a new notation, because this use was consistant with the use of
    the same unit as part of a [set], allowing me to say that [set] becomes
    completely equivalent to (s|e|t) and [a-z\q{ch}] becoming also completely
    equivalent to (a|b|c|...|y|z|\q{ch})
    (note that the "..." ellipsis just above is a local shortcut, not intended
    to be three occurrences of the "." special character)

    But note that with my notation /\q{ch}./ would NOT be equivalent to /ch./
    - the latter regexp will match only 3 characters: /c/ followed by /h/
    followed by what /./ matches by default (i.e. [\u0000-\u10FFF] minus the set
    of line terminators, which depends on the single line or multi-line mode in
    effect, and that I'll note \R).
    - the former regexp extends the input universe (matched by ".") by making it
    [\u0000-\10FFF\q{ch}] (so that it now contains /c/ or /h/ or the sequence
    /ch/).

    With this notation, then /[^\q{ch}]/ makes sense because it is effectively
    matches now any element of the set /./ minus line terminators and the
    sequence /ch/, but /[^\q{ch}]./ will not match the string "ch", even though
    it could be interpreted as the sequence /c/ followed by /h/, where:
    - the first /c/ would match [^\q{ch}] defined as /./ minus /\q{xh}/, i.e.
    now the extended set U = ([\x0-\x10FFF] minus (\R|\q{ch})) where /c/ is
    effectively present
    - the second /h/ would match /./ i.e. the same set U as above

    The reason being that the input universe U has been extended, and the inout
    lexer will return a single unit \q{ch} for every sequence /ch/ occurring in
    the input file but no separated /c/ and /h/ units in that order.

    I spoke about the way to limit the extension of the input universe: I said
    that it would depend on the input locale, that could also be delimited in an
    expression:

    So /./ in the default C locale would be the set [\x0-\x10FFF] of all Unicode
    codepoints minus \R, by default unless it is extended by occurrences of
    \q{strings} within the same locale context. One can revert back to the
    default set by delimiting locales in a way similar to /(?locale=C:.)/

    If you start tour regexp engine matches within a Spanish locale /./ will
    match by default the same thing as /{?locale=es:.)/ but you could still use
    simply the regexp /./ if this locale is simply specified externally or this
    is the user's locale. In this Spanish locale (just a sample), every
    occurrence of /ch/ would be recognized according to the locale definition of
    its alphabet, as a single unbreakable unit, even though it would still
    contain the separate /c/ and /h/ items when they are matchable isolately
    only when they don't occur in sequence in the input file(s) to scan.

    The proposed notation is then not based on individual characters: the /./
    regexp as well as [set] and [^set] will not necessarily match a single
    "character" but a single "collation element" in the current locale context.

    The current locale context is modified everywhere in the regexp by the
    inclusion of new collation elements with \q{}, except in parts of the regexp
    that are explicitly delimited by an explicit locale delimitation using:
     /(?locale=<code>:<rest of regexp>)/

    What this means is that:
    - An English user (as perceived by its user locale defining the default
    context of search) searching for /[a-cd-z]./ will get a match for the string
    "char", because its alphabet makes no difference between [a-z] and [a-cd-z]
    - A Spanish user will not get a match in "char" witch the same regexp
    /[a-cd-z]./, if the Spanish locale is defined with the ordered collation
    elements {[a-z], \q{ch}} or as the ordered collation elements {[a-c], q{ch},
    [d-z]}
    - The English (or Spanish user) that wants explicitly the behaviour of the
    regexp matches found in the Spanish locale, can specify it explicitly by
    using the regexp with the desired locale: /(?locale=es:[a-cd-z].)/ (or by
    specifying the Spanish locale explicitly in an external parameter or flag to
    the regexp engine
    - if these users want to get the default behaviour, they just need the same
    thing by replacing "es" by "C" in the the previous regexp, or in the
    explicit locale parameter when invoking the regexp compiler engine.
    - Mixed locales are possible to limit the extension (by the \q{...}
    notation) of the active ordered set of collation elements. In an expression
    like /(?locale=C:[^\q{ch}])./, the last occurrence of "." will only match
    one collation element of the default user's locale, because the extension
    made by \q{ch} in the part of the regexp explicitly delimted by the locale
    designation does not extend before or after this delimitation everywhere
    else in the regexp: the explicit locale designation defines a scope for the
    modification of the current ordered set of collation elements.

    Now it could be time to define how the use of a collation element \q{ch}
    modifies the current ordered set of collation elements: if \q{ch} is already
    contained in the current set (independently of its order), then the ordered
    set is not altered.

    Otherwise, the ordered set is extended by adding the collation element at
    end of the ordered set, with a primary difference.
    If this position in the ordered set is not suitable, one could specify
    another order by introducing a collation rule in the locale delimitation
    scope, for example (?locale=C;ch<d: ...) which first redefines the current
    set of collation element as the default set used in the C locale (i.e. only
    the binary order of singe-codepoint, then performs a collation tailoring by
    introducing the collation element \q{ch}, a contraction, before \q{d}, a
    collation element with a single codepoint, using primary difference.

    The "C" locale (binary order of single codepoints) would be predefined, as
    well as a "U" locale for the Default Unicode order defined in the DUCET
    (which already contains some special collation elements defined as
    contractions).

    The current set of collation is element is ordered and not a simple
    unordered set, so that ranges used in regexps like /[a-z]/ make sense (this
    "set" notation using a range does not mean just the single characters whose
    code points are between those assigned to "a" and "z" inclusively, but the
    subset of collation elements that are ordered between the collation elements
    "a" and "z".

    More precisely, the current set of collation elements is defined as a
    tailored collation, rather than simply an ordered set of collation elements:
    every local switch to another locale changes (locally within the delimited
    part of the regexp) the current tailored collation to the collation
    associated to that locale.

    This refined definition is normally not needed for simple regexps that try
    to match /[set]/ or /[^set]/, including when the sets contain ranges like in
    /[a-z]/ or /[^a-z]/. This will be needed when trying to match more complex
    ranges based on multiple collation elements.

    One could want finding matches for strings that collate between two strings
    in a given locale. This would require another notation than ranges in []
    sets. For example one could use (?range:string1:string2:regexp3) which does
    not modify the current ordered set of collation elements, but where the
    three parameters "string1" and "string2" and "regexp3" are parsed according
    to the current tailored collation.

    For example to match all 3 letters words in Spanish between c and d
    (inclusive, but "c" and "d" won't match because they are not 3 letters) one
    would use /(?locale=es:(?range:c:d:...))/

    A Spanish user would just enter the simpler regexp /(?range:c:d:...)/
    without needing to qualify its search with a delimited locale scope.

    Note here that there are three dots, each one matches a single Spanish
    collation element, according to the Spanish collation which is in scope, but
    the whole expression will match only those strings that collate between "c"
    and "d" inclusively (the boundaries of a string range do not have to match
    the third regular expression, and if they don't they won't be possible
    matches)

    A notation similar to the (?range:) notation could be done for specifying
    unbounded minimum or maximum boundaries.
    - given that the smallest minimum boundary is the empty string:
    (?range::d:.*) will match every string lower than or equal to the upper
    bound "d", in the current tailored collation.
    - given that the largest maximum boundary is an infinite string there would
    be no way to specify it, but one could simply remove one parameter to the
    range, so that (?range:c:.*) will just specify the lower bound "c" and the
    last parameter ".*" will be the regexp defining candidate matches to check
    with the range condition, and will match any string higher than or equal to
    "c" in the current tailored collation.

    To exclude the boundaries from a range, one could use the "^" special
    character for this purpose (which could be quoted if needed for interpreting
    it literally). It could be used for both the upper or lower bound, including
    the empty string in the lower bound; beside this special meaning of "^" and
    the interpretation of escapes, the upper and lower bounds in string range
    conditions are not regexps but string literals.

    With such system, we have now defined regexps that make sense linguistically
    for every locale, and a system that allow regexps to be also interpreted
    without depending on the user locale,when this is needed, and a system that
    allows matching additional collation elements when needed.

    The complexity here is not in the implementation of the classical finite
    state automata used in regexp compilers and matchers: it is within the
    dynamic construction and use of multiple tailored collations (but only one
    collation is used in each state of the finite state automata), and in its
    input lexer, because the same input file may need to be "lexed" into
    different collation elements, in parallel, according to the current states
    of the finite state automata, each state of the automata referencing its own
    input lexer depending on the locale in scope.

    ----
    Note that in regexp matchers, the finite state automata built from the
    regexp has multiple active states in their state transition graph, and the
    number of active states varies after each input because every transition in
    the graph is possible and must be tested in simultaneously, due to
    aggregates like "+", "*", "?" or "{m,n}" counters, but also because of
    alternatives with "|" and "[set]" which may not be exclusive; on the
    opposite, the construction of the graph of transitions and their condition,
    is trivial from the regexp itself.
    In the transition graph, there's only one node representing the initial
    state and only one node representing a final matching state (however when a
    match is found, this may not be the only active node, and if multiple
    matches must be searched one must add a null transition from the final node
    to the initial node, but keep the existing list of active nodes when
    continuing the search). There is also often an additinial final state
    meaning the absence of match in the current state (this node has no
    successor, the only transition from this terminal node requires an
    unconditional read of one default collation element from the input before
    looping unconditionally back to the initial node).
    There are algorithmic methods to optimize the traversal of the transition
    graph (by the match method fed with the text input):
    * they try reducing the number of active nodes and the number of transitions
    (between two connected nodes) to test at each step, but many of these local
    optimization will explode the number of nodes needed in the transition
    graph;
    * so many implementations of regexp matchers don't perform such local
    optimizations (which may require too much memory and computing time for
    returning the optimum but very large graph):
    * such optimization of the transition graph is normally performed in the
    regexp compilation method, that takes a regexp (plus optional global flags
    for its specifying its semantic and additional restrictions or
    case-insensitive extensions) and returns the compiled transition graph;
    * for details, look into the wellknown Aho-Seti-Ullman book or other good
    reference book on the implementation of compilers, finite-state automatas
    and regexps...
    


    This archive was generated by hypermail 2.1.5 : Mon Oct 01 2007 - 00:53:30 CST