RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Sep 23 2007 - 02:19:46 CDT

    ________________________________________
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    behalf of Mark Davis
    Sent: Friday, September 21, 2007 18:33
    To: Mike; Andy Heninger
    Cc: unicode@unicode.org; UTC
    Subject: Re: New Public Review Issue: Proposed Update UTS #18

    > allowing multiple values in a property definition such as \p{gc=L|M|N} or
    \p{nv>=10}.

    Allowing multiple values is a nice way to compact the regex. Similarly, in
    my implementation I actually allow a regex within the property value, so for
    example have \p{name=/.*MARK.*/} to pick up all the Unicode characters with
    "MARK" in their name. A bit squirrely, but very handy. We might mention some
    of these techniques as possibilities.
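
As an illustration of the two property forms quoted above, here is a minimal Python sketch built on the standard `unicodedata` module. The helper names `chars_with_name` and `chars_with_gc` are hypothetical, and this is not the implementation Mark describes: it only shows that \p{name=/.*MARK.*/} and \p{gc=L|M|N} can both be read as plain predicates over code points. The result depends on the Unicode version bundled with the Python build.

```python
import re
import unicodedata

def chars_with_name(pattern):
    """Characters whose Unicode name fully matches the regex `pattern`.

    Rough analogue of \\p{name=/pattern/}; hypothetical helper.
    """
    name_re = re.compile(pattern)
    matched = set()
    for cp in range(0x110000):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" when the code point has no name
        if name and name_re.fullmatch(name):
            matched.add(ch)
    return matched

def chars_with_gc(*majors):
    """Characters whose General_Category starts with one of `majors`.

    Rough analogue of \\p{gc=L|M|N}; hypothetical helper.
    """
    return {
        chr(cp)
        for cp in range(0x110000)
        if unicodedata.category(chr(cp))[0] in majors
    }

marks = chars_with_name(r".*MARK.*")
letters_marks_numbers = chars_with_gc("L", "M", "N")
```

Both helpers scan the whole code space once, which is acceptable for a demonstration; a real engine would presumably precompute or index these sets.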

    > As far as your other comments (copied below), the issue is as to what
    [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
    > • The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
    "ñ", "ch", "ll", "rr"}.
    > • The set inversion would be the set of all other strings. So that would
    include "0", "A", ... but also "New York", and "onomatopoeic", and so on. An
    infinite set.

    Why do you assume such a huge extension of the input universe?

    All that is needed is that the inverted set be the universe minus the
    positive set, and that /./ include every possible positive set, so that
    {/[set]/, /[^set]/} is an exact partition of the universe of acceptable
    input units.
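
The partition argument can be sketched with ordinary Python sets. The universe below is a deliberately tiny, hypothetical alphabet of input units (a few single characters plus the multi-character units "ch", "ll", "rr"); the point is only that negation is computed against a finite universe, so strings such as "New York" never enter the picture.

```python
import string

# Hypothetical finite universe of input units: what /./ is allowed to match.
MULTI_CHAR_UNITS = {"ch", "ll", "rr"}
UNIVERSE = set(string.ascii_lowercase) | {"ñ", "0", "A", " "} | MULTI_CHAR_UNITS

# [a-z ñ \q{ch} \q{ll} \q{rr}]
positive = set(string.ascii_lowercase) | {"ñ"} | MULTI_CHAR_UNITS

def negate(unit_set, universe=UNIVERSE):
    """[^set] is the universe minus the positive set, nothing more."""
    return universe - unit_set

negated = negate(positive)
```

With this universe, `negated` is just `{"0", "A", " "}`: together with `positive` it covers the universe, and the two sets do not overlap.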

    For you, it should be enough to include in the /./ universe all the UCD
    codepoints that your regexp engine will accept in source texts, converted to
    one of their normalized forms (such as NFC or NFD).

    You are not required to include in /./ all code points in the UCS. You may
    restrict /./ to only the assigned and valid characters that you allow to be
    referenced in valid /simple text/ regexps, and in that case you must also
    accept these in valid /[set]/ regexps and in valid /[^set]/ regexps. (The
    syntax used in the regexp to reference them does not matter: it may require
    escaping them, but escaping does not change the universe or what they
    represent.)

    As a consequence a non-empty file that does not contain any match for [set]
    will not necessarily contain a match for [^set]: this will be the case if
    the file cannot be read as a series of units containing only elements of
    your /./ universe (for example if it contains unassigned characters and your
    /./ universe contains only assigned characters).
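
A small Python sketch of that consequence, assuming a universe made of every code point whose General_Category is not Cn. The function names are hypothetical. The sample input is U+FFFF, a noncharacter that is permanently Cn, chosen so the outcome does not depend on the Unicode version of the Python build.

```python
import unicodedata

def in_universe(ch):
    """Assumed universe: every code point that is not General_Category Cn."""
    return unicodedata.category(ch) != "Cn"

def matches_set(ch, members):
    """/[set]/ on a single input unit."""
    return in_universe(ch) and ch in members

def matches_negated_set(ch, members):
    """/[^set]/ on a single input unit: inside the universe, outside the set."""
    return in_universe(ch) and ch not in members

def file_has_match(text, predicate, members):
    """True if any unit of `text` satisfies `predicate`."""
    return any(predicate(ch, members) for ch in text)

vowels = set("aeiou")
outside_only = "\uffff"  # non-empty input with no unit inside the universe
```

For `outside_only`, neither `matches_set` nor `matches_negated_set` finds anything, whereas an input such as "xyz" has no match for the set but does have one for its negation.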

    Users expect, first of all, that /[set]/ and /[^set]/ form a partition of
    the "universe" { /./ union /\R/ } of input units (the "alphabet" in
    lexers).

    This remains true even if you use single-line mode, where line terminators
    are members of /./, including the two-character sequence
    /\q{\u000D\u000A}/, because /\R/ here is a set such that:
    * in multiline mode, /\R/ contains this sequence and all other
    single-character line terminators in /[\n\v\f\r\p{Zl}\p{Zp}]/ that your
    engine will accept on input files, and has an empty intersection with /./;
    * in single line mode, the /\R/ subset is fully included within /./, so /./
    is the "universe", so that /./ also matches any line terminator;
    * in both cases, the "universe" is { /./ union /\R/ }, and your lexer can be
    built on this finite universe, even if it is built based on bitsets without
    internal representation of negated sets.
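
The three points above can be sketched with Python integers used as bitsets. Everything here is an illustrative assumption: the ordinary units are a tiny stand-in alphabet, the line terminators are the ones listed above, and the names (`bitset`, `dot`, `universe`, `negate`) are hypothetical. Negation is resolved eagerly to a positive bitset, so no negated set is ever stored.

```python
# Units the hypothetical engine accepts; CRLF is a single two-character unit.
LINE_TERMINATORS = ["\r\n", "\n", "\v", "\f", "\r", "\u2028", "\u2029"]
ORDINARY_UNITS = ["a", "b", "0", " "]  # stand-in for all other accepted units
UNITS = ORDINARY_UNITS + LINE_TERMINATORS
INDEX = {unit: i for i, unit in enumerate(UNITS)}

def bitset(units):
    """Encode an iterable of units as an integer with one bit per unit."""
    bits = 0
    for unit in units:
        bits |= 1 << INDEX[unit]
    return bits

R = bitset(LINE_TERMINATORS)       # /\R/
ORDINARY = bitset(ORDINARY_UNITS)

def dot(single_line):
    """/./ : excludes line terminators unless single-line mode is on."""
    return ORDINARY | R if single_line else ORDINARY

def universe(single_line):
    """{ /./ union /\\R/ }, identical in both modes."""
    return dot(single_line) | R

def negate(bits, single_line):
    """/[^set]/ as a plain positive bitset: universe minus the set."""
    return universe(single_line) & ~bits

sample = bitset(["a", "\n"])
```

In multiline mode `dot(False) & R` is zero, in single-line mode `R` is wholly contained in `dot(True)`, and `universe` returns the same value either way, so `negate` yields the same complement in both modes.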


