RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Sep 23 2007 - 01:31:33 CDT

  • Next message: Philippe Verdy: "RE: Normalization in panlingual application"

    Andy Heninger wrote:
    > Regarding the question of how to complement a [^set] that
    > contains strings, or grapheme clusters, or collation elements,
    > or whatever we want to call them, I am still struggling with
    > what it means, and what makes sense.

    The first intutitive approach to what [^set] means is that it should match
    everywhere [set] does not match, and [set] should match everywhere [^set]
    doest not match, i.e. they should be perfect complementary of each other.

    But already, they are aren't perfect complements because both will exclude
    line terminators in multiline mode. This is solved by saying that, in
    multiline mode, there's no line terminator in any content that is accessible
    to searches (each line is treated as a separate text, whose line terminators
    have been dropped), so that "." (the universe for matching classes) is the
    union of [set] and [^set] (these two subsets create a partition of the "."
    set).

    Now if you accept digraphs or grapheme clusters in [set], you should accept
    them also in [^set] and "." should also include all digraphs and grapheme
    clusters, but this means that "." will need to include all possible texts,
    because digraphs are not limited in size. As this seems unreasonable
    (because it would make counting the number of matches with "." impossible to
    perform), it seems reasonable to exclude the possibility of using digraphs
    in [set].

    There remains the possibility of including (default) grapheme clusters in
    the "." universe (and as a consequence in [set] and [^set]) but the exact
    definition of grapheme clusters that are counted as 1 unit or "." remains a
    problem to specify; this is something that will depend on the implementation
    of regexps (some implementations will allow you to specify which strings the
    "." universe includes), but at least for Unicode, the minimum universe
    should be the UCS, limited to single code points.

    But this means a regexp engine using this minimum set will be able to find E
    WITH ACUTE only if it is encoded as a single code point, but not when it is
    encoded in NFD form despite it is canonically equivalent; such regexp engine
    will then not be a conforming Unicode process because it will generate
    different output (distinct sets of matches) from canonically equivalent
    inputs.

    So the idea of implanting regexps by making them find matchs in the NFD
    transformation of the input text is good as it creates a conforming process.
    The bad thing is that E WITH ACUTE is no more a single character and is then
    absent from the "." universe and can't be part of [set] and [^set].

    Another possibility is to include in the "." universe the NFD transformation
    of every code point of the UCS, in such a way that the sequence <C,
    COMBINING ACUTE ACCENT> is still counted as 1 unit, but <C, COMBINING ACUTE
    ACCENT, COMBINING CEDILLA > will be counted as 2 "." units (but then
    remember that "." is sensitive to Unicode versions).

    But then, should a search for <C WITH COMBINING ACUTE ACCENT>, equivalent to
    a search for <C, COMBINING ACUTE ACCENT> in NFD form, will easily match the
    text <C WITH COMBINING ACUTE ACCENT, COMBINING CEDILLA>, but should it match
    the text encoded as <C WITH CEDILLA, COMBINING ACUTE ACCENT>, which is
    canonically equivalent? If the intent is to produce a Unicode conforming
    process, it should be yes. So matches will be for two non-contiguous
    subtrings in the scanned text, excluding the CEDILLA part !

    * (1) If it includes C WITH CEDILLA completely, then the matched substring
    contains more than just C WITH COMBINING ACCENT (and this will be a problem
    is the matched string is used as a candidate for replacement by another
    text, because the replacement will drop the CEDILLA despite it was not
    looked for in the regexp).
    * (2) If it does not include the CEDILLA part, how to perform the
    replacement the two substrings in the text that are matching the regular
    expression? The way to do it would be to reorder the second substring just
    after the first one (allowed because the match was for any text canonically
    equivalent to the searched regexp), so that the CEDILLA will be left just
    after the replacement string. After the replacement has been made, the text
    may eventually be reordered in NFD or NFC form (but I think it should not be
    performed by default, as the user may may to delete manually this CEDILLA
    when it does not make sense to keep it after the replacement made in a text
    editor).

    With this profile (2), the universe "." needs not include all possible
    grapheme clusters (because there are countless). The "." universe contains:
    * all Unicode code points excepting those that have a canonical
    decomposition mapping in the UCD,
    * plus the strings in NFD form that result from the NFD form of UCD
    charactes that have a canonical decomposition mapping in the UCD.
    When looking for matches in the text, the regexp engine will scan the text
    by converting it internally into NFD form before looking up for elements in
    the "." universe above. Matches found will possibly include a discontinuous
    trailer (with combining characters) in the original text (not in NFD form),
    but they will be continuous in the internal NFD representation of the text.
    The trailer will necessarily be in the last characters of the corresponding
    match in the original text (it may be reduced to a NFD part of the last
    character of this source text).

    To perform replacements, one has to look at the last character in the
    matched source text, decompose it to NFD in an internal store, then going
    back to the beginning, comparing it to the last matched cluster in the
    regexp. The internal trailer, not matched, will not bereplaced, but will be
    reinserted in the text after the replacement string.

    But it is still a conforming process that produces equal sets of matches
    from canonically equivalent inputs.



    This archive was generated by hypermail 2.1.5 : Sun Sep 23 2007 - 01:35:24 CDT