RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 01 2007 - 12:32:55 CST

    Mike wrote:
    > > [...] as soon as you are introducing collation elements
    > > in regexps, these are sorted by collation, and collations are
    > > locale-sensitive...
    >
    > I don't see why they need to be sorted. All that matters is
    > that you find the longest match. [a-z\q{ch}] will match "ch"
    > in "chinchilla" rather than just "c".

    And what can you do with the negated class? How do you define it
    consistently?
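
    To make the question concrete, here is a minimal Python sketch. Python's
    re module has no \q{...} syntax, so the class is emulated with an
    alternation; the class [aeiou\q{ch}], the sample word "chico" and the
    three function names are assumptions chosen for illustration only. The
    sketch shows three plausible readings of the negated class, each giving
    a different set of matches.

```python
import re

# Emulation of Mike's [a-z\q{ch}]: the multi-character element comes first
# so that the longest alternative is tried first.
MIKE_CLASS = re.compile(r"ch|[a-z]")

# Hypothetical class [aeiou\q{ch}], emulated the same way.
POSITIVE = re.compile(r"ch|[aeiou]")


def negate_code_points(text):
    """Reading A: any single code point that is not a single-character member."""
    return re.findall(r"[^aeiou]", text)


def negate_positions(text):
    """Reading B: one code point at any position where the class does not match."""
    return re.findall(r"(?!ch|[aeiou]).", text, flags=re.DOTALL)


def negate_elements(text):
    """Reading C: split the text into elements ("ch" counts as one element),
    then keep the elements that are not members of the class."""
    elements = re.findall(r"ch|.", text, flags=re.DOTALL)
    return [e for e in elements if not POSITIVE.fullmatch(e)]


print(MIKE_CLASS.match("chinchilla").group())  # ch
print(negate_code_points("chico"))             # ['c', 'h', 'c']
print(negate_positions("chico"))               # ['h', 'c']
print(negate_elements("chico"))                # ['c']
```

    Reading A matches the "c" of "ch", reading B matches the "h" in the
    middle of "ch", and only reading C treats "ch" as a unit throughout.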

    With what you are defining, you are not providing what is needed to
    support locales correctly: you are restricting yourself to the "C"
    locale, whose collation is defined strictly by the binary order of code
    points and nothing else.

    So in this restricted "C" locale:
    * The classes of collation elements will only contain single code points
    (and effectively, in that locale, there's no possible extension of the set
    of collation elements which is exactly the range \u0000 to \u10FFFF in that
    order, all of them with only primary differences, so they are equal to their
    collation keys)
    * You won't recognize any Unicode canonical equivalence in regexps. (But
    then why are you recognizing them in scanned texts? This is inconsistent.)
    * You won't be able to recognize case-mappings consistently (for case
    insensitive searches), because collation elements will all be distinct with
    only primary differences, and no further levels.
    * Even if you restrict yourself to primary differences only, the only
    case mappings you will be able to recognize are the simple one-to-one case
    mappings defined in the main UCD file, excluding the special mappings (such
    as the consistent distinction between dotted and undotted Turkic "I",
    finding matches containing "SS" when a German sharp s is specified in the
    search string, or allowing matches for the final variants of Greek letters)
    * and many other restrictions.
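
    The restrictions listed above can be observed with a few comparisons.
    This is only a sketch using Python's standard unicodedata module and
    str.casefold(); the three helper names are assumptions, and Python's
    case folding is the locale-independent default one, so it stands in for
    an untailored matcher, not for any particular regexp engine.

```python
import unicodedata


def binary_equal(a, b):
    """Compare by code points only, as in the restricted "C" locale."""
    return a == b


def canonical_equal(a, b):
    """Compare after normalizing both sides to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def caseless_equal(a, b):
    """Compare using full, untailored (default) case folding."""
    return a.casefold() == b.casefold()


composed, decomposed = "\u00e9", "e\u0301"      # two spellings of e-acute

print(binary_equal(composed, decomposed))       # False
print(canonical_equal(composed, decomposed))    # True

# German sharp s: a one-to-one lowercase mapping cannot relate the two words.
print(binary_equal("STRASSE".lower(), "stra\u00dfe"))   # False
print(caseless_equal("STRASSE", "stra\u00dfe"))         # True

# Greek final sigma (U+03C2) and non-final sigma (U+03C3).
print(caseless_equal("\u03c2", "\u03c3"))               # True

# Turkic dotless i (U+0131): the default folding does not pair it with "I",
# so a Turkish or Azeri match needs a tailored mapping.
print(caseless_equal("I", "\u0131"))                    # False
```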

    This may finally be consistent for the "C" locale (with binary order), but
    you have not solved any of the linguistic needs and, even worse, your regexp
    matcher cannot be a Unicode-conforming process, because it will return
    distinct sets of matches depending on the encoding, normalization or
    non-normalization of the input text and input regexps.

    What you have done so far is a partial mix, which is intrinsically
    inconsistent as soon as you start converting input texts to NFD
    (i.e. applying a normalization to them without applying the same rule to
    the regexps).
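
    The inconsistency is easy to reproduce. In the sketch below (Python,
    standard re and unicodedata modules; the sample text and the helper name
    are assumptions), the text is converted to NFD but the pattern is left as
    typed, so a precomposed pattern no longer finds anything. Normalizing the
    pattern string as done here is only safe for a plain literal; a real
    engine would have to normalize inside the parsed regexp.

```python
import re
import unicodedata


def search_in_nfd_text(pattern, text, normalize_pattern):
    """Search an NFD-normalized copy of text, optionally normalizing the
    (literal) pattern the same way. Returns True if a match is found."""
    nfd_text = unicodedata.normalize("NFD", text)
    if normalize_pattern:
        pattern = unicodedata.normalize("NFD", pattern)
    return re.search(pattern, nfd_text) is not None


text = "caf\u00e9"       # precomposed e-acute in the scanned text
pattern = "\u00e9"       # the same character, typed precomposed in the regexp

print(search_in_nfd_text(pattern, text, normalize_pattern=False))  # False
print(search_in_nfd_text(pattern, text, normalize_pattern=True))   # True
```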

    I'm not advocating that Unicode regexps should support all locales. They
    should support at least the legacy "C" locale (with binary order), and a
    basic Unicode-based "U" locale (that is *reasonably* neutral to many
    locales) based on the full set of Unicode properties, and the DUCET
    collation elements (you have partly implemented it by recognizing many
    Unicode properties, but not all those needed for consistency).

    Other locales could be defined by tailoring (i.e. allowing the use of
    special case mappings for case-insensitive searches and the use of tailored
    collations). Many locales could be supported by a locally implemented
    database, or by external databases specified by the user. Some tailoring
    that does not depend on this preinstalled support for specific locales
    could be specified directly within regexps, working from the predefined "U"
    locale implemented with the character properties defined up to a given
    Unicode version (later versions could also be supported by user-defined
    tailoring installed on the system, or by specific tailoring directly within
    regexps).

    One could avoid the cost of having to handle complete collation (and then
    revert to the binary encoding) without affecting Unicode conformance
    (canonically equivalent texts and regexps will still be recognized) by
    making support for tailored collations a flag given to the regexp engine.
    In grep, for example, you would use a dash-option, while in sed or vi you
    would pass the flag after the regexp as a final flag character, such as C
    to enable UCA collation and matching by collation elements rather than by
    characters.

    You could also disable the recognition of canonical equivalences with
    another flag, but then you must do it consistently, by disabling it BOTH in
    the input texts AND in the input regexp, NOT only in the texts, as you have
    done and as you propose.

    You could make the C locale the default if you want (but in the C locale, no
    normalization should be performed on the input text), but there should exist
    a way to specify at least the Unicode neutral locale, where normalization of
    input texts is possible (and whose collation is the DUCET unless it is
    explicitly tailored by the content of the regexp itself).

    However, I don't think that normalizing input texts (to NFD in your
    implementation) is the best way to handle the matches found: normalization
    not only changes the input text before scanning, it also reorders parts of
    it, which creates severe difficulties for using the discovered matches, for
    example to apply replacements or other Unicode transforms.
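
    A small Python sketch of that difficulty (the sample strings are
    assumptions for illustration): offsets found in the NFD copy do not
    index the same characters in the original text, and combining marks may
    come back in a different order.

```python
import re
import unicodedata

original = "\u00e9l\u00e8ve ok"                 # "eleve ok" with two accents
scanned = unicodedata.normalize("NFD", original)

# The match is found in the normalized copy...
match = re.search("ve", scanned)
nfd_span = match.span()
print(nfd_span)                                 # (5, 7)

# ...but the same offsets select something else in the original text.
print(repr(original[nfd_span[0]:nfd_span[1]]))  # ' o'
print(original.find("ve"))                      # 3

# Canonical reordering: acute (class 230) typed before dot below (class 220).
typed = "a\u0301\u0323"
reordered = unicodedata.normalize("NFD", typed)
print(reordered == "a\u0323\u0301")             # True
```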

    My opinion is that input texts should not be altered, and that normalization
    should only be performed on output, and only if it is explicitly part of the
    transforms applied to matches, even if that means normalization is not
    performed on the whole output text (if needed, a user can perform it
    separately, or request it with an optional flag that is off by default).

    It's not up to regexps to perform normalizations, but it is up to them to
    recognize classes of canonically equivalent texts and to find identical
    sets of matches for them, if they want to be Unicode-conformant processes.
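
    As a rough sketch of what this could look like for the simplest case, a
    literal search, the Python code below compares normalized forms
    internally but reports offsets into the unaltered input. The function
    names are assumptions, a "segment" is taken to be a starter followed by
    its combining marks, and rare characters whose decomposition begins with
    a combining mark are ignored; it is not a full regexp engine.

```python
import unicodedata


def segments(text):
    """Yield (start, end) spans: a starter plus the combining marks after it."""
    start = 0
    for i in range(1, len(text)):
        if unicodedata.combining(text[i]) == 0:
            yield start, i
            start = i
    if text:
        yield start, len(text)


def find_canonical(needle, text):
    """Return (start, end) offsets in the ORIGINAL text of the first
    canonically equivalent occurrence of needle, or None."""
    target = unicodedata.normalize("NFD", needle)
    if not target:
        return None
    nfd = []      # normalized code points, used for comparison only
    origin = []   # origin[k]: original span of the segment behind nfd[k]
    for start, end in segments(text):
        for ch in unicodedata.normalize("NFD", text[start:end]):
            nfd.append(ch)
            origin.append((start, end))
    joined = "".join(nfd)
    pos = joined.find(target)
    while pos >= 0:
        end = pos + len(target)
        # Accept only matches that cover whole segments, so that "e" alone
        # does not match inside an accented "e".
        starts_ok = pos == 0 or origin[pos - 1] != origin[pos]
        ends_ok = end == len(joined) or origin[end - 1] != origin[end]
        if starts_ok and ends_ok:
            return origin[pos][0], origin[end - 1][1]
        pos = joined.find(target, pos + 1)
    return None


print(find_canonical("caf\u00e9", "un cafe\u0301 noir"))   # (3, 8)
print(find_canonical("cafe\u0301", "un caf\u00e9 noir"))   # (3, 7)
print(find_canonical("e", "caf\u00e9"))                    # None
```

    Both spellings of the text give a match for both spellings of the
    search string, and the returned offsets can be used directly for
    replacements because the input was never rewritten.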


