Re: Computing default UCA collation tables

From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue May 20 2003 - 12:26:59 EDT


    There are a number of errors in this account.

    > As far as I know, ICU does not preprocess the "allkeys.txt", but
    > uses its own "FractionalUCA.txt" file, probably manually edited
    > after some intermediate parsing

    I am pretty darned confident that ICU does preprocess the
    allkeys.txt file, since I wrote the program myself. And there is no
    manual editing.

    > also it is currently derived from Unicode 3.2

    It is generated from the allkeys.txt for 3.1, because there is no
    allkeys.txt for 3.2; the next one will be for 4.0. It does use the
    canonical decompositions from the latest version of Unicode, as
    described in conformance clause C4 of
    http://www.unicode.org/reports/tr10/tr10-9.html#Conformance
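
    As a minimal sketch of what C4 permits (Python; collation_input is
    just an illustrative name, not part of any specification or tool):
    the weights come from the 3.1 table, while the normalization step
    may use the canonical decompositions of a later UCD.

        import unicodedata

        def collation_input(s: str) -> str:
            # Normalize with whatever UCD version the runtime provides
            # (here, the one Python's unicodedata module ships), even
            # though weights are then looked up in the 3.1 allkeys.txt.
            return unicodedata.normalize("NFD", s)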

    > because it uses a weight ordering that is not documented in the
    > UCA collation rules specification

    While the UCA specification is not required to document each weight,
    if there are particular instances that would be useful to document, we
    can look at those. Please make a submission to that effect. And note
    that the proposed update will be discussed at the upcoming UTC meeting
    in June, so please make it well before then:

    http://www.unicode.org/reports/tr10/tr10-10.html

    > proving that this file was edited manually and may contain errors
    > or other omissions.

    The file allkeys.txt *is* generated by a program that Ken Whistler
    developed, called 'sifter'. It takes as input information about the
    relative ordering of certain characters, plus special data for
    characters that collate as if they were decomposed. (I have an
    independent program that verifies that the output of the sifter
    meets various consistency requirements, such as canonical
    equivalence, transitivity, and non-overlap.) The actual ordering
    that Ken's program uses is based on decisions made over time by the
    UTC and WG20. The file is not "edited manually"; it is generated
    from a much smaller set of data.
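
    As a rough illustration of that kind of check, here is a sketch in
    Python (the parsing and the longest-match lookup are simplifying
    assumptions of mine, not the actual sifter or verifier): parse
    allkeys.txt, then confirm that every entry with a canonical
    decomposition is weighted the same as its NFD form.

        import re
        import unicodedata

        def parse_allkeys(path):
            # Maps each code point sequence to its list of
            # (variable, primary, secondary, tertiary) elements;
            # any fourth weight field is ignored.
            line_re = re.compile(
                r"^([0-9A-F]{4,6}(?: [0-9A-F]{4,6})*)\s*;\s*((?:\[[.*][^\]]*\])+)")
            elem_re = re.compile(
                r"\[([.*])([0-9A-F]+)\.([0-9A-F]+)\.([0-9A-F]+)")
            table = {}
            with open(path, encoding="utf-8") as f:
                for line in f:
                    m = line_re.match(line)
                    if not m:
                        continue  # comments, @version lines, blanks
                    chars = "".join(chr(int(cp, 16))
                                    for cp in m.group(1).split())
                    table[chars] = [
                        (v == "*", int(p, 16), int(s, 16), int(t, 16))
                        for v, p, s, t in elem_re.findall(m.group(2))]
            return table

        def elements(table, s):
            # Longest-match lookup; enough for a sanity check. A real
            # implementation also derives implicit weights for
            # characters missing from the table.
            out, i = [], 0
            while i < len(s):
                for j in range(len(s), i, -1):
                    if s[i:j] in table:
                        out.extend(table[s[i:j]])
                        i = j
                        break
                else:
                    i += 1
            return out

        def check_canonical_equivalence(table):
            # Every entry should weigh the same as its NFD form.
            for chars, elems in table.items():
                nfd = unicodedata.normalize("NFD", chars)
                if nfd != chars:
                    assert elements(table, nfd) == elems, repr(chars)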

    > also incomplete with respect to Unicode 4

    The proposed update makes clear that an updated data file for 4.0
    is being prepared but is not yet available.

    > (the only thing that is normative, the "allkeys.txt" being just
    > informative and a correct implementation of the specified rules).

    The 'allkeys.txt' data is *not* normative in the sense that it is not
    required for any given language (it is expected that it will be
    tailored for most if not all languages). However, it *is* normative in
    the sense that if you claim to support the Default Unicode Collation
    Element Table (in allkeys.txt), and yet do not produce the same
    ordering as the specification would produce, you are violating C1. If
    that is not clear from the text, we should make it so in the proposed
    update.
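
    To make "produce the same ordering" concrete, here is a minimal
    sketch of the default sort key (Python, reusing the hypothetical
    parse_allkeys/elements helpers sketched above, and ignoring
    variable weighting and implicit weights): all the nonzero weights
    of one level, then a zero separator, then the next level.

        import unicodedata

        def sort_key(table, s, levels=3):
            # Assumes elements() from the earlier sketch; each element
            # is (variable, primary, secondary, tertiary).
            elems = elements(table, unicodedata.normalize("NFD", s))
            key = []
            for level in range(levels):
                for e in elems:
                    w = e[1 + level]
                    if w:              # zero weights are skipped
                        key.append(w)
                key.append(0)          # level separator
            return tuple(key)

    C1 in practice: sorting with sorted(strings, key=lambda s:
    sort_key(table, s)) must agree with the ordering the specification
    defines for the Default Unicode Collation Element Table.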

    Mark Davis
    ________
    mark.davis@jtcsv.com
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    To: "Mark Davis" <mark.davis@jtcsv.com>
    Cc: <unicode@unicode.org>
    Sent: Tuesday, May 20, 2003 07:55
    Subject: Re: Computing default UCA collation tables

    > From: "Mark Davis" <mark.davis@jtcsv.com>
    > To: "Philippe Verdy" <verdy_p@wanadoo.fr>; <unicode@unicode.org>
    > Sent: Tuesday, May 20, 2003 2:14 AM
    > Subject: Re: Computing default UCA collation tables
    >
    >
    > > This is a very long document; I only have a few brief comments.
    > >
    > > 1. Much of the UCA is derivable from a simpler set of orderings
    > > and rules. The format of the table, however, is intended to make
    > > it usable without having a complex set of rules for derivation.
    > >
    > > 2. However, you or anyone else could make a modified version
    > > which was simplified in one way or another. For example, ICU
    > > preprocesses the table to reduce the size of sort keys (see the
    > > ICU design docs if you are curious: oss.software.ibm.com/icu/).
    > > There are other ways that someone could preprocess the table.
    > > For example, you could drop all those characters whose weights
    > > are computable from their NFKD form and then compute them at
    > > runtime.
    >
    > As far as I know, ICU does not preprocess the "allkeys.txt", but
    > uses its own "FractionalUCA.txt" file, probably manually edited
    > after some intermediate parsing (also, it is currently derived
    > from Unicode 3.2, because "allkeys.txt" had not yet been created
    > for Unicode 3.2, probably because such processing is difficult or
    > impossible to perform automatically).
    >
    > That's why I wondered how the "allkeys.txt" was really produced,
    > because it uses a weight ordering that is not documented in the
    > UCA collation rules specification (the only thing that is
    > normative, the "allkeys.txt" being just informative and a correct
    > implementation of the specified rules).
    >
    > > 3. Scattered in and among your analysis are points where you
    > > believe there is an error. I'd like to emphasize again that the
    > > UTC does not put arbitrary email from the mailing lists on its
    > > agenda. If there are items that you would like to see
    > > considered, you can extract them (and their justification) from
    > > this document, and use the feedback mechanism on the Unicode
    > > site to submit them.
    >
    > Yes, my message was long, but I wanted to show the many points
    > that come from analyzing the "allkeys.txt" proposed as an
    > informative reference, and I wondered how to simply create a
    > conforming collation without importing the full text file (which
    > is not only very large for an actual implementation, but also
    > incomplete with respect to Unicode 4, implements some custom
    > tailorings that are NOT described in the UCA reference, and
    > probably contains a few incoherencies, proving that this file was
    > edited manually and may contain errors or other omissions).
    >
    > However, analyzing how the table was produced makes it possible
    > to create a simpler "meta"-description of its content, where this
    > file could be generated from a much simpler file (or set of
    > files), so that such a large table could be more easily
    > maintained (even if there is some manual tailoring for specific
    > scripts, or for scripts that still don't have any coherent
    > collation order, such as Han).
    >
    > So although I think that this table MAY be useful for some
    > applications, I still think that it is not usable in the way it
    > is presented.
    >
    > Also, my previous message clearly demonstrated that this
    > collation table uses some sort of "collation decomposition" which
    > includes collation elements that can be thought of as "variants"
    > or "letter modifiers" for which there is no corresponding
    > encoding in the normative UCD with an associated normative NFD or
    > NFKD decomposition.
    >
    > The current presentation of this table (with 4 collation weights
    > per collation element) does not ease its implementation; a
    > simpler presentation with a unique weight (selected in a range
    > that clearly indicates which collation level it belongs to) would
    > have been much more useful and much simpler to implement.


