Re: Computing default UCA collation tables

From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue May 20 2003 - 12:26:59 EDT


    There are a number of errors in this account.

    > As far as I know, ICU does not preprocess the "allkeys.txt", but
    > uses its own "FractionalUCA.txt" file, probably manually edited
    > after some intermediate parsing

    I am pretty darned confident that ICU does preprocess the
    allkeys.txt file, since I wrote the program myself. And there is no
    manual editing.

    > also it is currently derived from Unicode 3.2

    It is generated from the allkeys.txt for 3.1, because there is no
    allkeys.txt for 3.2; the next one will be for 4.0. It does use the
    canonical decompositions from the latest version of Unicode, as
    described in conformance clause C4 of
    http://www.unicode.org/reports/tr10/tr10-9.html#Conformance
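
    As a minimal sketch of what C4 permits (Python; collation_input is
    just an illustrative name, not part of any specification or tool):
    the weights come from the 3.1 table, while the normalization step
    may use the canonical decompositions of a later UCD.

        import unicodedata

        def collation_input(s: str) -> str:
            # Normalize with whatever UCD version the runtime provides
            # (here, the one Python's unicodedata module ships), even
            # though weights are then looked up in the 3.1 allkeys.txt.
            return unicodedata.normalize("NFD", s)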

    > because it uses a weight ordering that is not documented in the
    > UCA collation rules specification

    While the UCA specification is not required to document each weight,
    if there are particular instances that would be useful to document, we
    can look at those. Please make a submission to that effect. And note
    that the proposed update will be discussed at the upcoming UTC meeting
    in June, so please make it well before then:

    http://www.unicode.org/reports/tr10/tr10-10.html

    > proving that this file was edited manually and may contain errors
    > or other omissions.

    The file allkeys.txt *is* generated by a program that Ken Whistler
    developed, called 'sifter'. It takes as input information about the
    relative ordering of certain characters, plus special data for
    characters that collate as if they were decomposed. (I have an
    independent program that verifies that the output of the sifter
    meets various consistency requirements, such as canonical
    equivalence, transitivity, and non-overlap.) The actual ordering
    that Ken's program uses is based on decisions made over time by the
    UTC and WG20. The file is not "edited manually"; it is generated
    from a much smaller set of data.
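
    As a rough illustration of that kind of check, here is a sketch in
    Python (the parsing and the longest-match lookup are simplifying
    assumptions of mine, not the actual sifter or verifier): parse
    allkeys.txt, then confirm that every entry with a canonical
    decomposition is weighted the same as its NFD form.

        import re
        import unicodedata

        def parse_allkeys(path):
            # Maps each code point sequence to its list of
            # (variable, primary, secondary, tertiary) elements;
            # any fourth weight field is ignored.
            line_re = re.compile(
                r"^([0-9A-F]{4,6}(?: [0-9A-F]{4,6})*)\s*;\s*((?:\[[.*][^\]]*\])+)")
            elem_re = re.compile(
                r"\[([.*])([0-9A-F]+)\.([0-9A-F]+)\.([0-9A-F]+)")
            table = {}
            with open(path, encoding="utf-8") as f:
                for line in f:
                    m = line_re.match(line)
                    if not m:
                        continue  # comments, @version lines, blanks
                    chars = "".join(chr(int(cp, 16))
                                    for cp in m.group(1).split())
                    table[chars] = [
                        (v == "*", int(p, 16), int(s, 16), int(t, 16))
                        for v, p, s, t in elem_re.findall(m.group(2))]
            return table

        def elements(table, s):
            # Longest-match lookup; enough for a sanity check. A real
            # implementation also derives implicit weights for
            # characters missing from the table.
            out, i = [], 0
            while i < len(s):
                for j in range(len(s), i, -1):
                    if s[i:j] in table:
                        out.extend(table[s[i:j]])
                        i = j
                        break
                else:
                    i += 1
            return out

        def check_canonical_equivalence(table):
            # Every entry should weigh the same as its NFD form.
            for chars, elems in table.items():
                nfd = unicodedata.normalize("NFD", chars)
                if nfd != chars:
                    assert elements(table, nfd) == elems, repr(chars)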

    > also incomplete with respect to Unicode 4

    The proposed update makes clear that an updated data file for 4.0
    is being prepared but is not yet available.

    > (the only thing that is normative, the "allkeys.txt" being just
    > informative and a correct implementation of the specified rules).

    The 'allkeys.txt' data is *not* normative in the sense that it is not
    required for any given language (it is expected that it will be
    tailored for most if not all languages). However, it *is* normative in
    the sense that if you claim to support the Default Unicode Collation
    Element Table (in allkeys.txt), and yet do not produce the same
    ordering as the specification would produce, you are violating C1. If
    that is not clear from the text, we should make it so in the proposed
    update.
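
    To make "produce the same ordering" concrete, here is a minimal
    sketch of the default sort key (Python, reusing the hypothetical
    parse_allkeys/elements helpers sketched above, and ignoring
    variable weighting and implicit weights): all the nonzero weights
    of one level, then a zero separator, then the next level.

        import unicodedata

        def sort_key(table, s, levels=3):
            # Assumes elements() from the earlier sketch; each element
            # is (variable, primary, secondary, tertiary).
            elems = elements(table, unicodedata.normalize("NFD", s))
            key = []
            for level in range(levels):
                for e in elems:
                    w = e[1 + level]
                    if w:              # zero weights are skipped
                        key.append(w)
                key.append(0)          # level separator
            return tuple(key)

    C1 in practice: sorting with sorted(strings, key=lambda s:
    sort_key(table, s)) must agree with the ordering the specification
    defines for the Default Unicode Collation Element Table.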

    Mark Davis
    ________
    mark.davis@jtcsv.com
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    To: "Mark Davis" <mark.davis@jtcsv.com>
    Cc: <unicode@unicode.org>
    Sent: Tuesday, May 20, 2003 07:55
    Subject: Re: Computing default UCA collation tables

    > From: "Mark Davis" <mark.davis@jtcsv.com>
    > To: "Philippe Verdy" <verdy_p@wanadoo.fr>; <unicode@unicode.org>
    > Sent: Tuesday, May 20, 2003 2:14 AM
    > Subject: Re: Computing default UCA collation tables
    >
    >
    > > This is a very long document; I only have a few brief comments.
    > >
    > > 1. Much of the UCA is derivable from a simpler set of orderings
    > > and rules. The format of the table, however, is intended to make
    > > it usable without having a complex set of rules for derivation.
    > >
    > > 2. However, you or anyone else could make a modified version
    > > which was simplified in one way or another. For example, ICU
    > > preprocesses the table to reduce the size of sort keys (see the
    > > ICU design docs if you are curious: oss.software.ibm.com/icu/).
    > > There are other ways that someone could preprocess the table.
    > > For example, you could drop all those characters whose weights
    > > are computable from their NFKD form and then compute them at
    > > runtime.
    >
    > As far as I know, ICU does not preprocess the "allkeys.txt", but
    > uses its own "FractionalUCA.txt" file, probably manually edited
    > after some intermediate parsing (also, it is currently derived
    > from Unicode 3.2, because "allkeys.txt" had not yet been created
    > for Unicode 3.2, probably because such processing is difficult or
    > impossible to perform automatically).
    >
    > That's why I wondered how the "allkeys.txt" was really produced,
    > because it uses a weight ordering that is not documented in the
    > UCA collation rules specification (the only thing that is
    > normative, the "allkeys.txt" being just informative and a correct
    > implementation of the specified rules).
    >
    > > 3. Scattered in and among your analysis are points where you
    > > believe there is an error. I'd like to emphasize again that the
    > > UTC does not put arbitrary email from the mailing lists on its
    > > agenda. If there are items that you would like to see
    > > considered, you can extract them (and their justification) from
    > > this document, and use the feedback mechanism on the Unicode
    > > site to submit them.
    >
    > Yes, my message was long, but I wanted to show the many points
    > that come from analyzing the "allkeys.txt" proposed as an
    > informative reference, and I wondered how to simply create a
    > conforming collation without importing the full text file (which
    > is not only very large for an actual implementation, but also
    > incomplete with respect to Unicode 4, implements some custom
    > tailorings that are NOT described in the UCA reference, and
    > probably contains a few incoherencies, proving that this file was
    > edited manually and may contain errors or other omissions).
    >
    > However, analyzing how the table was produced makes it possible
    > to create a simpler "meta"-description of its content, where this
    > file could be generated from a much simpler file (or set of
    > files), so that such a large table could be more easily
    > maintained (even if there is some manual tailoring for specific
    > scripts, or for scripts that still don't have any coherent
    > collation order, such as Han).
    >
    > So although I think that this table MAY be useful for some
    > applications, I still think that it is not usable in the way it
    > is presented.
    >
    > Also, my previous message clearly demonstrated that this
    > collation table uses some sort of "collation decomposition" which
    > includes collation elements that can be thought of as "variants"
    > or "letter modifiers" for which there is no corresponding
    > encoding in the normative UCD with an associated normative NFD or
    > NFKD decomposition.
    >
    > The current presentation of this table (with 4 collation weights
    > per collation element) does not ease its implementation; a
    > simpler presentation with a unique weight (selected in a range
    > that clearly indicates which collation level it belongs to) would
    > have been much more useful and much simpler to implement.


