Re: polytonic Greek: diacritics above long vowels á¾±, á¿‘, á¿¡ from Philippe Verdy on 2013-08-06 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 7 Aug 2013 00:44:27 +0200

2013/8/6 Richard Wordingham <richard.wordingham_at_ntlworld.com>

> On Tue, 6 Aug 2013 19:27:56 +0200
> Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
> > But there's an admitted exception : sorting with UCA may change the
> > relative order between the source strings, simply because sort
> > stability is not always wanted (it has a cost), and binary sorting
> > the results using the code point values as an additional collation
> > level is not always wanted, and normalization remains optional in
> > UCA.
>
> No

Why no? You don't give any real objection here, because you modify the
rules. LEt's suppose that you use a stable sort algorithm and that the
compare function does not normalize the input, but still performs UCA
comparison up to some strength level (e.g. only at primary level), and then
just performs a simple binary sort, the list will not be be necessarily
listing items in the same ouput order : if you convert this soeted list as
a single plain-text with one lie per item (assuming that items do not
contain any newline) and separate these items by newlines. And then you
normalize this plain text file, then the resulting sorted then normalized
list shoudl compare exactly equal to the sorted list generated from
indivudually normalized input data.

i.e. we should assert:

implode_list(sort(items[], UCA_compare_function)) ==
implode_list(sort(normalize(items[]), UCA_compare_function)

But the sort() function here is not required to replace the input by its
normalized version so that only normalized strigns will be seen by the
compare_function (an UCA compare function does not modify its input, it
will normalize it *internally* before replying that item1 is lower than,
equal to, or higher than item2). The sort function just swaps the items
without modifying them, and outputs them exactly in the same form that was
present in the input. So the first expression (before the equals sign) will
not be necessarily canonically equivalent to the second expression, even if
the two inputs:
items[]
and
normalize(items[])
are canonically equivalent (item per item in each input array of texts,
both having the same total number of items).

The canonical equivalent od the two ouputs require that inputs are all
normalized FIRST before sorting. But UCA sorting does not have to perform
this and generally it won't do that, even if it performs an internal
normalization, unless this internal normalization replaces the data of the
input before sorting it; but then it won't be able to output the original
items : let's suppose that the input contained ONLY unique elements without
duplicates and that they were unique identifiers, the replacement of the
input list with normalized identifiers will violate the uniqueness
constraint. If you count the number of unique elements in the output it may
be *lower* than the number of unique element in the non-normalized input
source.

In summary, UCA-sorting results in canonically equivalent outputs (same
order) *only* if the input is warrantied to be already normalized (using
the same normalization form, which could be NF(K)C, or NF(K)D, or any other
non-standard normalization that generates exactly the same binary ouput
from all possible inputs that are canonically equivalent), or if there's no
pair of items that are considered "equal" by the UCA_compare_function).

Even when using an UCA_compare_function with its maximum strength there are
still lots of binary distinct items which are compared equal at all
strengths by UCA collations, even when the collation is tailored using UCA
rulesets. And this exactly concerns ALL pairs of items that are canonically
equvalent but are using different normalizations forms (their difference is
only binary at the codepoint level).

Note however that the UCA_compare_function is conforming : in this case. it
will return "equal", but it has nothing to do with the sort function which
actually do not change items, but swaps them conditionally according to the
result (lower, equal, higher) of the compare function.

And what we call "UCA sorting" is a standard binary sort using an
conforming UCA compare function. UCA sorting by itself, with just this
definition is then **not necessarily** a conforming process, even if the
UCA compare function used is conforming...
Received on Tue Aug 06 2013 - 17:53:16 CDT

This archive was generated by hypermail 2.2.0 : Tue Aug 06 2013 - 17:53:18 CDT