"Deterministic Sorting" (was Re: ZWNJ & Persian Collation)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Mar 13 2003 - 16:04:19 EST

    I want to point out three things.

    1. UCA provides a mechanism for producing a "deterministic" sort (there
    called semi-stable). See step 3.10.

    2. A "deterministic" sort is actually not needed very often; people confuse
    it with a stable sort. See http://www.unicode.org/reports/tr10/#Stability

    3. If someone did customize the UCA for numeric sorting, the difference
    between 002 and 2 could be a tertiary difference. So even without using
    3.10, they would be distinguished at level 3.
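
    The difference between a stable sort and a "deterministic" one can be
    sketched in a few lines of Python. This is only an illustration under
    stated assumptions: toy_key is a hypothetical stand-in for a collation
    key that ignores ZWNJ completely (it is not the UCA), and
    deterministic_key adds the string's own code points as a final
    tie-breaker, in the spirit of the semi-stable option in point 1.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def toy_key(s):
    """Hypothetical collation key: ZWNJ is completely ignored."""
    return s.replace(ZWNJ, "")


def deterministic_key(s):
    """Toy key plus the original string as a final code point tie-breaker."""
    return (toy_key(s), s)


a = "ab"
b = "a" + ZWNJ + "b"  # compares equal to "ab" under toy_key

# Python's sorted() is stable: items with equal keys keep their input order,
# so the result depends on how the input happened to be ordered.
stable_1 = sorted([a, b], key=toy_key)
stable_2 = sorted([b, a], key=toy_key)

# With the tie-breaker no two distinct strings compare equal, so the
# result is the same whatever the input order was.
det_1 = sorted([a, b], key=deterministic_key)
det_2 = sorted([b, a], key=deterministic_key)
```

    Both behaviors are deterministic in the everyday sense. What the
    tie-breaker removes is the dependence on input order among strings
    that the collation treats as equal.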

    > Roozbeh Pournader wrote:
    > > Well, anything that is completely ignored in collation creates problems
    > > with deterministic sorting.
    > I don't think you mean "deterministic". UCA is deterministic, it just
    > sorts many strings as equal.
    > > There are certain words in Persian, with
    > > completely different meanings, that only differ in a ZWNJ[1]. Having it
    > > ignored by default means they may appear in this or that order,
    > > based on the original order of input. I guess this is not what we want
    > > for deterministic collation.
    > >
    > > The desired behavior for ZWNJ is to be treated like punctuation:
    > > ignored in the first levels, but considered at the end. (Personal Note:
    > > write something for UTC on this.)
    > Possible. I assume that ZWNJ is ignored in UCA because that is the
    > expected behavior for many other languages. Not ignoring ZWNJ is possible
    > with a tailoring that gives it some non-zero weights.
    > Note that many languages require tailorings for at least a couple of
    > characters to follow national standards.
    > markus
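
    The "ignored in the first levels, but considered at the end" behavior
    discussed above can be sketched as a two-level sort key. This is a toy
    model, not an actual UCA tailoring: leveled_key is a hypothetical
    helper, the Latin letters stand in for Persian text, and the second
    level simply records where ZWNJ occurs.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def leveled_key(s):
    """Hypothetical two-level key.

    Level 1 ignores ZWNJ entirely; level 2 marks each character as
    ZWNJ (1) or not (0) and is consulted only when level 1 ties.
    """
    primary = s.replace(ZWNJ, "")
    late = tuple(1 if ch == ZWNJ else 0 for ch in s)
    return (primary, late)


words = ["a" + ZWNJ + "c", "a" + ZWNJ + "b", "ab"]
ordered = sorted(words, key=leveled_key)
```

    Here "ab" and the ZWNJ variant of "ab" tie at the first level and are
    separated only by the second, while the ZWNJ variant of "ac" still
    sorts after both because its first-level difference takes precedence.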
