Re: DIN 5007, Swiss Sorting

From: Alain LaBonté  (alb@sct.gouv.qc.ca)
Date: Sun Mar 12 2000 - 21:56:23 EST


At 17:06 2000-03-12 -0800, Markus Kuhn wrote:
>Michael Everson wrote on 2000-03-12 15:24 UTC:
> > At 20:32 -0800 2000-03-11, Alain wrote:
> > >Michael Everson disagrees that for English lower case should be sorted
> > >first in case of quasi homographs (ex. : august before August), based on
> > >what he deduces from the short version of the OED.
> >
> > I don't just disagree or deduce, I gave actual evidence. The "short
> > version" Alain refers to is the Concise Oxford Dictionary of Current
> > English. See http://www.egt.ie/standards/iso10646/pdf/n688.pdf for a review
> > of dictionaries.
>
>What troubles me with all the work on the international sorting standard
>is that far too much emphasis is placed on so-called "existing practice".
>Let's face it: there is no such thing. Extremely few people know or
>agree about the algorithmic details of the traditional sorting order in
>their respective locale. Very few countries have detailed formal
>standards (such as the German DIN 5007), and even in these countries
>these standards are so little known that lots of modifications could be
>made without anyone noticing. "Anyone" includes experts such as
>dictionary publishers, who tend to be not less confused than the
>ordinary person.
>
>The aim of the UCS sorting standard should be
>
> - user friendliness
> - easy to remember
> - helpful in manually locating words in huge sorted lists
> - practical
> - efficiently implementable
> - consistent and simple across many different languages and scripts
>
>and alternatives should please be discussed in these terms and not in
>terms of compatibility with this or that dictionary.
>
>Compatibility with the precise details of existing national standards is
>completely irrelevant, unless more than 0.1 % of the population in the
>respective locale are actually familiar with this practice.
>
>E.g., I have asked over a dozen French people (academics and frequent
>users of dictionaries) about the idea, very unexpected in my eyes, of
>sorting French accents in reverse order (last character most
>significant), and I have not yet found anyone (except i18n experts who
>followed the discussion here) who knew about these rules. I have not yet
>found a single person who was able to explain to me why it is a
>good idea to sort French accents in reverse at all. Is there any reason,
>apart from the fact that some dictionaries seem to do it?
>
>Perhaps, it turns out in the end to just have originated as a
>programming error in the software of some dictionary publisher, and now
>this is kept on forever without ever being questioned again.
>
>Again: if you argue for A < a versus a < A, then please explain to me why
>one is better than the other, and please do so independently of existing
>practice.

[Alain] I think, Markus, that you should be satisfied with my answer, as
there is nothing more I can tell you than what follows. It is in fact
simple in principle.

The fact is that it is so simple that people do not need to know the
details. They learned their alphabet. Full stop. People in France do not
know the rules for ordering quasi-homographs in dictionaries because, when
they look these words up, they find them grouped together and have no
problem with the list, regardless of the fine details of this order.

But computers have to be told that an a is an A is an À and sometimes an Å.
That last equivalence holds in most countries, but not in Scandinavia, where
Å is a distinct letter sorted after Z. This is what people expect, i.e. they
do not expect exactly the same thing in all countries. For the same script,
many languages are compatible sort-wise and others are not.
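
To make the point about Å concrete, here is a minimal Python sketch of a
first-level comparison key with one locale switch. The function name
`primary_key` and the flag `a_ring_is_letter` are illustrative assumptions,
not anything defined by ISO/IEC 14651, and a real implementation would take
its weights from the standard's table rather than from code point values.

```python
import unicodedata

def primary_key(word, a_ring_is_letter=False):
    """First-level key: base letters only, ignoring accents and case.

    With a_ring_is_letter=True (a hypothetical Scandinavian-style
    tailoring), Å/å is treated as its own letter sorting after z.
    """
    key = []
    for ch in unicodedata.normalize("NFC", word):
        if a_ring_is_letter and ch in "åÅ":
            key.append(ord("z") + 1)          # distinct letter after z
        else:
            # Strip combining marks and fold case: 'À' -> 'a', 'Å' -> 'a'.
            base = unicodedata.normalize("NFD", ch)[0].lower()
            key.append(ord(base))
    return key

words = ["Zorn", "Ål", "Alm"]
print(sorted(words, key=primary_key))
print(sorted(words, key=lambda w: primary_key(w, a_ring_is_letter=True)))
```

Without the tailoring, "Ål" is filed among the A words; with it, "Ål" moves
after "Zorn". The same data therefore yields two different but equally
legitimate orders, depending on the declared delta.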

ISO/IEC 14651 gives a reasonable template, but as we know there is no such
thing as a default: a delta needs to be declared. The template may be used
as a common method, but it will need adjustments to cater for local needs.

Now that we have solved the issue of simplicity for humans, let's continue.

Computers need to sort an a before an A, or the reverse, to make the sort
predictable. They need to distinguish quasi-homographs (words which
differ only because they have accents, case differences or embedded special
characters).
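
One way to picture this is a key with several levels, where later levels
only break ties left by earlier ones. The sketch below is a simplified
illustration in Python, not the ISO/IEC 14651 algorithm: the names
`split_levels`, `sort_key` and `lower_first` are assumptions, accents are
compared from the start of the word, embedded special characters are not
handled, and each accented letter is assumed to have a precomposed form.

```python
import unicodedata

def split_levels(word):
    """Return (base letters, accent marks, case flags) for one word."""
    base, accents, case = [], [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        base.append(decomposed[0].lower())      # level 1: the letter itself
        accents.append(decomposed[1:])          # level 2: '' if unaccented
        case.append(1 if ch.isupper() else 0)   # level 3: 0 = lower case
    return "".join(base), tuple(accents), tuple(case)

def sort_key(word, lower_first=True):
    """Multi-level key; lower_first picks a < A or A < a."""
    base, accents, case = split_levels(word)
    if not lower_first:
        case = tuple(1 - c for c in case)
    return (base, accents, case)

print(sorted(["August", "august"], key=sort_key))
print(sorted(["August", "august"],
             key=lambda w: sort_key(w, lower_first=False)))
```

Whichever direction is chosen for case, the result no longer depends on the
order in which the words were supplied, which is the predictability that
matters to the machine.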

That distinction is based on existing practice, with dictionary practice
serving as the primary sort order. Other sorts are more complex and
sometimes even context-dependent.

Nothing to be troubled about, at the end, in my humble opinion.

Thanks for asking the question. I believe all the principles you stated are
respected in ISO/IEC 14651.

Alain LaBonté
Charlesbourg

PS: as for why accented quasi-homographs are sorted by treating accents at
the end of a word as more important than those at the beginning, it is
easy: the rule says that accents are not considered for sorting, but that
whenever there is homography at the beginning of words, it is what follows
that determines the order:

In cote and coter, homography exists up to the r of the second word, so it
is the r that determines ordering. In cote and coté, it is the é of the
second word that determines ordering. The same principle was applied to côte
and coté, and so on.
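
The effect of this rule can be sketched in Python by comparing the accent
level from the end of the word instead of from the start. The helper names
`accent_levels`, `forward_key` and `french_key` are illustrative
assumptions; this is a simplified model, not the table-driven method of
ISO/IEC 14651, and it ignores case and special characters.

```python
import unicodedata

def accent_levels(word):
    """Split a word into base letters and a per-letter list of accents."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        base.append(decomposed[0].lower())
        accents.append(decomposed[1:])   # '' when the letter is unaccented
    return "".join(base), accents

def forward_key(word):
    """Accents compared left to right (first accent most significant)."""
    base, accents = accent_levels(word)
    return (base, accents)

def french_key(word):
    """Accents compared right to left (last accent most significant)."""
    base, accents = accent_levels(word)
    return (base, accents[::-1])

words = ["coter", "côté", "coté", "côte", "cote"]
print(sorted(words, key=forward_key))
print(sorted(words, key=french_key))
```

With the forward key, coté is filed before côte; with the reversed key, côte
comes before coté because the final é outweighs the earlier ô. In both cases
coter follows all four, since the base letters are compared first.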

Furthermore, in French there are in many instances variations in
accentuation within the word, but as a general rule the accent at the end
of the word is the more significant one (if accents are not shown on upper
case -- bad practice, but it has existed -- it was never a good idea to
remove the ending accent, which really carries meaning, the others mainly
giving pronunciation hints).

For case, French has no real proven logical practice, and it is debatable
whether English has one, but German has one, which is very well described
in Duden dictionaries, and which is not new, and very well established
(lower case comes before upper case in case of quasi-homography).


