Re: data for collation tests

From: Alain LaBonté (alb@sct.gouv.qc.ca)
Date: Fri Feb 07 1997 - 11:00:58 EST


At 17:41 97-02-06 -0800, Mark Davis wrote:
>unicode@Unicode.ORG wrote:
>>
>> unicode@Unicode.ORG writes:
>>
>> > At 16:47 1997-1-24 -0800, Xiu Lu wrote:
>> > >Do you know where and how to get test data files for collation tests (in
>> > >Japanese, German and French)? I have a program that can read a test data
>> > >file, collate it, and then output the sorted result to a file. But I do
>> > >not have proper data to test this program.
>>
>> A Danish test case can be found at ftp://dkuug.dk/i18n/sort.ds377
>> (ISO/IEC 8859-1 encoded)
>>
>> Keld
>
>

[Mark] :
>Looking at your list, I was surprised about a couple of things:
>
>1. Apparently, as opposed to English, Danish sorts space and hyphen as
>separate characters, not as ignorable secondaries (e.g. ignored on first
>pass). For example, in English one sorts as in the following:

[Alain] :
Hurrah! At least there are supporters for what I have been preaching for
years... I have always said that if, in a field, a space is sorted with its
own distinct weight at level 1, people (at least English-, French-, and
German-speakers, and a lot of others) would not intuitively find what they
are looking for [a problem people have all the time looking for my name when
it is written in a list with an embedded space as "La Bonté"]...

Now TC37, NISO (ANSI), perhaps the European prenorm (I'm not sure any more)
and a few Scandinavian standards do sort space as a distinct *letter*... To
me and to Canadian standard CAN/CSA Z243.4.1 (and to French and English [and
many others'] dictionaries), that is a mistake... To separate family names
from first names, one should use different fields, not spaces... However, in
Canadian standard CAN/CSA Z243.4.1, those who really mean it can use NBSP,
a special space whose weight comes before A... We do not recommend it,
though; but as it is difficult to enter, it is there for those who want to
make the effort and really mean it!

[Mark's list, which I like, as it is intuitive, even if the Canadian
standard would sort space before dash, all other things apart, in case of
homography (: ]:

>black
>black-and-blue
>black and white
>blackbird
>black bird
>black-bird
>blackbirds
>black birds
>black-birds
>blackbox
>black-eyed pea
>blackfish
>black lung
>

[Mark's comment] :
>and NOT as:
>
>black and white
>black bird
>black birds
>black lung
>black-and-blue
>black-bird
>black-birds
>black-eyed pea
>black
>blackbird
>blackbirds
>blackbox
>blackfish
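
The first of the two orderings above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the table
of any standard: the function name ignorable_key and the tie-break weights
(space before hyphen, and "no punctuation at all" before both) are
hypothetical choices made so that the result matches Mark's first list.

```python
# Assumed tie-break weights: space sorts before hyphen when letters are equal.
IGNORABLE = {" ": 1, "-": 2}

def ignorable_key(s):
    # First pass: compare letters only, skipping space and hyphen.
    primary = "".join(ch for ch in s.lower() if ch not in IGNORABLE)
    # Later pass: break ties on where the ignorable characters sit and
    # which ones they are. A string with none gets an empty tuple, which
    # sorts before any non-empty tuple.
    secondary = tuple((i, IGNORABLE[ch]) for i, ch in enumerate(s)
                      if ch in IGNORABLE)
    return (primary, secondary)

mark_order = [
    "black", "black-and-blue", "black and white",
    "blackbird", "black bird", "black-bird",
    "blackbirds", "black birds", "black-birds",
    "blackbox", "black-eyed pea", "blackfish", "black lung",
]

# Start from plain code-point order so the key has real work to do.
result = sorted(sorted(mark_order), key=ignorable_key)
print("\n".join(result))
```

The point of the two-part key is that space and hyphen never influence the
first comparison; they only decide between strings whose letters are
identical.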

[Alain] :
Even though I don't find this list intuitive, there might also be another
problem, not due to their standard, as Arnold Winkler just pointed out to me:
isn't the position of "black" a mistake here? Shouldn't it come first under
any standard?
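
A quick way to check the point about "black" is to give space and hyphen
their own first-level weights and sort Mark's second list. The sketch below
assumes a hypothetical weight table (FIRST_LEVEL_ORDER) in which space and
hyphen both come before every letter; it is not taken from any of the
standards named in this thread.

```python
# Assumed first-level order: space, then hyphen, then the letters a-z.
FIRST_LEVEL_ORDER = " -abcdefghijklmnopqrstuvwxyz"

def distinct_key(s):
    # Every character, including space and hyphen, contributes a
    # first-level weight. Only characters in the table are supported.
    return [FIRST_LEVEL_ORDER.index(ch) for ch in s.lower()]

second_list = [
    "black and white", "black bird", "black birds", "black lung",
    "black-and-blue", "black-bird", "black-birds", "black-eyed pea",
    "black", "blackbird", "blackbirds", "blackbox", "blackfish",
]

result = sorted(second_list, key=distinct_key)
print("\n".join(result))
```

Because a key that is a prefix of another key compares as smaller, "black"
moves to the top under this scheme, while the rest of the list keeps the
order Mark gave.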

[Mark] :
>Whereas in your Danish example, you have the latter approach:
>
>NIELS-JØRGEN
>NIELS JØRGEN
>NIELSEN
>
>On the other hand, "." is an ignorable secondary in both English and
>Danish, as in your example.

[Alain] :
Which appears to be OK...

[Mark] :
>DSB
>D.S.B.
>DSC

[Mark] :
>2. In English dictionaries as a rule, uppercase comes before lowercase,
>as in:
>
>polish
>Polish

[Alain] :
You mean lower before upper? That is indeed what was found, when we made the
Canadian standard, in those English dictionaries which state their rules
explicitly. However, Michael Everson will say that all young English speakers
are taught the opposite when they learn English, and that his Concise Oxford
English Dictionary sorts Polish before polish (contrary to the American
Webster's [at least the Collegiate Dictionary I have, which does indeed sort
polish before Polish]; and in my complete edition of the OED, there seems to
be no rule at all!)

To harmonize English with German practice (and with French, which has
precise rules for accents but none for case), we chose, in Canadian standard
CAN/CSA Z243.4.1, the order that Mark shows in his example.

[Mark] :
>Apparently, this is reversed in Danish dictionaries, for you have:
>
>Karl
>karl

[Alain] :
That, I know, is correct. Danish sorts upper case first. That is their
choice, as logical as the other one. Anyway, Danish requires a different
ordering.
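
The two case conventions discussed above differ only in a late tie-break, so
both can be sketched with one key function and a switch. This is a
simplified illustration: the name case_key and the upper_first parameter are
hypothetical, and the accent level that real tables place before case is
left out.

```python
def case_key(word, upper_first=False):
    # First level: compare the letters with case removed.
    primary = word.casefold()
    # Last level: one flag per character, where 0 sorts before 1.
    if upper_first:
        flags = tuple(0 if ch.isupper() else 1 for ch in word)
    else:
        flags = tuple(1 if ch.isupper() else 0 for ch in word)
    return (primary, flags)

# Lower case first, as in Mark's "polish / Polish" example.
lower_first = sorted(["Polish", "polish"], key=case_key)

# Upper case first, as in the Danish "Karl / karl" example.
danish = sorted(["karl", "Karl"], key=lambda w: case_key(w, upper_first=True))

print(lower_first)
print(danish)
```

Since case is consulted only when the letters are otherwise equal, switching
the convention never reorders words that differ in their letters.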

Alain LaBonté
Québec

Project Editor, ISO/IEC 14651 Ordering Standard


