Re: data for collation tests

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Fri Feb 07 1997 - 07:42:22 EST


Mark Davis writes:

> unicode@Unicode.ORG wrote:
> >
> > unicode@Unicode.ORG writes:
> >
> > > At 16:47 1997-1-24 -0800, Xiu Lu wrote:
> > > >Do you know where and how to get test data files for collation tests (in
> > > >Japanese, German and French)? I have a program that reads a test data
> > > >file, collates it, and then outputs the sorted result to a file. But I do
> > > >not have proper data to test this program.
> >
> > A Danish test case can be found at ftp://dkuug.dk/i18n/sort.ds377
> > (ISO/IEC 8859-1 encoded)
> >
> > Keld
>
>
> Looking at your list, I was surprised about a couple of things:
>
> 1. Apparently, as opposed to English, Danish sorts space and hyphen as
> separate characters, not as ignorable secondaries (i.e. ignored on the first
> pass). For example, in English one sorts as in the following:
>
> black
> black-and-blue
> black and white
> blackbird
> black bird
> black-bird
> blackbirds
> black birds
> black-birds
> blackbox
> black-eyed pea
> blackfish
> black lung
>
> and NOT as:
>
> black and white
> black bird
> black birds
> black lung
> black-and-blue
> black-bird
> black-birds
> black-eyed pea
> black
> blackbird
> blackbirds
> blackbox
> blackfish
>
>
> Whereas in your Danish example, you have the latter approach:
>
> NIELS-JØRGEN
> NIELS JØRGEN
> NIELSEN
>
> On the other hand, "." is an ignorable secondary in both English and
> Danish, as in your example.
>
> DSB
> D.S.B.
> DSC
>
> 2. In English dictionaries as a rule, lowercase comes before uppercase,
> as in:
>
> polish
> Polish
>
> Apparently, this is reversed in Danish dictionaries, for you have:
>
> Karl
> karl
>
> Mark Davis

Yes, Mark, you are correct in your observations.
These are intentional, and as specified in the Danish
Standard DS 377.
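
As a rough illustration of the two approaches discussed above, here is
a minimal Python sketch of sort keys for each. It is not taken from
DS 377 or from any collation library. The names ignorable_key,
separate_key, case_level and IGNORABLE are hypothetical, the input is
assumed to be ASCII, and the relative rank of space and hyphen is an
assumption that follows the English lists quoted above, not the
standard. The case tie-break follows the quoted examples (polish
before Polish, Karl before karl).

```python
# Illustrative sketch only: not DS 377 and not a real collation library.
# Assumes ASCII input; the ranks given to space, hyphen and "." are assumptions.

IGNORABLE = {" ": 0, "-": 1, ".": 2}  # hypothetical tie-break ranks


def case_level(text, upper_first):
    """Last-pass key: one flag per letter, where 0 sorts before 1."""
    return tuple(int(ch.isupper() != upper_first) for ch in text if ch.isalpha())


def ignorable_key(text, upper_first=False):
    """Space, hyphen and full stop are skipped on the first pass."""
    primary = "".join(ch.lower() for ch in text if ch not in IGNORABLE)
    # Second pass: position and rank of every skipped character.
    secondary = tuple(
        (i, IGNORABLE[ch]) for i, ch in enumerate(text) if ch in IGNORABLE
    )
    return (primary, secondary, case_level(text, upper_first))


def separate_key(text, upper_first=True):
    """Space and hyphen are ordinary characters that sort before letters
    (plain ASCII order); only the full stop is skipped on the first pass."""
    primary = "".join(ch.lower() for ch in text if ch != ".")
    # Second pass: positions of the skipped full stops.
    secondary = tuple(i for i, ch in enumerate(text) if ch == ".")
    return (primary, secondary, case_level(text, upper_first))


words = ["black lung", "black-bird", "blackbox", "black bird", "blackbird"]

print(sorted(words, key=ignorable_key))
# ['blackbird', 'black bird', 'black-bird', 'blackbox', 'black lung']

print(sorted(words, key=separate_key))
# ['black bird', 'black lung', 'black-bird', 'blackbird', 'blackbox']

print(sorted(["DSC", "D.S.B.", "DSB"], key=separate_key))
# ['DSB', 'D.S.B.', 'DSC']
```

A production implementation would use per-character weight tables
rather than the ad hoc tuples used here.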

Regards
Keld Simonsen


