Re: data for collation tests

From: Alain LaBont/e'/ ([email protected])
Date: Sun Feb 02 1997 - 14:39:32 EST


At 16:47 1997-1-24 -0800, Xiu Lu wrote:
>Do you know where and how to get test data files for collation tests (in
>Japanese, German and French)? I have a program that reads a test data
>file, collates it, and then outputs the sorted result to a file. But I do
>not have proper data to test this program.
>
>***********************************************************************
>* Xiu Lu 415-937-4595 (tel) *
>* Internationalization, Server products [email protected] *
>* Netscape Communications Corporation http://home.netscape.com *
>***********************************************************************

Sorry to have taken so long to answer; I was (and still am) swamped with
messages and to-dos. Here is the benchmark of the CAN/CSA Z243.4.1 Canadian
ordering standard. The first list is the unsorted one, to be used as input
for initial testing. The second is the result prescribed by the rules of
the standard, sorted as in the major French dictionaries. There are also a
few extra non-French words.

Our rules also sort English correctly according to those English
dictionaries that have written, established rules. That said, Michael
Everson will tell you that English speakers learn that upper case sorts
before lower case, and that his reference, the Concise Oxford English
Dictionary, does this in practice. I could not verify this in the complete
Oxford English Dictionary that I have at home (325,000 words), as it has no
specific rules for case. In Canada we chose to do what the English
dictionaries that write down their rules do, which also harmonizes with
German, which sorts lower case before upper case. French dictionaries have
no rule for case ordering but have precise, albeit arcane, rules for
accents, which we respect.

I had to use quoted-printable (QP) encoding (unfortunately; this encoding
should not exist, and everybody should move to 8-bit MIME) to make sure that
character bits would not be stolen by criminally behaving servers [such as
the one used for this request ):]. Sorry about this.

First list (unsorted):

ou
lésé
péché
vice-président
9999
O�
haïe
coop
caennais
lèse
dû
air@@@
côlon
bohème
gêné
lamé
pêche
LÈS
vice versa
C.A.F.
cæsium
resumé
Bohémien
co-op
pêcher
les
CÔTÉ
résumé
Ålborg
cañon
du
haie
pécher
Mc Arthur
cote
colon
l'âme
resume
élève
Canon
lame
Bohême
0000
relève
gène
casanier
élevé
COTÉ
relevé
Grossist
vice-presidents' offices
Copenhagen
côte
McArthur
Mc Mahon
Aalborg
Größe
vice-president's offices
c�libat
PÉCHÉ
COOP
@@@air
VICE-VERSA
gêne
CO-OP
révélé
révèle
çà et là
Noël
île
aïeul
�le d'Orléans
nôtre
notre
août
NOËL
@@@@@
L'Haÿ-les-Roses
CÔTE
COTE
côté
coté
aide
air
vice-president
modelé
MODÈLE
maçon
MÂCON
pèche
pêché
pechère
péchère

Second list (sorted correctly):

@@@@@
0000
9999
Aalborg
aide
aïeul
air
@@@air
air@@@
Ålborg
août
bohème
Bohême
Bohémien
caennais
cæsium
çà et là
C.A.F.
Canon
cañon
casanier
c�libat
colon
côlon
coop
co-op
COOP
CO-OP
Copenhagen
cote
COTE
côte
CÔTE
coté
COTÉ
côté
CÔTÉ
du
dû
élève
élevé
gène
gêne
gêné
Größe
Grossist
haie
haïe
île
�le d'Orléans
lame
l'âme
lamé
les
LÈS
lèse
lésé
L'Haÿ-les-Roses
MÂCON
maçon
McArthur
Mc Arthur
Mc Mahon
MODÈLE
modelé
Noël
NOËL
notre
nôtre
ou
O�
pèche
pêche
péché
PÉCHÉ
pêché
pécher
pêcher
pechère
péchère
relève
relevé
resume
resumé
résumé
révèle
révélé
vice-president
vice-président
vice-president's offices
vice-presidents' offices
vice versa
VICE-VERSA
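
To use the two lists as a test, a program only has to sort the first list
and compare the result, line by line, with the second. A minimal Python
sketch of such a check follows; the function name first_mismatch and its
key parameter are hypothetical, and the collation under test is assumed to
be expressible as a sort key.

```python
from itertools import zip_longest


def first_mismatch(unsorted_words, expected_words, key=None):
    """Sort with the given key and report the first line that differs.

    Returns None when the sorted output equals the expected list,
    otherwise a tuple (index, got, want).
    """
    actual = sorted(unsorted_words, key=key)
    # zip_longest also catches lists of different lengths (missing -> None).
    for index, (got, want) in enumerate(zip_longest(actual, expected_words)):
        if got != want:
            return index, got, want
    return None


# A few unaccented words from the lists sort correctly even by code point.
ok = first_mismatch(["du", "colon", "aide", "air"],
                    ["aide", "air", "colon", "du"])

# Plain code-point order puts upper case first, so the check fails here.
bad = first_mismatch(["cote", "COTE"], ["cote", "COTE"])
print(ok, bad)
```

Reporting only the first mismatch keeps the output short; once one line is
out of place, the lines after it usually differ as well.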

Best Regards.

Alain LaBonté
Québec

Project Editor, CAN/CSA Z243.4.1 (Canadian Ordering Standard for en and fr)
Project Editor, ISO/IEC 14651 (Ordering standard for UCS/UNICODE)



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:33 EDT