On Jan 24, 18:17, unicode@Unicode.ORG wrote:
> Do you know where and how to get test data files for collation tests (in
> Japanese, German and French).

I'll append some German words (or short phrases), sorted according to
the German standard DIN 5007.

> do you know how to create this kind of data

You can sort the attachment of this note according to the second field
(starting at the colon); this will permute the records to challenge your
sorting program.
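
A minimal sketch of that permutation step, in Python. It assumes every
record is a single line of the form "first field:second field", and it
compares the second field by plain code point order; the function name
permute_records is made up for this illustration.

```python
def permute_records(lines):
    """Reorder records by their second field (the text from the first colon on)."""
    def second_field(line):
        # Assumption: the first colon in a record separates the two fields.
        _, colon, rest = line.partition(":")
        return colon + rest

    # Plain code point comparison is enough here: the goal is only to
    # scramble the records relative to their DIN 5007 order.
    return sorted(lines, key=second_field)
```

Feeding the lines of the attachment through permute_records should give
you an input file that is no longer in DIN 5007 order.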

> and how to check the sorted result?

Sort the data according to the 1st field (up to the colon), ignoring the
stops after abbreviations (cf. below); compare the result with the original
file.
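
The check could be sketched like this in Python. The parameter
collate_key is hypothetical: it stands for whatever key function (or
wrapper around your sorting program) you want to test. The record layout
"first field:second field" is again an assumption.

```python
def first_field(line):
    """Return the first field of a record, with stops removed."""
    # Assumption: the first field is everything up to the first colon.
    # Stops after abbreviations are ignored, so they are dropped here.
    return line.partition(":")[0].replace(".", "")


def check_sort(original, collate_key):
    """Return the index of the first mismatching record, or None if all match."""
    resorted = sorted(original, key=lambda line: collate_key(first_field(line)))
    for index, (expected, actual) in enumerate(zip(original, resorted)):
        if expected != actual:
            return index
    return None
```

Python's sorted() is stable, so records whose keys compare equal keep
their original relative order and do not cause spurious mismatches.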

Another source of pre-sorted German words is the vocabulary which is part
of the official regulation of the forthcoming German spelling reform,
from    <http://www.ids-mannheim.de/grammis/reform/wort-a.html>
through <http://www.ids-mannheim.de/grammis/reform/wort-z.html>.
This vocabulary contains only basic word forms, and not many compound words;
hence it does not contain many examples of DIN 5007's fine points,
as my carefully hand-crafted :-) example data do. On the other hand,
it contains many more entries, so it may serve as a performance test.

Best wishes,
   Otto Stolz

-----------------

Outline of DIN 5007

DIN 5007 specifies a three-level sort. Every level takes the whole
sorting-field into account, from left to right, as usual.

1. The 1st level considers all variants of a basic letter equivalent, e.g.
   upper and lower case, accents and other diacritical marks; ligatures
   are considered equivalent to the pertinent sequence of basic letters
   (in particular sharp-S, "ß", is considered equivalent to "ss"; Icelandic
   Thorn, "Þ", is considered equivalent to "th").

   The space goes before any other character.

   Special characters may be ignored for the sorting. [Hence, stops after
   abbreviations may be ignored, as in my example data.] If they are to be
   included, all pronounced marks (such as "&") are considered equivalent
   and go after the space; likewise, all mute marks (such as punctuation
   marks) are considered equivalent and go after the pronounced marks
   (and, of course, before the "A").

   Non-Latin letters go after the Latin letters; each foreign alphabet
   is treated akin to the Latin alphabet (cf. 2 paragraphs above). The
   relative order of several foreign alphabets is left undefined. Oddly
   enough :-) DIN 5007 does not tell you how to sort ideographic
   characters.

   Numbers come after the letters; they are to be sorted in numerical
   order.

2. Only quasi-homonyms, i.e. entries not discriminated in level 1, are
   sorted in level 2.

   Here, upper and lower case of any letter are considered equivalent.
   However, level 2 discriminates amongst basic characters, ligatures,
   characters bearing diacritical marks, and similar variants.

   Level 2 defines the following collating sequence: basic letter,
   ligature, German umlaut (i.e. Ä, Ö, and Ü), dot above (except in "i"),
   grave, macron, acute, breve, ring above, tilde, hachek, circumflex,
   stroke, oblique stroke, underscore, ogonek, cedilla, trema (aka
   dieresis), double acute -- to name just the most common marks.

3. Only quasi-homonyms, i.e. entries not discriminated in level 2, are
   sorted in level 3.

   Here, upper case goes after lower case.

You can achieve the same effect if you compute an internal sorting-field
from the given sorting-field, then sort according to the former, with any
dumb sorting program, and finally remove it from the sorted records. This
internal sorting field would consist of three sub-fields corresponding to
the three levels outlined above. The leftmost sub-field would comprise the
basic forms of the letters in the given sorting field (two letters for
every ligature); the second sub-field would comprise the modifiers of
the characters in the given sorting field; and the third sub-field would
comprise their respective cases (lower, or upper).
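
A rough Python sketch of such an internal sorting-field follows. It is
deliberately incomplete and not a full DIN 5007 implementation: the names
din5007_key, LIGATURES and MARK_ORDER are invented for this example, the
tables cover only a few ligatures and those marks that Unicode can
decompose, and numbers and special characters get no special treatment
(only stops are dropped).

```python
import unicodedata

# Hypothetical, incomplete table: ligature -> pertinent basic letters.
LIGATURES = {"ß": "ss", "æ": "ae", "œ": "oe", "þ": "th"}

# Combining marks in the level-2 order quoted above (subset only):
# dot above, grave, macron, acute, breve, ring above, tilde, hachek,
# circumflex, ogonek, cedilla, trema, double acute.
MARK_ORDER = ["\u0307", "\u0300", "\u0304", "\u0301", "\u0306", "\u030A",
              "\u0303", "\u030C", "\u0302", "\u0328", "\u0327", "\u0308",
              "\u030B"]


def din5007_key(field):
    """Build a (letters, modifiers, cases) key for a dumb sorting program."""
    base, marks, cases = [], [], []
    for ch in unicodedata.normalize("NFC", field):
        if ch == ".":
            continue  # stops after abbreviations are ignored
        case = 1 if ch.isupper() else 0  # level 3: upper after lower
        lower = ch.lower()
        if lower in LIGATURES:
            letters, rank = LIGATURES[lower], 1  # ligature
        else:
            decomposed = unicodedata.normalize("NFD", lower)
            letters, mark = decomposed[0], decomposed[1:2]
            if mark == "\u0308" and letters in "aou":
                rank = 2  # German umlaut
            elif mark in MARK_ORDER:
                rank = 3 + MARK_ORDER.index(mark)
            elif mark:
                rank = 3 + len(MARK_ORDER)  # unknown mark: sorted last
            else:
                rank = 0  # basic letter
        base.append(letters)
        # One modifier and one case entry per basic letter, so that the
        # sub-fields of level-1 equivalent entries stay aligned.
        marks.extend([rank] * len(letters))
        cases.extend([case] * len(letters))
    return ("".join(base), tuple(marks), tuple(cases))


words = ["Müll", "arm", "Maße", "Arm", "Masse", "Mull"]
print(sorted(words, key=din5007_key))
# prints ['arm', 'Arm', 'Masse', 'Maße', 'Mull', 'Müll']
```

The key is a tuple, so the second sub-field is consulted only when the
first compares equal, and the third only when the first two do. That
matches the rule that levels 2 and 3 apply to quasi-homonyms only.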

I deem special rules for numbers a bad idea in a sorting standard;
hence I have not included any examples of numbers in my test data.

----------------