On Jan 24, 18:17, unicode@Unicode.ORG wrote:
> Do you know where and how to get test data files for collation tests (in
> Japanese, German and French).

I'll append some German words (or short phrases), sorted according to the German standard DIN 5007.

> do you know how to create this kind of data

You can sort the attachment of this note according to the second field (starting at the colon); this will permute the records to challenge your sorting program.

> and how to check the sorted result?

Sort the data according to the 1st field (up to the colon), ignoring the stops after abbreviations (cf. below); compare the result with the original file.

Another source of pre-sorted German words is the vocabulary which is part of the official regulation of the forthcoming German spelling reform, from through . This vocabulary contains only basic word forms, and not many compound words; hence it does not contain many examples of DIN 5007's fine points, as my carefully hand-crafted :-) example data do. On the other hand, it contains many more entries, so it may serve as a performance test.

Best wishes,
  Otto Stolz

-----------------
Outline of DIN 5007

DIN 5007 requests a three-level sort. Every level takes the whole sorting field into account, from left to right, as usual.

1. The 1st level considers all variants of a basic letter equivalent, e.g. upper and lower case, accents and other diacritical marks; ligatures are considered equivalent to the pertinent sequence of basic letters (in particular sharp-S, "ß", is considered equivalent to "ss"; Icelandic Thorn, "Þ", is considered equivalent to "th").

The space goes before any other character. Special characters may be ignored for the sorting. [Hence, stops after abbreviations may be ignored, as in my example data.]
If they shall be included, all pronounced marks (such as "&") are considered equivalent and go after the space; likewise, all mute marks (such as punctuation marks) are considered equivalent and go after the pronounced marks (and, of course, before the "A").

Non-Latin letters go after the Latin letters; each foreign alphabet is treated akin to the Latin alphabet (cf. 2 paragraphs above). The relative order of several foreign alphabets is left undefined. Oddly enough :-) DIN 5007 does not tell you how to sort ideographic characters.

Numbers come after the letters; they are to be sorted in numerical order.

2. Only quasi-homonyms, i.e. entries not discriminated in level 1, are sorted in level 2. Here, upper and lower case of any letter are considered equivalent. However, level 2 discriminates amongst basic characters, ligatures, characters bearing diacritical marks, and similar variants.

Level 2 defines the following collating sequence: basic letter, ligature, German umlaut (i.e. Ä, Ö, and Ü), dot above (except in "i"), grave, macron, acute, breve, ring above, tilde, hachek, circumflex, stroke, oblique stroke, underscore, ogonek, cedilla, trema (aka dieresis), double acute -- to name just the most common marks.

3. Only quasi-homonyms, i.e. entries not discriminated in level 2, are sorted in level 3. Here, upper case goes after lower case.

You can achieve the same effect if you compute an internal sorting field from the given sorting field, then sort according to the former with any dumb sorting program, and finally remove it from the sorted records. This internal sorting field would consist of three sub-fields corresponding to the three levels outlined above. The leftmost sub-field would comprise the basic forms of the letters in the given sorting field (two letters for every ligature); the second sub-field would comprise the modifiers of the characters in the given sorting field; and the third sub-field would comprise their respective cases (lower, or upper).
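
The internal sorting field described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not part of DIN 5007: the names din5007_key, mark_rank, LIGATURES and MARK_RANK are made up for this sketch, the ligature and mark tables are incomplete, letters with a stroke are not handled, and digits are dropped together with the special characters. The three sub-fields are returned as a tuple instead of one concatenated string; Python compares tuples from left to right, which gives the same level-by-level effect.

```python
import unicodedata

# Hypothetical, incomplete table: ligature -> sequence of basic letters.
LIGATURES = {"ß": "ss", "æ": "ae", "œ": "oe", "þ": "th"}

# Level-2 ranks of combining marks, in the order given in the outline
# (0 = basic letter, 1 = ligature, 2 = German umlaut, 17 = trema).
MARK_RANK = {
    "\u0307": 3,   # dot above
    "\u0300": 4,   # grave
    "\u0304": 5,   # macron
    "\u0301": 6,   # acute
    "\u0306": 7,   # breve
    "\u030A": 8,   # ring above
    "\u0303": 9,   # tilde
    "\u030C": 10,  # hachek (caron)
    "\u0302": 11,  # circumflex
    "\u0328": 15,  # ogonek
    "\u0327": 16,  # cedilla
    "\u030B": 18,  # double acute
}


def mark_rank(base, combining):
    """Return the level-2 rank of the first recognised mark on `base`."""
    for mark in combining:
        if mark == "\u0308":
            # Two dots count as German umlaut on a, o, u; as trema elsewhere.
            return 2 if base in "aou" else 17
        if mark in MARK_RANK:
            return MARK_RANK[mark]
    return 0  # plain basic letter (or a mark this sketch does not know)


def din5007_key(field):
    """Build the three sub-fields (basic letters, modifiers, cases)."""
    basics, marks, cases = [], [], []
    for ch in unicodedata.normalize("NFC", field):
        case = 1 if ch.isupper() else 0  # level 3: upper case after lower
        low = ch.lower()
        if low in LIGATURES:
            for letter in LIGATURES[low]:  # two basic letters per ligature
                basics.append(letter)
                marks.append(1)
                cases.append(case)
            continue
        decomposed = unicodedata.normalize("NFD", low)
        base, combining = decomposed[0], decomposed[1:]
        if base != " " and not base.isalpha():
            continue  # special characters (and digits, by assumption) ignored
        basics.append(base)  # the space sorts before all letters
        marks.append(mark_rank(base, combining))
        cases.append(case)
    return "".join(basics), tuple(marks), tuple(cases)


words = ["Müller", "Muller", "Mueller", "muller"]
print(sorted(words, key=din5007_key))
# prints ['Mueller', 'muller', 'Muller', 'Müller']
```

With such a key, any ordinary sorting routine can be used unchanged, and the key is simply discarded afterwards.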

I deem it a bad idea for a sorting standard to have special rules for numbers; hence I have not included any examples of numbers in my test data.

----------------