On Jan 27, 9:36, Mark Davis wrote:
> This message was empty.

I had sent a lengthy, three-part contribution to unicode@Unicode.ORG, and (as a cc:) to the original poster. The latter has acknowledged receipt of the whole message. I have not checked the version unicode@Unicode.ORG has redistributed, though. Maybe this service has not processed the MIME attachment properly.

In case you have not received my recent posting, you will find it below, in a single stream, without MIME attachments. (Though, of course, the whole message will be MIME transfer-encoded as Quoted-Printable, for obvious reasons.) I have removed two lines from the test cases that will no longer be needed.

If you have already seen my contribution, please ignore this repetition.

On Jan 24, 18:17, unicode@Unicode.ORG wrote:
> Do you know where and how to get test data files for collation tests (in
> Japanese, German and French).

I'll append some German words (or short phrases), sorted according to the German standard DIN 5007.

> do you know how to create this kind of data

You can sort the attachment of this note according to the second field (starting at the colon); this will permute the records to challenge your sorting program.

> and how to check the sorted result?

Sort the data according to the 1st field (up to the colon), ignoring the stops after abbreviations (cf. below); compare the result with the original file.

Another source of pre-sorted German words is the vocabulary which is part of the official regulation of the forthcoming German spelling reform, from through . This vocabulary contains only basic word forms, and not many compound words; hence it does not contain many examples of DIN 5007's fine points, as my carefully hand-crafted :-) example data do. On the other hand, it contains many more entries, so it may serve as a performance test.

Best wishes,
  Otto Stolz

-----------------
Outline of DIN 5007

DIN 5007 requests a three-level sort.
Every level takes the whole sorting field into account, from left to right, as usual.

1. The 1st level considers all variants of a basic letter equivalent, e.g. upper and lower case, accents and other diacritical marks; ligatures are considered equivalent to the pertinent sequence of basic letters (in particular sharp-S, "ß", is considered equivalent to "ss"; Icelandic Thorn, "Þ", is considered equivalent to "th").

   The space goes before any other character. Special characters may be ignored for the sorting. [Hence, stops after abbreviations may be ignored, as in my example data.] If they shall be included, all pronounced marks (such as "&") are considered equivalent and go after the space; likewise, all mute marks (such as punctuation marks) are considered equivalent and go after the pronounced marks (and, of course, before the "A").

   Non-Latin letters go after the Latin letters; each foreign alphabet is treated akin to the Latin alphabet (cf. 2 paragraphs above). The relative order of several foreign alphabets is left undefined.

   Oddly enough :-) DIN 5007 does not tell you how to sort ideographic characters.

   Numbers come after the letters; they are to be sorted in numerical order.

2. Only quasi-homonyms, i.e. entries not discriminated in level 1, are sorted in level 2. Here, upper and lower case of any letter are considered equivalent. However, level 2 discriminates amongst basic characters, ligatures, characters bearing diacritical marks, and similar variants.

   Level 2 defines the following collating sequence: basic letter, ligature, German umlaut (i.e. Ä, Ö, and Ü), dot above (except in "i"), grave, macron, acute, breve, ring above, tilde, hachek, circumflex, stroke, oblique stroke, underscore, ogonek, cedilla, trema (aka dieresis), double acute -- to name just the most common marks.

3. Only quasi-homonyms, i.e. entries not discriminated in level 2, are sorted in level 3. Here, upper case goes after lower case.

You can achieve the same effect if you compute an internal sorting field from the given sorting field, then sort according to the former, with any dumb sorting program, and finally remove it from the sorted records. This internal sorting field would consist of three sub-fields corresponding to the three levels outlined above. The leftmost sub-field would comprise the basic forms of the letters in the given sorting field (two letters for every ligature); the second sub-field would comprise the modifiers of the characters in the given sorting field; and the third sub-field would comprise their respective cases (lower, or upper).

I deem special rules for numbers a bad idea in a sorting standard; hence I have not included any examples of numbers with my test data.

----------------
Test cases, sorted according to DIN 5007:

arg : wicked; malicious
ärger (comp. of "arg") : worse; more malicious
Ärger : annoyance; anger
ärgern : to annoy
arglos : unsuspecting; innocent
Aspirant : candidate
Ass. (Assessor) : apprentice teacher or judge
aß (praet. ind. of "essen") : [I / he] ate
Aß (alternative spelling of "As") : ace
Assel : slater, wood-louse
Ast : limb
Augiasstall : the Augean stables
Äuglein : little eye; little bud
Augment : augment
Base : [female] cousin
baß (archaic; poetic) : well; very
Baß : bass
Bast : phloem
Busen : bosom; breast; bay
Buße : atonement; fine
Bussen (dat. pl. of "Bus") : [to the] buses
Bußen (nom. pl. of "Buße") : atonements; fines
Busserl : kiss
es sei denn, daß : unless
Esel : donkey
esse (pres. conj. of "essen") : eat (e.g. in indirect speech)
Esse : chimney; forge
Eßecke : eating place
essen : to eat
Essen : Essen (town)
Essenszeit : meal-time
Essenz : essence
Estland : Estonia (state)
Fusel : cheap spirits
Fuß : foot
Füße (pl. of "Fuß") : feet
Fussel : fluff
fusseln : to shed fluff
füßeln : to play footsie [under the table]
fußen : to be based [on]
Füssen : Füssen (town)
Füßen (dat. pl. of "Fuß") : [to the] feet
in Massen : in large numbers
in Maßen : moderately
Masern : measles
Mass. (Massachusetts) : Massachusetts (state)
Maß : measure
Masse : mass
Massé : (particular billiard stroke)
Maße (pl. of "Maß") : measures
mäße (pres. conj. of "messen") : take measure (e.g. in indirect speech)
Massen- : mass; wholesale; bulk (in compound nouns)
massig : massive
mäßig : moderate; modest
Miss. (Mississippi) : Mississippi (state)
Miß : Miss
Passe : yoke (of dress)
passé : over, gone
Schlagerforderung : claim, demanded in a pop song
Schlagerförderung : promotion of pop music
Schlägerforderung : demand of a hooligan
Schlägerförderung : promotion of hooliganism :-)
Schurz : apron
Schürze : apron
Schussel : fidget; distracted person
Schüssel : bowl
Schuster : shoemaker
Tropfen : drop
troß! (obsolete) : (?)
Troß : baggage train
Trosse : hawser
Trost : comfort; solace
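
The internal sorting field with its three sub-fields might be sketched in Python roughly as follows. This is an illustration only, not part of DIN 5007 or of the original posting: the function name din5007_key, the numeric ranks, and the use of Unicode NFD decomposition to separate a letter from its diacritical mark are all assumptions made for this sketch. It ignores every special character except the space, treats only "ß" and "Þ" as ligatures, looks at the first mark of a letter only, and does not handle numbers, stroke letters, or non-Latin alphabets. The headwords are given without their parenthetical annotations.

```python
import unicodedata

# Ligatures named in the outline; each expands to two basic letters.
LIGATURES = {"ß": "ss", "þ": "th"}

# Level-2 ranks in the order quoted in the outline (hypothetical numbers):
# 0 = basic letter, 1 = ligature, 2 = German umlaut, 17 = trema.
# Stroke and oblique stroke (12, 13) are omitted: NFD does not split them.
MARK_RANK = {
    "\u0307": 3,   # dot above
    "\u0300": 4,   # grave
    "\u0304": 5,   # macron
    "\u0301": 6,   # acute
    "\u0306": 7,   # breve
    "\u030A": 8,   # ring above
    "\u0303": 9,   # tilde
    "\u030C": 10,  # hachek
    "\u0302": 11,  # circumflex
    "\u0332": 14,  # underscore
    "\u0328": 15,  # ogonek
    "\u0327": 16,  # cedilla
    "\u030B": 18,  # double acute
}
DIAERESIS = "\u0308"  # umlaut on a, o, u; trema on any other letter


def din5007_key(field):
    """Return the three sub-fields (level 1, level 2, level 3) as a tuple."""
    basic, modifiers, cases = [], [], []
    for ch in unicodedata.normalize("NFC", field):
        if ch == " ":
            # The space is kept; it compares lower than any letter.
            basic.append(" ")
            modifiers.append(0)
            cases.append(0)
            continue
        if not ch.isalpha():
            continue  # stops and other special characters are ignored
        case = 1 if ch.isupper() else 0  # upper case goes after lower case
        lower = ch.lower()
        if lower in LIGATURES:
            for letter in LIGATURES[lower]:  # two basic letters per ligature
                basic.append(letter)
                modifiers.append(1)
                cases.append(case)
            continue
        decomposed = unicodedata.normalize("NFD", lower)
        base, marks = decomposed[0], decomposed[1:]
        rank = 0
        if marks:  # only the first mark is considered in this sketch
            if marks[0] == DIAERESIS:
                rank = 2 if base in "aou" else 17
            else:
                rank = MARK_RANK.get(marks[0], 99)  # unknown marks go last
        basic.append(base)
        modifiers.append(rank)
        cases.append(case)
    return ("".join(basic), tuple(modifiers), tuple(cases))


# A few headwords from the test cases, in their DIN 5007 order.
headwords = ["arg", "ärger", "Ärger", "ärgern", "arglos", "Ass.", "aß", "Aß",
             "Assel", "es sei denn, daß", "Esel", "Masse", "Massé", "Maße",
             "mäße", "Massen-", "massig", "mäßig"]
scrambled = sorted(headwords)                   # plain code-point order
resorted = sorted(scrambled, key=din5007_key)   # back to DIN 5007 order
print(resorted == headwords)
```

Because the key is an ordinary tuple, any general-purpose sort can use it and the key never needs to be stored in the records. The last three lines follow the checking procedure suggested earlier in this note: permute the data, sort again, and compare with the original order.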