Re: accented Latin characters sort order, non-language dependant

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jul 11 2006 - 06:27:15 CDT

    And in postal addresses, München is most often written MUENCHEN (all capitals).

    This is seen in official records at the United Nations (and at other international institutions or organizations where German addresses are used), and also in many databases. That spelling is preferred simply because the character with an umlaut could display wrongly after character set conversions across databases.
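    As a small illustration of that conversion risk, here is a minimal Python sketch. The UTF-8 and Latin-1 pairing is only an assumed example of a mismatched conversion; the variable names are made up for the illustration.

```python
# Illustration only: the UTF-8 / Latin-1 / ASCII pairings below are assumed
# examples of a conversion mismatch, not conversions named in this message.
name = "M\u00fcnchen"  # "München" with a precomposed u-umlaut (U+00FC)

# UTF-8 bytes misread as Latin-1: the two bytes of the umlaut
# turn into two unrelated characters.
garbled = name.encode("utf-8").decode("latin-1")

# Forcing the name into ASCII replaces the letter with "?".
lossy = name.encode("ascii", errors="replace").decode("ascii")

# The ASCII-only spelling survives the same mismatch unchanged.
safe = "MUENCHEN"
round_trip = safe.encode("ascii").decode("latin-1")

print(garbled)
print(lossy)
print(round_trip == safe)
```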

    Note that I deliberately wrote the name in capitals to exhibit this fact. This is also seen in ads, and is very common in legacy ASCII-only texts (where MUENCHEN or Muenchen is preferred to MUNCHEN or Munchen to preserve the phonetic distinction). Road signs in Germany (and also in Austria!) also write it "MUENCHEN" rather than "MUNCHEN" or "MÜNCHEN".

    What Austria does for ordering entries in its phone directories is irrelevant here, because we're speaking about a city in Germany. I trust Germany to choose the appropriate preferred orthography. And the German sort order, which interprets the umlaut as an E and not as an optional diaeresis diacritic, has long been well established. But I have seen ads within the yellow pages of an Austrian phone directory where "MUENCHEN" is used in displayed addresses. It is also found in Austrian newspapers. The truth is that this is more an orthographic feature of German than simply a sort order problem.

    Note that dictionaries always exhibit the complete orthography and not an abbreviated form, so the umlaut is always present and shown; that's a good reason why it is possible to sort "ü" after "u" and not with "ue". But I note that German nouns that start with the "Über..." prefix are sorted as "Ueber..." and not between "Ub..." and "Uc..." (the leading CAPITAL is an important distinction, because an umlaut over a capital U is often difficult to distinguish from a capital U without umlaut). Sorting it as "UE" gives a visual clue that the umlaut is present and required.

    But it was only an example of how language-independent handling of diacritics is not as simple as dropping diacritics from the primary sort order (or primary collation level) for the Latin script. The truth is that there's no language-independent sort order that works with all languages, and that the DUCET only defines one possible reasonable order that preserves most character identities but does not consider complex cases like digraphs, or characters that have several representations in Unicode that are not canonically equivalent but are still considered equivalent in some languages or cultures.
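    To make the "simply dropping diacritics" approach concrete, here is a minimal Python sketch. It is not the DUCET or any real collation algorithm: the helper name strip_marks is hypothetical, and it only decomposes each string (NFD) and discards combining marks to build a crude primary key.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical primary key: decompose (NFD), drop combining marks, case-fold.

    This is a crude stand-in for "ignore diacritics at the primary level",
    not an implementation of the DUCET.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


# "M\u00fcnchen" is "München" with a precomposed u-umlaut.
words = ["Muz", "M\u00fcnchen", "Muenchen", "Munchen"]
print(sorted(words, key=strip_marks))

# The key treats "München" exactly like "Munchen" ...
print(strip_marks("M\u00fcnchen") == strip_marks("Munchen"))
# ... so the "ue" spelling is no longer recognised as the same name.
print(strip_marks("M\u00fcnchen") == strip_marks("Muenchen"))
# Letters without a canonical decomposition, such as U+00F8, are left alone.
print(strip_marks("\u00f8") == "\u00f8")
```

    With this key "München" collates with "Munchen", and "Muenchen" ends up as a different word, which is the kind of language-specific equivalence a single fixed order cannot express.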

    ----- Original Message -----
    From: "Otto Stolz" <Otto.Stolz@uni-konstanz.de>
    > Philippe Verdy schrieb:
    >> In German, "München" (the city of Munich in Germany) sorts like "MUENCHEN", not like "MUNCHEN"...
    >
    > That is less than half of the truth.
    > In German, there are still two, mutually incompatible, sort orders in
    > use: In 1st approximation,
    > - in dictionaries, "München" would sort like "Munchen",
    > - in German phone directories, "München" would sort like "Muenchen",
    > - in Austrian phone directories, "München" would sort after "Muz"
    > (according to <http://de.wikipedia.org/wiki/DIN_5007> -- I do not
    > have 1st hand knowledge, in this particular case).
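    The three orders quoted above can be sketched as Python sort keys. This is only a first approximation of what the quoted text describes: the function names and the sample word list are hypothetical, "ß" and other details of DIN 5007 are ignored, and the Austrian rule is modelled simply as "an umlauted vowel sorts after every occurrence of its base vowel".

```python
# Hypothetical lookup tables for the three umlauted vowels (a, o, u with umlaut).
BASE = {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u"}
EXPANDED = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}


def dictionary_key(word: str) -> str:
    """Dictionary style: the umlauted vowel sorts like its base vowel."""
    return "".join(BASE.get(c, c) for c in word.lower())


def phonebook_key(word: str) -> str:
    """German phone-book style: the umlauted vowel sorts like vowel + e."""
    return "".join(EXPANDED.get(c, c) for c in word.lower())


def austrian_key(word: str) -> tuple:
    """Austrian phone-book style (as assumed here): the umlauted vowel is
    a separate letter placed right after its base vowel."""
    return tuple((BASE.get(c, c), 1 if c in BASE else 0) for c in word.lower())


# "M\u00fcnchen" is "München"; the other entries are made-up test words.
words = ["Muz", "M\u00fcnchen", "Mund", "Muf"]
for key in (dictionary_key, phonebook_key, austrian_key):
    print(key.__name__, sorted(words, key=key))
```

    The same four words come out in three different orders, so no single key can satisfy all three conventions at once.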


