Re: Decimal separator with more than one character?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 17 2003 - 18:50:13 EDT


    From: "Marco Cimarosti" <marco.cimarosti@essetre.it>
    > Well, countries and languages are allowed to have different names. You are
    > from the United States but you don't speak Unitedstatesian...

    I know that at some point China requested that the correct way to write China and Chinese in English be Zhina and Zhinese (according to the official Pinyin phonetic transliteration of Han into the Latin script)...

    This failed, but the "zh" symbol was adopted for the ISO 639 language code (instead of "cn", used in ISO 3166-1 for the country code). We all know, though, that ISO 639 lacked a serious policy for allocating codes, and that it accepted many incoherent changes or duplicate codes (famous examples include the codes for Hebrew, Indonesian, Norwegian Bokmål/Nynorsk, or more recently the nightmare of codes for Serbo-Croatian/Serbian/Croatian/Bosnian).

    China made other successful requests: for Pékin, the traditional French name (now written Beijing in both French and English, even though everybody still says "Pékin" in French and few people would associate it with Beijing), and for Canton (now Guangdong? I'm not even sure of the official French spelling, as everybody says "Canton")...

    The same thing also happened for Burma (Birmanie in French), and it succeeded: the script, language, and country are now officially called Myanmar in both French and English (but few French people know where Myanmar is, although they know where "Birmanie" is).

    Or for "Laos" (the old French spelling), now officially written simply "Lao" (no problem identifying it, even if everybody would first say "Laos", with the final S pronounced).

    There have also been attempts to require the diacritics needed for a complete transliteration into Latin, but these have almost always failed for technical reasons (lack of such diacritics in English or even in French). So official transliterations run up against tradition in other languages (or against simple technical limits, such as the lack of such symbols on the standard national keyboard).
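
    The technical side of this can be made concrete. Below is a minimal sketch in Python, assuming ISO-8859-1 stands in for the target repertoire (the helper names are hypothetical): it tests whether a name fits the repertoire, and otherwise falls back to the traditional practice of dropping the combining marks.

```python
import unicodedata

def fits_repertoire(name, encoding="latin-1"):
    """True if every character of `name` exists in the target encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def strip_marks(name):
    """Drop combining marks after canonical decomposition (lossy fallback)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

pekin = "P\u00e9kin"          # e with acute exists in ISO-8859-1
viet_nam = "Vi\u1ec7t Nam"    # e with circumflex and dot below does not

for name in (pekin, viet_nam):
    # Keep the official spelling when it fits, otherwise degrade it.
    shown = name if fits_repertoire(name) else strip_marks(name)
    print(shown)
```

    The fallback is lossy by design: it only illustrates why an official transliteration tends to lose its diacritics once it leaves its native repertoire.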

    I really think that it's good to preserve the official spelling or transliteration as much as possible, as long as this does not introduce:
    - technical or cognitive problems in the target language (such as diacritics that do not exist in the native repertoire of that language, for example an upper hook or a caron in French, or even some base Latin letters like the old Anglo-Saxon "thorn", which does not look like a "Latin" letter and was just invented to map a Runic letter under Latin typographic rules),
    - or a name that is nearly impossible for native speakers of the target language to pronounce (because they are not used to pronouncing three consecutive consonant phonemes/letters),
    - or very unusual phonetic transformations when reading it, as this would require knowledge and some common use of foreign reading rules:

    1) For example, a diaeresis over a vowel will often be replaced by an equivalent spelling using multiple vowels or diphthongs in French or English, because the diaeresis is used very differently in French: it detaches the vowel from a previous letter with which it would otherwise combine when read (for example in "aiguë", so that the silent final feminine "e" does not hide the intermediate "u", which is normally written after a "g" to avoid pronouncing it as a "j"), exactly like the diaeresis used in strict English for "coöperate" to avoid merging the two "o"s into a single modified sound "u".

    2) The German "Eszett" (sharp s), a ligature of the long s (the initial or medial form of this letter in Old French, Old English and Old German) with the "standard" final s, will systematically be replaced in modern French or English by a pair of standard "s" (the standard form was traditionally used only in isolated or final position). The long s has survived in German within this ligature with a following s, and only where the first s is not initial, but the usage rule was relaxed a bit to accept (as exceptions only) another terminating letter after the ligated pair in some conjugated verbs. In Unicode, this ligature is encoded as a single letter for compatibility with ISO-8859-1 (even though its uppercase form restores the semantics of two S letters): theoretically it would have been better encoded as a long s followed by the standard s, and the ligature should have been given a canonical, or at least a compatibility, decomposition.
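
    The behaviour of the sharp s can be observed directly. This is a small sketch using Python's standard unicodedata module; the results depend on the Unicode data bundled with the interpreter, but the properties checked here have been stable for a long time.

```python
import unicodedata

SHARP_S = "\u00df"   # LATIN SMALL LETTER SHARP S
LONG_S = "\u017f"    # LATIN SMALL LETTER LONG S

# Full case mapping: uppercasing the single letter yields two letters.
upper_sharp_s = SHARP_S.upper()

# An empty string means the character has no decomposition mapping at all.
sharp_s_decomp = unicodedata.decomposition(SHARP_S)

# Compatibility normalization therefore leaves the sharp s untouched...
nfkd_word = unicodedata.normalize("NFKD", "Stra" + SHARP_S + "e")

# ...whereas the long s on its own does fold to a plain "s".
long_s_decomp = unicodedata.decomposition(LONG_S)
nfkd_long_s = unicodedata.normalize("NFKD", LONG_S)

print(repr(upper_sharp_s), repr(sharp_s_decomp), repr(long_s_decomp))
```

    So the "two S" reading of the letter is visible only through case mapping, not through normalization.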

    3) Same thing for the "oe" and "ae" ligatures: neither has a decomposition in Unicode, even though the French "oe" was not part of ISO-8859-1 and so had no round-trip constraint, unlike "ae". Decomposing them should have been decided independently of the language-specific needs for collation (look for example at existing digraphs like "ch" in Spanish, "c'h" in Breton, or "dz" in Polish, where being written as base letters does not prevent creating a correctly localized collation that considers them plain letters in those languages, and typographic ligatures and/or script limitations in others).
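
    The decomposition data for these letters is easy to check rather than take on faith. Here is a minimal sketch with Python's unicodedata module (the helper name and the sample labels are hypothetical). In the character data shipped with Python, neither U+0153 nor U+00E6 carries a decomposition mapping, while the precomposed digraph U+01F3 and the ligature U+0133 do carry compatibility mappings.

```python
import unicodedata

def decomposition_of(ch):
    """Return the raw decomposition field, or "(none)" when it is empty."""
    return unicodedata.decomposition(ch) or "(none)"

samples = {
    "oe ligature": "\u0153",   # LATIN SMALL LIGATURE OE
    "ae letter": "\u00e6",     # LATIN SMALL LETTER AE
    "dz digraph": "\u01f3",    # LATIN SMALL LETTER DZ (single code point)
    "ij ligature": "\u0133",   # LATIN SMALL LIGATURE IJ
}

# Map each label to the decomposition recorded in the character database.
report = {label: decomposition_of(ch) for label, ch in samples.items()}

for label, value in report.items():
    print(f"{label}: {value}")
```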

    Note that "oe" and "ae" (because they are ligatures within a single grapheme cluster) have a common casing for both parts of the ligature (this is why some languages consider them plain letters that collate separately, and why others collate them as two distinct letters, for example French, where the ligature is considered mostly typographic but is still required wherever possible by the official spelling).
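
    The common casing of both halves is simple to demonstrate. This is a short Python sketch; the sample word is only an illustration. The single code point changes case as a unit, while the spelled-out pair of letters is cased letter by letter.

```python
ligature_word = "\u0153uvre"   # written with the single "oe" code point
spelled_word = "oeuvre"        # written with two separate letters

# Titlecasing the ligature uppercases both halves at once.
ligature_title = ligature_word[0].title() + ligature_word[1:]

# Titlecasing the spelled-out form touches the first letter only.
spelled_title = spelled_word[0].title() + spelled_word[1:]

# Full uppercase also treats the ligature as one letter.
ligature_upper = ligature_word.upper()

# The same holds for "ae".
ae_upper = "\u00e6".upper()

print(ligature_title, spelled_title, ligature_upper, ae_upper)
```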

    This is unlike the (Spanish) ch, (Breton) c'h or (Polish) dz: in titlecased forms only the first letter of these invisible ligatures is uppercased, yet they are still collated as a unit just after their first component letter, and not with it at the primary level. Spanish, Breton and Polish treat these letters as an unbreakable ligature similar to a single letter, even if that ligature is not visible in most fonts, as if this particular spelling were caused by a lack of distinctive letters in the known Latin script, and as if those languages had simply inherited a limited local typographic tradition based on another dominant language.
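
    To illustrate the collation side, here is a deliberately simplified sketch of a primary-level tailoring in Python. It is not the Unicode Collation Algorithm: the unit list, weights and helper names are hypothetical, and only the traditional Spanish "ch" is tailored. The digraph stays two base letters in the text, yet sorts as one unit right after "c", while titlecasing still touches only its first letter.

```python
import re

# Hypothetical tailoring: each entry is one collation unit, and the digraph
# "ch" is a unit of its own that sorts immediately after "c".
UNITS = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l",
         "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t", "u", "v", "w",
         "x", "y", "z"]
WEIGHT = {unit: rank for rank, unit in enumerate(UNITS)}

# Longest units first, so that "ch" is matched before a lone "c".
TOKEN = re.compile("|".join(sorted(map(re.escape, UNITS), key=len, reverse=True)))

def primary_key(word):
    """Map a word to primary weights; case and unknown characters are ignored."""
    return tuple(WEIGHT[tok] for tok in TOKEN.findall(word.lower()))

def titlecase(word):
    """Uppercase only the first letter, even when it opens a digraph."""
    return word[:1].upper() + word[1:]

words = ["dama", "chico", "cosa", "cinco", "curva"]
tailored = sorted(words, key=primary_key)   # "ch" words after all "c" words
plain = sorted(words)                       # code point order, "ch" inside "c"

print(tailored)
print(plain)
print(titlecase("chico"))
```

    A real implementation would rely on a collation library with locale tailorings rather than a hand-made weight table; the point here is only that keeping the base letters in the text does not prevent treating the pair as one letter for sorting.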

    This last case also happens in other languages that have been converted to the Latin script during the 19th century, depending on who initiated this conversion:

    1) When the transliteration was political, little research was done by linguists, and this created such "inaccuracies" in the Latin transliteration, which was adopted all the same. (The case of Breton is not settled: several orthographic systems still exist, using the same subset of the Latin script as French, notably because there are now several major dialects with distinct historical evolutions. The language was mostly transmitted orally in its recent history, because books and publications in Breton were prohibited for several centuries, even though Breton has an older literary tradition than French and is very closely related to Welsh, with which there is mutual understanding, even in the modern form that survives today in southern Britain.)

    2) When the transliteration was done by clever linguists, they often added diacritics to Latin letters.
    For example in Vietnamese, where this was done quite "excessively", using multiple diacritics to preserve as much of the local phonetics as possible; if this "alphabetization" of traditional ideographs had been done by Brahmic linguists, a more elegant solution would have used the Brahmic system of vowel signs and consonant modifiers, as was done historically for the Thai, Lao and Myanmar scripts (and in part for the Hiragana and Katakana scripts in Japan, where this system is very simplified and hard to see in the surviving simple scripts, which ignore many phonetic variants because of the important regional phonetic variation in modern Japanese).


