Re: String name and Character Name

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Apr 18 2005 - 03:42:48 CST

    From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
    > How about the following idea of overcoming the difficulty?
    > 1. Identify the characters with misleading official names.
    > 2. Define better names for them in the "en" locale, and preferably
    > in the "fr" locale as well.
    > 3. Enhance CLDR with the feature of combining locales, in the sense
    > that a user's locale choice can consist of a sequence of locales
    > in order of preference. For example, a user's choice could mean
    > "use the 'de' locale for anything defined there but the 'en'
    > locale for things that aren't defined in the 'de' locale".

    I have long wanted this feature in Java, in place of the overly basic
    "locale resolution" algorithm, which merely strips the variant code, then
    the region code, and finally the language code (after retrying with the
    system locale). Notably, there is still nothing to resolve the script code
    correctly. Where should it go in a Java locale code? The best place would
    be within the language code, with a separator, or else a lettercase
    convention or bundle resource names could be used.
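
    A minimal sketch of the difference, written in Python only for brevity.
    Everything here is an assumption made for illustration: the bundle table,
    its keys and the function names are invented and are not a Java or CLDR
    API, and the retry with the system locale is left out. The first function
    mimics the stripping step; the second walks an ordered list of preferred
    locales and applies the stripping to each one in turn.

```python
def strip_chain(tag):
    """Return the candidates obtained by dropping trailing subtags.

    'de_DE_PREEURO' gives ['de_DE_PREEURO', 'de_DE', 'de'].
    """
    parts = tag.split("_")
    return ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]


def resolve(key, preferences, bundles):
    """Look up `key` in each preferred locale, in order of preference.

    `bundles` maps a locale tag to a dict of resources (hypothetical data).
    Returns (locale_that_answered, value), or None if nothing matched.
    """
    for tag in preferences:
        for candidate in strip_chain(tag):
            table = bundles.get(candidate, {})
            if key in table:
                return candidate, table[key]
    return None


# Hypothetical resource bundles: 'de' is incomplete, 'en' fills the gaps.
BUNDLES = {
    "de": {"greeting": "Hallo"},
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
}

print(resolve("greeting", ["de_DE", "en"], BUNDLES))
print(resolve("farewell", ["de_DE", "en"], BUNDLES))
```

    With a single locale, "farewell" would fall through to whatever the
    default is; with the ordered list, the user decides which locale comes
    next.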

    > That way, when accessing a character with a misleading official name,
    > the information shown to the user would consist of its localized name
    > in the "en" locale (or maybe "fr" locale), unless a name has been defined
    > for it in the user's preferred locale.

    This is the suggestion I have made here several times. The Unicode-hosted
    CLDR is probably the best place to archive these localized lists of
    character names. Yes, it is a huge task, but we are not starting from
    nothing: the normative Unicode/ISO/IEC 10646 English and French names
    already exist, and we only need to correct the few "errors", misleading
    names or inaccuracies for the English and French locales.

    Also, the CLDR data does not need to be part of the standard: the
    normative names remain unchanged. For example, the normative English
    uppercase names would be those adopted in a "C" locale, and would still be
    used in the Unicode regexp "\N{name}" specifiers.
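
    To illustrate how the two layers could coexist, here is a small Python
    sketch. The LOCALIZED_NAMES table and the display_name function are
    hypothetical (no such CLDR data is assumed to exist); only the
    unicodedata module and the \N{...} escape are real Python features.
    U+01A2 serves as the example because its normative name says "OI" while
    the letter is generally known as "gha".

```python
import unicodedata

# Hypothetical overlay: only characters whose normative name is misleading
# need an entry; every other character falls back to the normative name.
LOCALIZED_NAMES = {
    "en": {0x01A2: "Latin capital letter gha"},
    "fr": {},
    "de": {},
}


def display_name(ch, preferences):
    """Return the first localized name found, else the normative name."""
    cp = ord(ch)
    for locale in preferences:
        name = LOCALIZED_NAMES.get(locale, {}).get(cp)
        if name is not None:
            return name
    # Acts like the "C" locale: the unchanged normative uppercase name.
    return unicodedata.name(ch, f"U+{cp:04X}")


# The normative name is untouched and still drives \N{...} lookups.
print(unicodedata.name("\u01a2"))
print("\N{LATIN SMALL LETTER Z}")

# A German user with English as the second choice gets the corrected name.
print(display_name("\u01a2", ["de", "en"]))
print(display_name("z", ["de", "en"]))
```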

    Because of that, the CLDR can be updated as often as needed to reflect the
    "best practice" for each language. Reaching a consensus would be much less
    critical to making progress on this project.

    After all, Unicode already hosts another "huge" task, the Unihan database
    (for Han characters), which is still far from complete or accurate. Some
    parts of the Unihan database could also feed the new localized name lists
    for the Chinese, Japanese and Korean locales, giving descriptive character
    names that are much more useful than the normative "English" Unicode
    character names, which amount to little more than the hexadecimal code
    point. Perhaps several existing Unihan database fields would be easier to
    manage in separate locales (for example, a Chinese-Pinyin locale for the
    Pinyin name), and this would ease the construction of input method editors
    that let users sort and select Han ideographs according to their
    preferences...
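
    A short Python sketch of the point about Han names. The locale tag
    "zh-Latn-pinyin", the HAN_NAMES table and its single entry are
    illustrative assumptions standing in for data that would come from a
    Unihan reading field; only unicodedata.name is a real library call.

```python
import unicodedata

# Illustrative data only: one reading for U+4E2D, as a Pinyin "locale".
HAN_NAMES = {
    "zh-Latn-pinyin": {0x4E2D: "zhōng"},
}


def han_label(ch, locale):
    """Prefer the localized label; otherwise use the normative name."""
    return HAN_NAMES.get(locale, {}).get(ord(ch)) or unicodedata.name(ch)


# The normative name carries no information beyond the code point.
print(unicodedata.name("\u4e2d"))

# A per-locale list gives something a user can actually search or sort by.
print(han_label("\u4e2d", "zh-Latn-pinyin"))
```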

    So Unicode, ISO and Unihan already give us at least three localized
    working name lists to start from, and "errors" reported in other languages
    could be better reflected in the localized name lists as well, even if not
    every character is listed for every locale.

    An additional source of information is the subset of "representative
    characters" that form the correct alphabet of a language (already specified
    in ICU):
    - we could rapidly translate at least the names of these characters into
    their native written languages
    - and probably into IPA phonetic notation as well, in separate locale data
    for oral speech (enabling aural identification with speech synthesizers,
    in character selectors or spelling text readers), because the localized
    native name would often give only the letter itself (like the Z in LATIN
    SMALL LETTER Z)
    - IPA could also help translators provide accurate localized
    orthographies for the names of other characters that are foreign to the
    target languages (see for example the various orthographies of English or
    French names that are sometimes used for Hebrew or Arabic letters...)


