Re: CLDR: Bad exemplar chars for some locales [ar,fa]

From: Jukka K. Korpela
Date: Thu Apr 06 2006 - 15:04:10 CST

    On Thu, 6 Apr 2006, Peter Edberg wrote:

    > All of this hinges on the definition of what the exemplar set is supposed to
    > cover.

    Indeed. And this in turn should depend on the intended _use_ of this
    definition. How will the "exemplar character sets" be used in text
    processing and other applications?

    > From UTS #35 (LDML): "The exemplar character set contains the commonly
    > used letters for a given modern form of a language...

    It says that the "letter" concept should be interpreted broadly, but I
    don't think we can count ZWJ, for example, as a letter without losing the
    whole idea of a "letter" as opposed to a "character". On the other hand, I
    have not seen any rationale for defining the set as a set of letters, or
    for the odd-looking name "exemplar character set" for that matter.

    I think we can all imagine many possible uses for information about the
    use of characters in a language. The one mentioned first in the LDML
    specification, namely the choice of encoding, does not sound like a
    particularly important one in the Unicode context. The "charset
    conversion" usage looks odd ("'Character set' considered harmful"), and it
    probably means conversions between encodings. Then there's collation
    mentioned, but I fail to see the relevance.

    A considerably more informative definition is needed, and it should be
    something that different people around the globe can understand in a
    reasonably similar manner. I'm afraid the definitions of "exemplar
    character sets" become rather useless if they are set up according to
    greatly varying criteria.

    The concept "collection of characters used in a language" is vague and
    fuzzy. The multitude of possible interpretations needs to be squeezed down
    to a small set of manageable definitions, though I'm afraid just two (the
    basic set and the auxiliary set) isn't quite enough.

    I hope the discussions can start before people waste far too much time
    considering the sets and debating them without knowing what they are
    actually trying to define. I'd like to suggest a concrete starting
    point, namely to consider whether the following tentative definitions
    would be a suitable basis:
    - The set of characters that is regarded as the absolute minimum
       for writing a language, including punctuation and controls.
       Any application should support this set before it can be said
       to support the language.
    - The set of characters that is considered the basic repertoire
       for use in orthographically correct writing of the language,
       without any ASCII-era compromises like ambiguous semantics for "-".
       This means roughly speaking the characters you can expect to find
       in books in the language for a general audience.
    I am mainly thinking of these as definitions to be used when selecting
    fonts, or when deciding whether some software can be characterized as
    supporting typing of the language, or when defining parameters for
    OCR scanning, or designing general-purpose input data checking for
    data in the language.
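
    To make the two tentative definitions concrete, here is a minimal
    Python sketch of how they might be used for font selection and for
    input data checking. Everything in it is an assumption made for
    illustration: the names MINIMUM_SET, BASIC_SET, support_level and
    unexpected_characters are hypothetical, and the sample repertoire is
    invented for a Finnish-like language rather than taken from CLDR.

```python
# Hypothetical sample data for illustration only; not taken from CLDR.
_LETTERS = "abcdefghijklmnopqrstuvwxyzåäö"

# Definition 1: the absolute minimum for writing the language,
# including punctuation (and, in principle, any required controls).
MINIMUM_SET = (
    set(_LETTERS)
    | set(_LETTERS.upper())
    | set("0123456789")
    | set(" .,:;?!-'\"()")
)

# Definition 2: the basic repertoire for orthographically correct text,
# e.g. a real en dash and typographic quotes instead of ASCII stand-ins.
BASIC_SET = MINIMUM_SET | {"\u2013", "\u201d", "\u2019", "é", "É"}


def missing_characters(required, supported):
    """Return the required characters that the supported repertoire lacks."""
    return sorted(set(required) - set(supported))


def support_level(supported):
    """Classify a font or application repertoire against the two sets."""
    if missing_characters(MINIMUM_SET, supported):
        return "none"
    if missing_characters(BASIC_SET, supported):
        return "minimum"
    return "basic"


def unexpected_characters(text, allowed):
    """General-purpose input check: characters in text outside the allowed set."""
    return sorted({ch for ch in text if ch not in allowed})
```

    For example, a repertoire limited to printable ASCII would be
    classified as not supporting this sample language at all, since it
    lacks the letters with diacritics, while one covering all of
    ISO 8859-1 would reach only the minimum level, since it lacks the
    en dash and the typographic quotes.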

    Jukka "Yucca" Korpela