From: Keutgen, Walter (
Date: Tue May 16 2006 - 12:58:48 CDT


    You are right, there is enough fog around that for people outside the CLDR team. Moreover, applying your 'newspaper' criterion below, I felt that for German it would now be politically correct to include Polish characters. The space in the survey tool is limited, though, so I would have had to throw out south European languages. So I left it as it was.

    The tool uses the sets to flag textual data as erroneous if containing letters outside of the sets, but the team has decided to disregard this 'error' because many exemplar character lists have stayed empty.
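
    A rough sketch of the kind of check described above, in Python. It is not the survey tool's actual code: the function name, the sample German sets, and the choice to skip the check when the exemplar list is empty are all assumptions made for illustration.

```python
import unicodedata

# Illustrative sets only (assumption): NOT the actual CLDR data for German.
GERMAN_EXEMPLAR = set("abcdefghijklmnopqrstuvwxyzäöüß")
GERMAN_AUXILIARY = set("ąćęłńóśźż")  # hypothetical Polish additions


def letters_outside(text, exemplar, auxiliary=frozenset()):
    """Return the sorted lowercase letters of `text` that are not in
    the union of the exemplar and auxiliary sets."""
    if not exemplar:
        # Assumption: an empty exemplar list means "no data", so nothing
        # is flagged (mirrors the decision to disregard the 'error').
        return []
    allowed = {c.lower() for c in exemplar} | {c.lower() for c in auxiliary}
    offenders = set()
    for ch in unicodedata.normalize("NFC", text):
        # Only letters (general category L*) are checked; digits,
        # punctuation and spaces are ignored.
        if unicodedata.category(ch).startswith("L") and ch.lower() not in allowed:
            offenders.add(ch.lower())
    return sorted(offenders)


if __name__ == "__main__":
    sample = "Wałęsa besucht Köln"
    print(letters_outside(sample, GERMAN_EXEMPLAR))
    print(letters_outside(sample, GERMAN_EXEMPLAR, GERMAN_AUXILIARY))
```

    With the sample sets, the Polish letters in the name are flagged against the exemplar set alone and accepted once the hypothetical auxiliary set is added.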

    Best regards


    -----Original Message-----
    From: [] On Behalf Of Asmus Freytag
    Sent: Tuesday, 16 May 2006 15:57
    To: Jukka K. Korpela
    Subject: Re: CLDR

    On 5/16/2006 12:32 AM, Jukka K. Korpela wrote:
    > On Tue, 16 May 2006, Balasankar wrote:
    >> Whether the union of Exemplar & auxiliary exemplar character set
    >> should contain all the possible characters used in the particular
    >> language?
    > No. It is impossible to list down the characters used in a language;
    > the set is very fuzzy, with membership ranging from core characters
    > (such as "a" in English) through marginal characters (like "é", i.e.
    > "e" with acute, in English) to characters that may appear in special words,
    > typically borrowings, perhaps _very_ rarely.
    At some point you run into the 'newspaper' issue: in some cultures,
    newspapers will preserve more of the spelling of foreign names (if they
    use the Latin script) than is common in US papers. While such names are
    not exactly borrowed words, they do form part of widely disseminated
    texts in that language. As a result, the set required to be able to
    handle 'texts accessed by ordinary users' in these cultures is quite
    large, and has lost any specificity towards a given *language*.

    I ran into that problem a decade ago when I dabbled in language recognition.
    > Moreover, these sets are currently supposed to list down _letters_
    > only. The two sets make it possible to give a rather rough description
    > of letters used in a language, and the choices made are often rather
    > debatable.
    > It isn't even clear what the intended _use_ of the sets is, or what
    > the actual use will be. There is a large number of imaginable uses,
    > with their own implications on what the grounds for defining the sets
    > should really be. I'm afraid the (mostly implicit) criteria applied
    > now make the sets incommensurable across languages.
    That's been my feeling as well, but every time I mention this to people
    who are at the core of the CLDR activity they assure me that there are
    such criteria (including a clear specification of the intended use). If
    that's the case, can anyone give a URL to them?

