Re: CLDR errors that can't be corrected

From: Erkki Kolehmainen (
Date: Fri May 26 2006 - 03:00:32 CDT


    I can only agree with many of Philippe's concerns. I'd like to make some
    comments, though:

    The CLDR process is intended to create complete snapshots at given times
    so that the various vendors can start at the same time utilizing the
    data for all the languages that they support. Releasing the data
    language-by-language would cause a tremendous logistical burden and
    administrative overhead. The next release is expected to have a shorter
    preparation time, thanks to the introduction of the Survey Tool.

    No vendor is likely to support all the languages, and they may well
    choose to support only the languages that have fully vetted data,
    although that may be too far-fetched a conclusion. It should be noted
    that there _are_ current implementations that are often based on
    considerably more erroneous data, so a new CLDR release _is_ likely to
    bring forth an improvement, which should not be delayed indefinitely.

    When looking at the submitted data, I get the feeling that Philippe's
    proposals are often more correct from a linguistic point of view than
    the competing proposals. This, however, is not necessarily always the
    best way to express things in "computerese". Thus, e.g., langues chames,
    langues créoles et pidgins (autres), langues indo-aryennes (autres),
    langues iroquoises, langues khoïsans (autres), langues manobos, langues
    mon-khmères (autres), langues moundas, etc. most certainly (even with my
    highly limited knowledge - I wouldn't dream of doing any vetting of the
    French data) are expressed in better French than having the sequence
    reversed. The reversed sequence, however, is preferred by many for use
    in, e.g., ordered lists.

    The CLDR data is intended to be the practical compromise between, say,
    what is formally correct and what is actually being used (e.g. in most
    country names), and one cannot expect full consensus on many of the
    elements. Nevertheless, some default value is always wired into the
    systems, and I trust that most users would not want to see a fall-back
    to the code or the expression in English.
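
    To make the fall-back concern concrete, here is a minimal sketch of a
    display-name lookup that tries the requested locale, then its parent,
    then English, and finally the bare code. The table contents and the
    names NAMES, parent_locale, and display_name are hypothetical, and the
    chain shown is an illustrative assumption, not the actual CLDR
    inheritance rules.

```python
from typing import Mapping, Optional

# Hypothetical display-name tables: locale -> language code -> name.
NAMES = {
    "fr": {"fi": "finnois"},
    "en": {"fi": "Finnish", "sv": "Swedish"},
}


def parent_locale(locale: str) -> Optional[str]:
    """Drop the last subtag: 'fr_CA' -> 'fr', 'fr' -> None."""
    head, sep, _ = locale.rpartition("_")
    return head if sep else None


def display_name(
    code: str,
    locale: str,
    names: Mapping[str, Mapping[str, str]] = NAMES,
    last_resort: str = "en",
) -> str:
    """Return a name for `code`, falling back step by step."""
    current: Optional[str] = locale
    # Walk from the requested locale up through its parents.
    while current is not None:
        name = names.get(current, {}).get(code)
        if name is not None:
            return name
        current = parent_locale(current)
    # Assumed last resort: the English name, then the raw code.
    name = names.get(last_resort, {}).get(code)
    return name if name is not None else code


print(display_name("fi", "fr_CA"))  # found in the parent locale "fr"
print(display_name("sv", "fr_CA"))  # no French entry: English fall-back
print(display_name("xx", "fr"))     # nothing known: the bare code
```

    The last two calls show the outcomes most users would rather not see:
    an English expression, or the code itself, in an otherwise French
    interface.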

    Erkki I. Kolehmainen

    Philippe Verdy wrote:

    > After rereading the email I sent, I think I must apologize if the tone of my text was not respectful enough of your work. In fact, in the last few months, most activities have been concentrated on the upcoming Unicode 5.0 version, and on testing the BETA or discussing it. Closing the CLDR at the same time as the Unicode 5.0 release is probably not the best choice.
    > There has even been a period where the Unicode websites were completely inaccessible (servers down at its colocation area).
    > And also, the CLDR is still a child not completely born, with lots of beta data, frequent changes in the format, new aliases, and even the CLDR website does not allow completely testing all cases.
    > The site may also need ways to clean our own errors or changes, instead of continuing to display options we created ourselves, and then we chose to not support. Multiplying the options available on screen just complicates the validation of our own data. So why can't we simply remove the items we created and that we no longer support?
    > There should be a way to see in our submissions whether there are items that we have still not reviewed, or items for which there exist conflicts of opinion with other users (maybe they are right, or maybe we can find an alternative compromise that may satisfy multiple users).
    > My opinion is that there's no emergency to close all languages simultaneously. And the vetting process, once it is started, should include a "Reject" option instead of just a "confirm" option. If things are completely frozen for a next version due to absence of data in the open period, then there should be no radio-button at all, but a link to the Unicode Report form where only obvious bugs will be treated, either immediately, or given consideration later.
    > Or the alternative would be to leave the submission forms open, but they will not be part of the next version, except if notable errors are reported and need to be corrected using the ongoing proposals.
    > I think it's illusory to think that all locales can be verified at the same time with the same schedules. So the vetting process should start after there have been some change proposals since the last release, and enough time has been given to correct things. It's a fact that the current process uses a very slow release cycle of several months, but the method used creates emergency hot spots in the current schedule, which does not help the quality of submissions (notably when the CLDR structure has changed as much as it has in the last few months, with many corrections, a new aliasing model, new coherence checks, ...)
    > For me the CLDR will be a useful tool in the long term, but it's really too soon to schedule it as if it had produced a coherent standard that must be maintained with stability rules like the Unicode standard and the UTC/WG2 working group schedules. For now there are still too many differences between various sources and related standards (including in ISO standards themselves, like the various orthographies of ISO3166 countries, ISO639 languages, ISO15924 scripts, ISO10646/Unicode character names and block names, the toponymy of timezones, normalisation of singular/plural forms, uniform separators for alternate names...). Things are getting better, but they are not finished and not even stable.
