From mark.edward.davis@gmail.com Tue Apr 1 08:14:45 2008
Message-ID: <30b660a20804010714j26eb8143n8e5ce5f2c9a9b5cc@mail.gmail.com>
Date: Tue, 1 Apr 2008 07:14:27 -0700
From: "Mark Davis"
To: "CLDR list", "cldr-users@unicode.org"
Subject: Date formats with months
X-archive-position: 448
X-list: cldr-users

There is a way to make this work. We allow for two forms of every month name. The "format" style is what should occur in formatted dates, like "14 d'abril de 2008". The month value there can contain an inflected month or add prepositions to the month, so the value for that in Catalan could be "de X" or "d'X" depending on the month X. The date format, however, needs to be correspondingly adjusted to remove the "de ".

The "stand-alone" style is what the month should look like if it occurs by itself, such as at the top of a calendar.
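
A minimal sketch of the two styles described above, in Python. The table names (STAND_ALONE, FORMAT) and the helper functions are hypothetical and only illustrate the idea; they are not CLDR identifiers, and the vowel test is an assumption that happens to be sufficient for these twelve Catalan month names.

```python
# Hypothetical "stand-alone" month names for Catalan (as at the top of a calendar).
STAND_ALONE = [
    "gener", "febrer", "març", "abril", "maig", "juny",
    "juliol", "agost", "setembre", "octubre", "novembre", "desembre",
]

VOWELS = "aeiou"  # assumption: enough to decide the contraction for these names


def format_style(name):
    """Build the "format" style value: the preposition is part of the month value."""
    prefix = "d'" if name[0] in VOWELS else "de "
    return prefix + name


# Hypothetical "format" month names, e.g. "de gener", "d'abril".
FORMAT = [format_style(name) for name in STAND_ALONE]


def full_date(day, month, year):
    """Format a full date; the pattern has no literal "de " before the month,
    because the "format" month value already carries it."""
    return f"{day} {FORMAT[month - 1]} de {year}"


print(full_date(14, 1, 2008))
print(full_date(14, 4, 2008))
print(STAND_ALONE[3])
```

The point of the design is that the contraction is decided once, in the data, rather than by logic in every date pattern.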

Mark

On Tue, Apr 1, 2008 at 12:22 AM, Artur Klauser wrote:
I'm wondering if there is any way to make full dates orthographically correct in Catalan.

The general date format is: <day> de <month> de <year>
However, Catalan contracts "de" with the following word, if that word starts with a vowel,
e.g. it would be "14 de gener de 2008" but "14 d'abril de 2008".

...



--
Mark

From verdy_p@wanadoo.fr Tue Apr 1 09:32:13 2008
From: "Philippe Verdy"
To: "'Mark Davis'", "'CLDR list'"
References: <30b660a20804010714j26eb8143n8e5ce5f2c9a9b5cc@mail.gmail.com>
Subject: RE: Date formats with months
Date: Tue, 1 Apr 2008 17:31:44 +0200
Message-ID: <033e01c8940d$7e1fd440$0a01a8c0@HARNON>
In-Reply-To: <30b660a20804010714j26eb8143n8e5ce5f2c9a9b5cc@mail.gmail.com>
X-archive-position: 449
X-list: cldr-users

It's probably much simpler not to specify any "stand-alone" variant, and just to use the same variant as with the "format" style, when the month names do not change their orthography, and then to insert the relevant words with "de" in the date format within single quotation marks.

However, the problem exposed here is that the month name may start with a vowel, and so the preceding "de" would contract with it; this is similar to the case where the month name takes a genitive mark and its orthography changes. For that case, the "format" style should be used, so that it will integrate the genitive mark or the (name-dependent) "de" or "d’" or similar prefix.

The CLDR structure currently inherits the aliasing (by default) between the "format" and "stand-alone" styles. However, this aliasing was created (in the Root locale) in either direction; this causes confusion, when entering the date elements, about which direction applies and which item must be changed when only one form is needed. When the elements are sorted by code, the aliased items (where the aliasing is inherited from Root) are mixed with the non-aliased items, and this does not show the structure of the aliasing very well, which becomes confusing.

My opinion is that no aliasing between styles should occur for the "format" style, which should be the default style to which the default aliases point (the aliases will have to be deleted and changed to a specific value in those locales that need distinct variants). I think that this could be changed automatically (no vetting needed) to swap the direction of those aliases. But be warned that this could create a chain of aliases if the current "stand-alone" style is also referenced by a style other than "format", when this pair is swapped with a modified target for the new alias; such extra aliases may exist in Root between the "narrow" (limited to one character only and not necessarily unique), "abbreviated" and "wide" forms. As chaining aliases has a cost at run-time, their topological structure should be maintained as logically as possible and coherently for all locales. If not designed carefully, these chained aliases may create infinite loops (and this may be the reason why aliases cannot be created in the CLDR Survey tool, but can only be replaced by actual values).

I posted a bug report yesterday about these effects (and also about the fact that date format strings, despite being entered correctly, are not parsed correctly by the current date formatter used to display dates in the example columns).

_____

From: cldr-users-bounce@unicode.org [mailto:cldr-users-bounce@unicode.org] On behalf of Mark Davis
Sent: Tuesday, 1 April 2008 16:14
To: CLDR list; cldr-users@unicode.org
Subject: Date formats with months

There is a way to make this work. We allow for two forms of every month name. The "format" style is what should occur in formatted dates, like "14 d'abril de 2008". The month value there can contain an inflected month or add prepositions to the month, so value for that in Catalan could be "de X" or "d'X" depending on the month X. The date format, however, needs to be correspondingly adjusted to remove the "de ".

The "stand-alone" style is what the month should look like if it occurs by itself, such as at the top of a calendar.

Mark

On Tue, Apr 1, 2008 at 12:22 AM, Artur Klauser wrote:
I'm wondering if there is any way to make full dates orthographically correct in Catalan.

The general date format is: <day> de <month> de <year>
However, Catalan contracts "de" with the following word, if that word starts with a vowel,
e.g. it would be "14 de gener de 2008" but "14 d'abril de 2008".

...

--
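
The concern about chained aliases and infinite loops raised in the message above can be sketched as follows. This is only an illustration under stated assumptions: the key names ("format/wide/4" and so on), the two dictionaries and the resolve function are hypothetical, and they do not reflect the actual contents of the Root locale or the way any CLDR tool resolves aliases.

```python
def resolve(key, values, aliases):
    """Follow alias links from `key` until a concrete value is found.

    Returns (value, hops), where hops is the number of aliases followed.
    Raises ValueError on an alias loop, and KeyError if the chain ends
    at a key that has neither a value nor an alias.
    """
    visited = []
    while key not in values:
        if key in visited:
            raise ValueError("alias loop: " + " -> ".join(visited + [key]))
        visited.append(key)
        key = aliases[key]  # KeyError here means the chain dead-ends
    return values[key], len(visited)


# Hypothetical data: one concrete value and a two-step alias chain.
values = {"format/wide/4": "d'abril"}
aliases = {
    "stand-alone/wide/4": "format/wide/4",
    "stand-alone/abbreviated/4": "stand-alone/wide/4",
}

# Hypothetical broken data: two styles aliased to each other, no value anywhere.
looping = {
    "format/wide/4": "stand-alone/wide/4",
    "stand-alone/wide/4": "format/wide/4",
}

print(resolve("stand-alone/abbreviated/4", values, aliases))
```

Each extra hop is one more lookup at run-time, which is the cost mentioned above; the visited list is what prevents a badly swapped pair of aliases from looping forever.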
Mark

From rick@unicode.org Tue Apr 1 09:46:37 2008
Message-Id: <200804011546.m31FkaaI001923@unicode.org>
To: cldr-users@unicode.org
Subject: Slow Unicode.org server yesterday (March 31)
Date: Tue, 1 Apr 2008 07:46:39 -0800
From: Rick McGowan
X-archive-position: 450
X-list: cldr-users

Yesterday afternoon I received a report that the Unicode.org server was so slow that ordinary HTML pages were timing out.

I found hundreds of suspicious open connections for IP address 41.232.132.25 repeatedly downloading one huge file. I blocked that IP address, and this resolved the problem.

If you find that the server becomes much slower than usual, please feel free to report this to me and I will investigate.

Rick

From verdy_p@wanadoo.fr Tue Apr 1 20:36:56 2008
From: "Philippe Verdy"
To: "'Rick McGowan'"
References: <200804011546.m31FkaaI001923@unicode.org>
Subject: RE: Slow Unicode.org server yesterday (March 31)
Date: Wed, 2 Apr 2008 04:36:27 +0200
Message-ID: <035e01c8946a$5a626060$0a01a8c0@HARNON>
In-Reply-To: <200804011546.m31FkaaI001923@unicode.org>
X-archive-position: 451
X-list: cldr-users

Rick McGowan wrote:
> Yesterday afternoon I received a report that the Unicode.org
> server was so slow that ordinary HTML pages were timing out.
>
> I found hundreds of suspicious open connections for IP address
> 41.232.132.25 repeatedly downloading one huge file.

These repeated connections may be the result of repeated slow responses from the server.
This may be a bug in a tool used at that Egyptian site in Cairo (but it may also be caused by repeated loss of sessions due to its Egyptian ISP, when the user just tries to get the same file, or by a bogus downloader tool that makes excessive repeated requests in case of server-side failure or connection loss).

> I blocked that IP address, and this resolved the problem.
>
> If you find that the server becomes much slower than usual,
> please feel free to report this to me and I will investigate.

This explains why so many attempts to connect to the CLDR Survey failed, or took minutes to validate each vote or just to display each page.

I wanted to post a comment to the CLDR forum and my comment was lost as well after several attempts to resubmit it (no response from server).

Anyway, the CLDR Survey is still very slow, and this has not changed much since the last survey period last year. I think it is taking too many resources, or there's something wrong in the database design, and it takes too much time parsing and reparsing the same XML documents or regenerating them to get them in form. The XML generation for the publication should probably be done by extracting from a more traditional RDBMS (with tables, fields and indexes) rather than from large XML documents, if this is what it is doing.

I don't understand why conducting the survey with very few users connected is much slower than driving a classic web site with cheap server components like PHP and MySQL over Apache: all those websites can easily manage lots of simultaneous users, and still perform complex scripting, such as running a Wiki engine.

I know that the Survey is written with server-side Java components, but it certainly suffers a lot from very slow performance. I don't know which server you use (Oracle?), but it seems that the code generating the HTML pages from the survey application really needs some revamping.
Also, the XML database updates when submitting data seem to create extremely complex transactions to update lots of things, possibly in many shared XML documents, and this apparently requires regenerating the complete set of XML files, instead of just updating a few fields; this causes a lot of contention due to exclusive locks, and no more than a handful of users can decently work connected to the Survey app.

Each action requires a lot of patience, and we also frequently lose the HTTP session, having to log on again and revalidate the same screen: the session timeout is apparently very short, and given the response time of the server, about half of my attempts to update things fail. It's also completely impossible to work with it at some hours.

The process is then extremely discouraging, and we need lots of perseverance. The bad thing is that this discouragement will have consequences: not enough vetters to accept the changes, and even the UTC members with higher privileges in the Survey don't come to update their choices (this was seen in the last survey period, and even now they have apparently still not connected to the Survey since its recent reopening: this is visible in the number of submitted votes per item, where I am still alone in my French locale...)

So the problem is not that the CLDR Survey is slower than usual; it has always been very slow!
Given the increased number of items to validate, with a single category containing more than 2100 items per locale, i.e. 27 pages containing 80 items, that require about 15 minutes of work per page, due to long response times and loss of sessions, reviewing this single category will require about 8 hours of continuous work. Reviewing the complete set will require more than 20 hours of continuous work, i.e. about 3 weeks at a reasonable rate of not more than one hour per day on average (and only if I connect to it every day, something that is not reasonable).
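
As a rough check of the figures quoted above, here is a small Python sketch. The function name review_time is hypothetical; the inputs (2100 items, 80 items per page, 15 minutes per page) are the ones given in the message. The raw arithmetic gives a little under 7 hours for the single category, so the "about 8 hours" above presumably includes some allowance for retries, which is an assumption on my part.

```python
import math


def review_time(items, items_per_page, minutes_per_page):
    """Return (pages, hours) needed to review `items` entries page by page."""
    pages = math.ceil(items / items_per_page)  # a partial last page still counts
    hours = pages * minutes_per_page / 60
    return pages, hours


pages, hours = review_time(2100, 80, 15)
print(pages, hours)
```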

So we are really too far away in needed work time, and too near to the time limit of the submission period (which is about one month in length). Last year I wanted to campaign for the Survey submission period to be continuous (permanently open), with just the review period being limited, by working on a snapshot taken for a beta release.

One solution would be to put the submission process on a separate server, using more traditional solutions: an open catalog of open-source XML files, and a versioning system like CVS. It could also be hosted elsewhere than on the Unicode site, using larger platforms made to host hundreds of open-source projects; the issue would be the copyright and licensing assignment required, which may be incompatible with the CLDR copyright and licensing policy.

Also, many vetters are not concerned with all the CLDR data categories. It should be possible to extract several subschemas from the CLDR, and work directly on them with a simpler system that will be more efficient.

More work is also needed to reorganize some very large categories, notably the local time metazones. Having to navigate in categories with hundreds or thousands of entries that are not directly related to the same schema level is really a nightmare. It's hard to consolidate and unify the presentations, and possible errors are more difficult to locate. No category should contain more than about 240 entries (3 pages), and then it should be possible to present all the items of the same category on the same page, so that we can make sure to get a coherent and complete set of usable data (the incoherence and incompleteness of most categories make the whole CLDR database unusable directly for any project without having to review it privately again before using it, so this cannot help create a standard, and migration from one version of the CLDR database to another is unlikely to occur easily if this additional review is too lengthy).

When I look at the various entries, it's clear that each category has a structure that is currently not used as it should be: there are essential entries, and many optional entries which take their values from default aliases inherited from Root. These additional entries should be in separate subcategories. Conversely, there are main categories that contain just a handful of entries. The contrast is evident: the Survey tool was not designed to work on categories at various levels of the XML data tree.

So the problem is not solved. I still experience frequent loss of sessions, and reviewing the same data again on large pages makes progress really slow, and it's probable that all other contributors experience the same repeated problems. The lack of reviewers has a consequence: the CLDR project is still not viable as a working, effective standard for collaboration, and every site just consults it from time to time to build its own locale data sets with its own uncorrected errors (as demonstrated on the Google search web site, for example).

From mark.edward.davis@gmail.com Thu Apr 3 08:32:42 2008
Message-ID: <30b660a20804030732g44af2f7erf41a264104bba46@mail.gmail.com>
Date: Thu, 3 Apr 2008 07:32:37 -0700
From: "Mark Davis"
To: verdy_p@wanadoo.fr
Subject: Re: Slow Unicode.org server yesterday (March 31)
Cc: "Rick McGowan", cldr-users@unicode.org
In-Reply-To: <035e01c8946a$5a626060$0a01a8c0@HARNON>
References: <200804011546.m31FkaaI001923@unicode.org> <035e01c8946a$5a626060$0a01a8c0@HARNON>
X-archive-position: 452
X-list: cldr-users

We realize that the survey tool is slow -- certainly slower than we would like. And the speed will often be frustrating for people; we ask for your patience when using the tool. Whenever it seems particularly unresponsive, also please send a message to rick@unicode.org.

The code does use pretty standard components such as MySQL for a database. Behind the scenes, however, the tool is doing a lot of work -- for example, whenever you submit a new item, it is doing a full cross check of all items in the locale, running the tests against every item -- so as to make sure that there are no conflicts where items might depend on one another. Clearly this could be more optimized, but there are always trade-offs in terms of resources.

This is an open-source project, and all of the people developing and supporting the code are doing this in addition to their regular jobs -- just getting ready for this release took a lot of long nights and weekends. The code is visible in CVS (and could be downloaded and run, if you are adventurous), and if you'd like to look it over we'd always appreciate feedback and suggestions. However, this mailing list is not the forum for such a technical discussion; please file any such suggestions as bugs on the site, and the right person will follow up with you.

Mark

On Tue, Apr 1, 2008 at 7:36 PM, Philippe Verdy wrote:
> Rick McGowan wrote:
> > Yesterday afternoon I received a report that the Unicode.org
> > server was so slow that ordinary HTML pages were timing out.
> >
> > I found hundreds of suspicious open connections for IP address
> > 41.232.132.25 repeatedly downloading one huge file.
> > These repeated connections may be the result of repeated slow repsonse > from > the server. This may be a bug on a tool used on that Egyptian site in > Cairo > (but it may be caused also by repeated loose of sessions caused by its > Egyptian ISP, when the user just tries to get the same file, or could be > caused by a bogous downloader tool that makes excessive reppeated requests > in case of server-side failure or connection loss. > > > I blocked that IP address, and this resolved the problem. > > > > If you find that the server becomes much slower than usual, > > please feel free to report this to me and I will investigate. > > This explains why so many attempts to connect to the CLDR Survey failed. > > Or took minutes to validate each vote, or just display each page. > > I wanted to post a comment to the CLDR forum and my comment was lost as > well > after several attempts to resubmit it (no response from server). > > Anyway, the CLDR Survey is still very slow, and this has not changed a lot > since the last survey period last year. I do think it is taking too much > resources, or there's something wrong in the database design, and it takes > too much time trying to parse and reparse again the same XML documents or > regenerating them to get them in form. The XML generation for the > publication should be probably be done by extracting from a more > traditional > RDBMS (with tables, fields and indexes) rather than with large XML > documents, if this is waht it is doing. > > I don't understand why this is much slower to conduct the survey with very > few users connected, than driving a classic web site with cheap server > components like PHP and MySQL over Apache: all those websites can easily > manage lots of simultaneous users, and still perform complex scripting, > such > as running a Wiki engine. > > I know that the Survey is written with server-side Java components, but > certainly, it suffers a lot from very slow performance. 
I don't know which > server you use (Oracle?), but it seems that the code generating the HTML > pages from the survey application really would need some revamping. Also > the > XML database updates when submitting data seems to create extremely > complex > transactions to update lots of things, possibly in many shared XML > documents, and this apparently requires regenerating the complete set of > XML > files, instead of just updating a few fields; this causes lots of > contention > due to exclusive locks, and no more than a handful of users can decently > work conencted in the Survey app. > > Each action requires lot of patience, and we also frequently loose the > HTTP > session, having to logon again and revalidating the same screen: the > session > timeout is apparently very short, and given the response time of the > server, > about half of my attempts to update things fail. It's also completely > impossible to work with it at some hours. > > The process is then extremely discouraging, and we need lots of > perseverance. The bad thing is that this discouragement will have > consequences: not enough vetters to accept the changes, and even the UTC > members with higher privileges in the Survey don't come to update their > choice (this was seen in the last survey period, and even now, they have > apparently still not connected to the Survey since its recent reopening: > this is visible in the number of submitted votes per item, where I am > still > alone in my French locale...) > > So the problem is not that the CLDR Survey is slower than usual, it has > always been very slow! > Given the incresed number of items to validate, with a single category > containing more than 2100 items per locale, i.e. 27 pages containing 80 > items, that require about 15 minutes of work per page, due to long reposne > time and loss of sessions, reviewing this single category will require > about > 8 hours of continuous work for this category. 
Reviewing the complete set > will require more then 20 hours of continuous work, i.e. about 3 weeks > with > a reasonnable time of not more than one hour per day on average (and if I > connect to it every day, something that is not reasonnable). > > So we are really too far away in needed work time, and too near from the > time-limit of the submission period (which is about one month in length). > Last year I wanted to militate so that the Survey subission period would > be > continuous (permantently opened), with just the review period being > limited, > by working on a snapshot taken for a beta release. > > One solution would be to put the submission process on a separate server, > using more traditional solutions: an opened catalaog of open-source XML > files, and a versioning system like CVS. IT could also be hosted else > where > than on the Unicode site, using larger platforms made to host hundreds of > open-sourced projects; the issue would be related to the copyright and > licencing assignment required which may be incompatible with the CLDR > copyright and licencing policy. > > Also many vetters are not concerned by all the CLDR data categories. It > should be possible to extract several subschemas from the CLDR, and work > directly on them with a simpler system that will be more efficient. > > More works is needed also to reorganize some very large categories, > notably > the local time metazones. Having to navigate in categories with hundreds > or > thousands of entries that are not directly related to the same schema > level > is really a nightmare. IT's hard to consolidate and unify the > presentations > and possible errors are more difficult to locate. 
> No category should contain more than about 240 entries (3 pages), and
> then it should be possible to present all the items of the same category
> on the same page, so that we can make sure to get a coherent and complete
> set of usable data (the incoherence and incompleteness of most categories
> make the whole CLDR database unusable directly for any project without
> having to review it privately again before using it, so this cannot help
> create a standard, and migrations from one version of the CLDR database
> to another are unlikely to occur easily if this additional review is too
> lengthy).
>
> When I look at the various entries, it's clear that each category has a
> structure that is currently not used as it should be: there are essential
> entries, and many optional entries which take their values from default
> aliases inherited from Root. These additional entries should be in
> separate subcategories. Conversely, there are main categories that
> contain just a handful of entries. The contrast is evident: the Survey
> tool was not designed to work on categories at various levels of the XML
> data tree.
>
> So the problem is then not solved. I still experience frequent loss of
> sessions, and reviewing the same data again on large pages makes progress
> really slow, and it's probable that all other contributors experience the
> same repeated problems. The lack of reviewers has a consequence: the CLDR
> project is still not viable as a working, effective standard for
> collaboration, and every site just consults it from time to time to build
> its own locale data sets with its own uncorrected errors (as
> demonstrated on the Google search web site for example).
>
-- Mark ------=_Part_1813_12498859.1207233157425 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 7bit Content-Disposition: inline We realize that the survey tool is slow -- certainly slower than we would like.
And the speed will often be frustrating for people; we ask for your patience when using the tool. Whenever it seems particularly unresponsive, please also send a message to rick@unicode.org.

The code does use pretty standard components such as MySQL for a database. Behind the scenes, however, the tool is doing a lot of work -- for example, whenever you submit a new item, it is doing a full cross check of all items in the locale, running the tests against every item -- so as to make sure that there are no conflicts where items might depend on one another. Clearly this could be more optimized, but there are always trade-offs in terms of resources. This is an open-source project, and all of the people developing and supporting the code are doing this in addition to their regular jobs -- just getting ready for this release took a lot of long nights and weekends. The code is visible in CVS (and could be downloaded and run, if you are adventurous), and if you'd like to look it over we'd always appreciate feedback and suggestions.

However, this mailing list is not the forum for such a technical discussion; please file any such suggestions as bugs on the site, and the right person will follow up with you.

Mark

On Tue, Apr 1, 2008 at 7:36 PM, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
Rick McGowan wrote:
> Yesterday afternoon I received a report that the Unicode.org
> server was so slow that ordinary HTML pages were timing out.
>
> I found hundreds of suspicious open connections for IP address
> 41.232.132.25 repeatedly downloading one huge file.

These repeated connections may be the result of repeated slow responses from
the server. This may be a bug in a tool used at that Egyptian site in Cairo,
but it may also be caused by repeated loss of sessions caused by its
Egyptian ISP when the user just tries to get the same file, or by a
bogus downloader tool that makes excessive repeated requests in case of
server-side failure or connection loss.

> I blocked that IP address, and this resolved the problem.
>
> If you find that the server becomes much slower than usual,
> please feel free to report this to me and I will investigate.

This explains why so many attempts to connect to the CLDR Survey failed.

Or took minutes to validate each vote, or just display each page.

I wanted to post a comment to the CLDR forum and my comment was lost as well
after several attempts to resubmit it (no response from server).

Anyway, the CLDR Survey is still very slow, and this has not changed a lot
since the last survey period last year. I do think it is taking too many
resources, or there's something wrong in the database design, and it takes
too much time trying to parse and reparse the same XML documents or
regenerating them to get them in form. The XML generation for the
publication should probably be done by extracting from a more traditional
RDBMS (with tables, fields and indexes) rather than from large XML
documents, if this is what it is doing.

I don't understand why conducting the survey with very few users connected
is much slower than running a classic web site with cheap server
components like PHP and MySQL over Apache: all those websites can easily
manage lots of simultaneous users and still perform complex scripting, such
as running a Wiki engine.

I know that the Survey is written with server-side Java components, but
certainly, it suffers a lot from very slow performance. I don't know which
server you use (Oracle?), but it seems that the code generating the HTML
pages from the survey application really needs some revamping. Also, the
XML database updates when submitting data seem to create extremely complex
transactions that update lots of things, possibly in many shared XML
documents, and this apparently requires regenerating the complete set of XML
files instead of just updating a few fields; this causes a lot of contention
due to exclusive locks, and no more than a handful of users can decently
work connected to the Survey app.

Each action requires a lot of patience, and we also frequently lose the HTTP
session, having to log on again and revalidate the same screen: the session
timeout is apparently very short, and given the response time of the server,
about half of my attempts to update things fail. It's also completely
impossible to work with it at some hours.

The process is then extremely discouraging, and we need lots of
perseverance. The bad thing is that this discouragement will have
consequences: not enough vetters to accept the changes, and even the UTC
members with higher privileges in the Survey don't come to update their
choice (this was seen in the last survey period, and even now, they have
apparently still not connected to the Survey since its recent reopening:
this is visible in the number of submitted votes per item, where I am still
alone in my French locale...)

So the problem is not that the CLDR Survey is slower than usual: it has
always been very slow!
Given the increased number of items to validate, a single category
contains more than 2100 items per locale, i.e. 27 pages of 80 items, each
requiring about 15 minutes of work because of long response times and lost
sessions; reviewing this single category will require about 8 hours of
continuous work. Reviewing the complete set will require more than 20 hours
of continuous work, i.e. about 3 weeks at a reasonable pace of not more than
one hour per day on average (and only if I connect every day, which is not
reasonable).
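
The per-category estimate above can be re-derived with a few lines of
arithmetic. The sketch below uses only the figures quoted in the message
(2100 items, 80 items per page, 15 minutes per page); the function name is
made up for illustration. The raw page time comes to a little under 7 hours,
so the "about 8 hours" figure presumably includes some allowance for retries
after lost sessions.

```python
import math

def review_hours(items, items_per_page, minutes_per_page):
    """Return (pages, hours) needed to review `items` entries page by page."""
    pages = math.ceil(items / items_per_page)   # a partly filled page still counts
    return pages, pages * minutes_per_page / 60

pages, hours = review_hours(2100, 80, 15)
print(pages, hours)  # 27 pages, 6.75 hours
```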

So we are really too far off in the work time needed, and too close to the
time limit of the submission period (which is about one month in length).
Last year I wanted to campaign for the Survey submission period to be
continuous (permanently open), with just the review period being limited,
by working on a snapshot taken for a beta release.

One solution would be to put the submission process on a separate server,
using more traditional solutions: an open catalog of open-source XML
files, and a versioning system like CVS. It could also be hosted elsewhere
than on the Unicode site, using larger platforms made to host hundreds of
open-source projects; the issue would be the copyright and licensing
assignment required, which may be incompatible with the CLDR copyright and
licensing policy.

Also many vetters are not concerned by all the CLDR data categories. It
should be possible to extract several subschemas from the CLDR, and work
directly on them with a simpler system that will be more efficient.

More work is also needed to reorganize some very large categories, notably
the local time metazones. Having to navigate in categories with hundreds or
thousands of entries that are not directly related to the same schema level
is really a nightmare. It's hard to consolidate and unify the presentations,
and possible errors are more difficult to locate. No category should contain
more than about 240 entries (3 pages), and then it should be possible to
present all the items of the same category on the same page, so that we can
make sure to get a coherent and complete set of usable data (the incoherence
and incompleteness of most categories make the whole CLDR database unusable
directly for any project without having to review it privately again before
using it, so this cannot help create a standard, and migrations from one
version of the CLDR database to another are unlikely to occur easily if this
additional review is too lengthy).

When I look at the various entries, it's clear that each category has a
structure that is currently not used as it should be: there are essential
entries, and many optional entries which take their values from default
aliases inherited from Root. These additional entries should be in
separate subcategories. Conversely, there are main categories that
contain just a handful of entries. The contrast is evident: the Survey tool
was not designed to work on categories at various levels of the XML data
tree.

So the problem is then not solved. I still experience frequent loss of
sessions, and reviewing the same data again on large pages makes progress
really slow, and it's probable that all other contributors experience the
same repeated problems. The lack of reviewers has a consequence: the CLDR
project is still not viable as a working, effective standard for
collaboration, and every site just consults it from time to time to build
its own locale data sets with its own uncorrected errors (as
demonstrated on the Google search web site for example).






--
Mark ------=_Part_1813_12498859.1207233157425-- From verdy_p@wanadoo.fr Thu Apr 3 18:32:56 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Thu, 03 Apr 2008 18:32:57 -0600 (CST) Received: from smtp23.orange.fr (smtp23.orange.fr [80.12.242.50]) by unicode.org (8.12.11/8.12.11) with ESMTP id m340WtpH012347; Thu, 3 Apr 2008 18:32:56 -0600 Received: from me-wanadoo.net (localhost [127.0.0.1]) by mwinf2346.orange.fr (SMTP Server) with ESMTP id 161AD1C0008D; Fri, 4 Apr 2008 02:32:50 +0200 (CEST) Received: from HARNON (APoitiers-258-1-121-153.w90-50.abo.wanadoo.fr [90.50.96.153]) by mwinf2346.orange.fr (SMTP Server) with ESMTP id 1D6B11C00084; Fri, 4 Apr 2008 02:32:47 +0200 (CEST) X-ME-UUID: 20080404003247120.1D6B11C00084@mwinf2346.orange.fr Reply-To: From: "Philippe Verdy" To: "'CLDR list'" , References: <30b660a20804010714j26eb8143n8e5ce5f2c9a9b5cc@mail.gmail.com> <033e01c8940d$7e1fd440$0a01a8c0@HARNON> Subject: RE: Date formats with months : Chinese calendar issues Date: Fri, 4 Apr 2008 02:32:21 +0200 Organization: Ordinateur Personnel Message-ID: <03ea01c895eb$595d7790$0a01a8c0@HARNON> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_NextPart_000_03EB_01C895FC.1CE64790" X-Mailer: Microsoft Office Outlook 11 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198 In-Reply-To: <033e01c8940d$7e1fd440$0a01a8c0@HARNON> Thread-Index: AciUCHfDsENxWCZuS1etL67KpxFS1wAAEazAAHX/AKA= X-archive-position: 453 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: verdy_p@wanadoo.fr Precedence: bulk X-list: cldr-users This is a multi-part message in MIME format. 
------=_NextPart_000_03EB_01C895FC.1CE64790 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable

I'd like confirmation that the CLDR data for the so-called "Chinese"
calendar effectively uses one of the two well-known calendars:

* the Chinese Solar calendar (aka CHS): it has 12 equal months per Solar
year (subdivided in two halves). I think that for modern use, the months are
distributed according to the "True Sun" (i.e. with regular angle across the
year, so that winter months, near the perihelion, are shorter than summer
months, where the Earth is more distant from the Sun and so its orbital
angular speed is slower); these months are then about 30 or 31 days each
(they are quite similar to the Gregorian months, however they start from the
vernal equinox generally observed in Beijing); the main difference from the
Gregorian calendar is in the way years are counted and in the epoch
(I'll come to this subject below).

* the Chinese Lunar calendar (aka CHL): it has 12 or 13 months per Chinese
year; these years start so that the vernal equinox in Beijing is observed
during the first month, if I understand the CHL calendar correctly, but the
main definition of these months is that they always start on a new (black)
moon. Every 5 Solar years, there will be 61 or 62 new moons, so there will
be 61 or 62 months. Months are numbered from 1 to 12, but to keep the lunar
calendar aligned with Solar years, 1 or 2 intercalary months will be added
within 5 years; when this occurs, the intercalary month has the same name
(or number) as the previous month. To make a distinction, a leap indicator
must be added to qualify the second, intercalary month. In short notations,
this is indicated by adding an asterisk after the month number between 1 and
12. These months will have 29 or 30 days each, so days of the month are
numbered differently.
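
As a rough check of the "61 or 62 new moons every 5 Solar years" figure, the
count can be estimated from mean astronomical values. The two constants
below are the commonly quoted mean lengths of the tropical year and the
synodic month; they are not taken from the message, and real lunar months
vary around the mean, so this is only an order-of-magnitude sketch.

```python
TROPICAL_YEAR_DAYS = 365.2422    # mean solar (tropical) year
SYNODIC_MONTH_DAYS = 29.530588   # mean interval from new moon to new moon

def lunations_in(solar_years):
    """Average number of lunar months that fit in the given number of solar years."""
    return solar_years * TROPICAL_YEAR_DAYS / SYNODIC_MONTH_DAYS

months = lunations_in(5)
print(round(months, 2))           # 61.84 lunar months in 5 solar years
print(round(months - 5 * 12, 2))  # 1.84 months beyond the 60 regular ones
```

An average of about 1.84 extra months per 5 years is what forces 1 or 2
intercalary months into any given 5-year span.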

When looking at what the CLDR currently computes, it seems that it
effectively computes the Chinese Lunar calendar. If so, the months used in
date formats need a way to specify where and how the intercalary leap months
are designated.
For now, it contains date formats using "M" (numeric, one or two digits
without leading zero), "MM" (numeric with two digits), "MMM" (month
abbreviations) or "MMMM" (full month names). Are these date format
specifiers supposed to add the leap month indicator automatically? If so,
which indication ("*" for numeric formats, " leap" adjective or abbreviation
for other month specifiers) will be used?

I've seen that some formats contain an additional lower case letter L after
M or MM or MMM or MMMM: is it effectively this additional letter that
represents the leap month indicator? How can it be localised? Can it be
abbreviated?

I think that the "l" in "Ml" or "MMl" is used for the leap month indicator
for numeric months: this would typically be an asterisk or plus sign if this
is a leap month, but it could also be a space or a hyphen (instead of
nothing) in the "MMl" format that will typically be used to get fixed-length
numeric date formats. But this could also be a translatable resource for the
leap indicator in "narrow" format (in the Han script, this is just a single
ideograph added after the ideograph for the Solar month name, i.e. those
whose names end in "-yuè" when written in Pinyin).

I think that the "l" in "MMMl" (month name in "abbreviated" style) or
"MMMMl" (month name in "wide" style) would use the translatable "leap"
indicator, but it could also have an "abbreviated" style or "wide" style
(they may be needed even in Chinese when writing with Pinyin, but also in
other languages where a word like "leap" exists). So these leap indicators
will need localisable styles.
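
To make the supposition about "Ml" and "MMl" concrete, here is a minimal
sketch of how a formatter might treat a trailing "l" as the leap month
indicator for numeric months. This is only an illustration of the reading
proposed above, not the documented behaviour of CLDR or of any library; the
function name, its parameters and the default "*" marker are all
hypothetical.

```python
def format_month(month, is_leap, pattern, leap_marker="*"):
    """Format a numeric lunar month for the patterns "M", "MM", "Ml", "MMl".

    Hypothetical reading: a trailing "l" appends `leap_marker` for leap months.
    """
    wants_marker = pattern.endswith("l")
    digits = pattern.rstrip("l")
    if digits not in ("M", "MM"):
        raise ValueError(f"unsupported pattern: {pattern!r}")
    text = f"{month:02d}" if digits == "MM" else str(month)
    if wants_marker and is_leap:
        text += leap_marker
    return text

print(format_month(4, False, "Ml"))   # 4
print(format_month(4, True, "Ml"))    # 4*
print(format_month(4, True, "MMl"))   # 04*
print(format_month(4, True, "MM"))    # 04 (the leap status is lost)
```

The last call shows the ambiguity raised in the message: without the "l",
a regular month 4 and a leap month 4 produce the same text.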

I can't see any resource in the CLDR for describing/translating the leap
month indicator separately, nor any resource for translating the pair
(month number, leap indicator), such as "1*", into the relevant string (the
CLDR contains only resources for Chinese months "1" to "12", but none for
"1*" to "12*"...)

Without any clear way to format Chinese dates correctly, no date can be used
reliably, as they become ambiguous if just the month name or month number is
indicated, during the two months where an intercalary month occurs (this
occurs 1 or 2 times every 5 years, so this is not a rare case!) My opinion
is that this case was forgotten, by assuming incorrectly that the Chinese
calendar always uses 12 months, or by confusing the Chinese Lunar calendar
with the Chinese Solar calendar.

Note that nothing in the CLDR indicates which calendar is used: there's no
confusion in Chinese because the months of the two systems don't have the
same names (and days of the month are then different).

Consequence: Chinese date formats using "M" or "MM" or "MMM" or "MMMM"
without appending an "l" are all incorrect (or at least ambiguous).

Now let's speak about years:

Independently of the two calendars, the years are synchronized and counted
the same way (but traditionally, the Chinese New Year is celebrated
according to the first month of the Chinese Lunar calendar): years are
counted from 1 to 60 within a cycle of 60 years; then cycles may be counted
from an epoch, but evidently the CLDR seems to use the era to display the
cycle number (currently this is the 78th cycle, but some traditions indicate
this is the 79th, due to a disagreement about the date of the reign of the
mythical "Yellow Emperor" from which the calendar is said to originate).
However both traditions use the same synchronisation for their cycles.
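
The cycle and year numbering described above can be sketched as simple
integer arithmetic. The epoch is an assumption: 2637 BCE is one commonly
cited starting point and yields the "78th cycle" count mentioned in the
message, while an epoch 60 years earlier (2697 BCE) yields the "79th cycle"
count. The function and parameter names are made up for illustration, and
the sketch ignores the fact that the Chinese year starts in January or
February, so it only holds for dates after that year's Chinese New Year.

```python
def cycle_and_year(gregorian_year, epoch_offset=2636):
    """Map a Gregorian year to (cycle number, year within the 60-year cycle).

    epoch_offset=2636 assumes year 1 of cycle 1 began in 2637 BCE
    (astronomical year -2636); use 2696 for the 2697 BCE tradition.
    """
    elapsed = gregorian_year + epoch_offset
    return elapsed // 60 + 1, elapsed % 60 + 1

print(cycle_and_year(1984))        # (78, 1)
print(cycle_and_year(2008))        # (78, 25)
print(cycle_and_year(2008, 2696))  # (79, 25)
```

Both epochs give the same year within the cycle and differ only in the cycle
number, which matches the remark that the two traditions share the same
synchronisation.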

Note that the 60 years in a cycle also have a name, which is composed of one
of 10 elements and one of 12 animals: each year, both the element and animal
are incremented cyclically; this naming associating one of 10 elements and
one of 12 animals is also used for assigning names to days cyclically,
without interruption, but I'll leave this aside. The only important thing
for now is to know that the "y" or "yy" date specifier references the year
number (in the 60-year cycle), so it varies from 1 to 60. After this, a new
cycle is started. Currently, the CLDR seems to compute the cycle generation
number and display it when a "G" date specifier is used in a date format.

So a full year in the Chinese calendar will need the format "y'x'G" or
"yy'x'G" (where 'x' is some literal separator; the needed number of 'y'
cannot be larger than 2, as the numbers are between 1 and 60; the number of
'G' cannot exceed 2 digits for now; you can use "GG" if you were talking
about very old dates during the 6 centuries after the supposed reign of the
mythical "Yellow Emperor", but this is very far before the Christian era and
there are no authoritative sources for dating anything precisely). Note that
the "G" specifier will often be removed from abbreviated years due to the
disagreement about the way the cycles are numbered. In addition, the
continuous cycle generation numbering may also be replaced by the name of
the reigning Emperor (very few emperors have reigned for more than one full
60-year cycle; when this occurred, they may have anticipated the name of
their successor). So "G" or "GG" is used like an era; however, in the
Chinese calendar they are numeric (I suppose that to use the short name or
full name of the Emperor, you'd need to use "GGG" or "GGGG", to designate
the "abbreviated" or "wide" styles for eras; otherwise, "G" and "GG" will
just be the continuous generation number from a very old epoch).

There are lots of suppositions above. But these translate into questions
about which data should be inserted in the CLDR Survey for the Chinese
calendar, and which formats are valid! What I can see is that:
* Chinese date formats that specify "M" or "MM" or "MMM" or "MMMM" alone
will be ambiguous; the "l" needs to be added.
* Leap month indicators are not clearly translatable or transcribable.
* Years in date formats that specify "yyyy" are not complete, but will
display two extra/unneeded zeroes before the two-digit year number (within
the 60-year cycle). Adding "G" or "GG" will be necessary for writing full
dates with precise years: replace "yyyy" by "yy'x'G" or something similar
for the "long" or "medium" style. Adding a "G" after "y" or "yy" is
optional, but not writing it will create something like the "Y2K bug" in
applications (when the century is not specified); omitting it will not be
enough to span the lifetime of a now very vast population in the world: it
can't be omitted from dates of birth for example; this may be possible in
articles speaking about news and expected events, but not in articles
speaking about history.

I'd like your opinion about this, and how the CLDR project wants to manage
the Chinese calendar, and how the applications using this data are supposed
to work with the CLDR data for this calendar.

------=_NextPart_000_03EB_01C895FC.1CE64790 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable
= ------=_NextPart_000_03EB_01C895FC.1CE64790-- From rick@unicode.org Fri Apr 4 16:55:05 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Fri, 04 Apr 2008 17:01:07 -0600 (CST) Received: from izanami (c-71-202-247-55.hsd1.ca.comcast.net [71.202.247.55]) by unicode.org (8.12.11/8.12.11) with SMTP id m34Msxxk028084; Fri, 4 Apr 2008 16:54:59 -0600 Message-Id: <200804042254.m34Msxxk028084@unicode.org> To: unicode@unicode.org Subject: Unicode 5.1 Released Date: Fri, 4 Apr 2008 14:54:58 -0800 From: Rick McGowan received: by Apple.Mailer (2.95.2) X-archive-position: 454 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: rick@unicode.org Precedence: bulk X-list: cldr-users The Unicode Consortium is pleased to announce the release of Unicode 5.1. This release contains over 100,000 characters, and provides significant additions and improvements that extend text processing for software worldwide. Some of the key features are: increased security in data exchange, significant character additions for Indic and South East Asian scripts, expanded identifier specifications for Indic and Arabic scripts, improvements in the processing of Tamil and other Indic scripts, linebreaking conformance relaxation for HTML and other protocols, strengthened normalization stability, new case pair stability, plus others given below. The Version 5.1.0 data files and documentation are final and posted on the Unicode site. In addition to updated existing files, implementers will find new test data files (for example, for linebreaking) and new XML data files that encapsulate all of the Unicode character properties. For details, see the page for Unicode 5.1.0 at http://www.unicode.org/versions/Unicode5.1.0/. A major feature of Unicode 5.1.0 is the enabling of ideographic variation sequences. These sequences allow standardized representation of glyphic variants needed for Japanese, Chinese, and Korean text. 
The first registered collection, from Adobe Systems, is now available at http://www.unicode.org/ivd/.

Unicode 5.1 contains significant changes to properties and behavioral specifications. Several important property definitions were extended, improving linebreaking for Polish and Portuguese hyphenation. The Unicode Text Segmentation Algorithms, covering sentences, words, and characters, were greatly enhanced to improve the processing of Tamil and other Indic languages. The Unicode Normalization Algorithm now defines stabilized strings and provides guidelines for buffering. Standardized named sequences are added for Lithuanian, and provisional named sequences for Tamil.

Unicode 5.1.0 adds 1,624 newly encoded characters. These additions include characters required for Malayalam and Myanmar and important individual characters such as Latin capital sharp s for German. Version 5.1 extends support for languages in Africa, India, Indonesia, Myanmar, and Vietnam, with the addition of the Cham, Lepcha, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai scripts. Scholarly support includes important editorial punctuation marks, as well as the Carian, Lycian, and Lydian scripts, and the Phaistos disc symbols. Other new symbol sets include dominoes, Mahjong, dictionary punctuation marks, and math additions. This latest version of the Unicode Standard has exactly the same character assignments as ISO/IEC 10646:2003 plus Amendments 1 through 4.

The Unicode Collation Algorithm (UCA), the core standard for sorting all text, is also being updated at the same time (see http://www.unicode.org/reports/tr10/). The major changes in UCA include coverage of all Unicode 5.1 characters, tightened conformance for canonical equivalence, clearer definitions of internationalized search and matching, specifications of parameters for customizing collation, and definitions of collation folding. There are also important clarifications on the use of contractions (such as "ch" in Slovak) in collation.
The next version of the Unicode locale project (CLDR) is also being prepared on the basis of Unicode 5.1, and is now open for public data submission (see http://www.unicode.org/cldr/).

From verdy_p@wanadoo.fr Sat Apr 5 12:52:41 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Sat, 05 Apr 2008 12:52:41 -0600 (CST) Received: from smtp23.orange.fr (smtp23.orange.fr [193.252.22.126]) by unicode.org (8.12.11/8.12.11) with ESMTP id m35IqeXL012057; Sat, 5 Apr 2008 12:52:41 -0600 Received: from me-wanadoo.net (localhost [127.0.0.1]) by mwinf2317.orange.fr (SMTP Server) with ESMTP id 9FF0F7000089; Sat, 5 Apr 2008 20:52:34 +0200 (CEST) Received: from HARNON (APoitiers-258-1-121-153.w90-50.abo.wanadoo.fr [90.50.96.153]) by mwinf2317.orange.fr (SMTP Server) with ESMTP id 243907000083; Sat, 5 Apr 2008 20:52:34 +0200 (CEST) X-ME-UUID: 20080405185234148.243907000083@mwinf2317.orange.fr Reply-To: From: "Philippe Verdy" To: "'Mark Davis'" Cc: "'Rick McGowan'" , References: <200804011546.m31FkaaI001923@unicode.org> <035e01c8946a$5a626060$0a01a8c0@HARNON> <30b660a20804030732g44af2f7erf41a264104bba46@mail.gmail.com> Subject: RE: Slow Unicode.org server yesterday (March 31) Date: Sat, 5 Apr 2008 20:52:04 +0200 Organization: Ordinateur Personnel Message-ID: <043f01c8974e$2456a300$0a01a8c0@HARNON> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198 In-Reply-To: <30b660a20804030732g44af2f7erf41a264104bba46@mail.gmail.com> Thread-Index: AciVm5J6CSdFVvYRRsW0Bsta41LxfwBsAHlQ X-archive-position: 455 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: verdy_p@wanadoo.fr Precedence: bulk X-list: cldr-users

From: cldr-users-bounce@unicode.org [mailto:cldr-users-bounce@unicode.org] On Behalf Of Mark Davis
> The code does use pretty standard components such as MySQL for a
database.

I suspect that the tool is very slow because it runs on the same host as the whole Unicode web site, and because the MySQL server may be reached over a network link shared with the Internet traffic received or generated by the main Apache server that runs the site.

As the tool seems to make many connections to the MySQL instance, it may also be constantly opening and closing sessions, which adds to the time needed to validate each single item. Validation currently takes about 2 minutes per item, which is really excessive; most of that time is probably spent opening and closing SQL sessions or performing queries.

Even a single user trying to validate, or just cast a vote for, all existing items in a single locale will not be able to cover the full volume within one month. The system also does not scale: with two users, the time is multiplied by 2 or more. Validating about 40 locales completely, with at least 3 or 4 vetters per locale, will require 150-200 times more, so it would be impossible to complete the survey in one month with the current solution. At least you've realized it, because the survey has now been closed.

Really, a change in architecture is needed, and the survey will need to be postponed until a dramatic improvement is made both to the code and to its supporting architecture (location of Apache/Tomcat/MySQL, IP routing, proxying, tuning the needed memory...). I still think the code performs too many checks every time, for items that are not even related: why does the validation of a Coptic calendar entry need to perform a full check of all calendars and formats?

My opinion is that, even if the tool really is performing lots of queries behind the scenes, it should not take so much time. So the problem seems to be in its physical deployment.
Did you investigate the possibility of running the Survey Tool on its own server, possibly behind the main www.unicode.org Apache server, which would just act as a proxy? The Java code would then run on the same machine as the MySQL engine, without any connection through an overloaded network link.

If installing a proxy in the main Apache server is too complicated, why not create a separate domain name for it (if you have another public IP available that is configured behind the site's firewall)? The link to the survey would then just redirect to the new separate server used for conducting surveys (say: survey.unicode.org/cldr/).

> Behind the scenes, however, the tool is doing a lot of work -- for example, whenever you submit a new item, it is doing a full cross check of all items in the locale, running the tests against every item -- so as to make sure that there are no conflicts where items might depend on one another. Clearly this could be more optimized, but there are always trade-offs in terms of resources. This is an open-source project, and all of the people developing and supporting the code are doing this in addition to their regular jobs -- just getting ready for this release took a lot of long nights and weekends. The code is visible in CVS (and could be downloaded and run, if you are adventurous), and if you'd like to look it over we'd always appreciate feedback and suggestions.

I've not seen any address where this CVS repository is visible... Nothing is written on the CLDR web site about it.
From verdy_p@wanadoo.fr Sat Apr 5 22:15:26 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Sat, 05 Apr 2008 22:15:26 -0600 (CST) Received: from smtp23.orange.fr (smtp23.orange.fr [193.252.22.126]) by unicode.org (8.12.11/8.12.11) with ESMTP id m364FPTO026203; Sat, 5 Apr 2008 22:15:26 -0600 Received: from me-wanadoo.net (localhost [127.0.0.1]) by mwinf2326.orange.fr (SMTP Server) with ESMTP id 0F8107000087; Sun, 6 Apr 2008 06:15:20 +0200 (CEST) Received: from HARNON (APoitiers-258-1-121-153.w90-50.abo.wanadoo.fr [90.50.96.153]) by mwinf2326.orange.fr (SMTP Server) with ESMTP id 9A03A7000082; Sun, 6 Apr 2008 06:15:19 +0200 (CEST) X-ME-UUID: 20080406041519630.9A03A7000082@mwinf2326.orange.fr Reply-To: From: "Philippe Verdy" To: , "'Mark Davis'" Cc: "'Rick McGowan'" , References: <200804011546.m31FkaaI001923@unicode.org> <035e01c8946a$5a626060$0a01a8c0@HARNON> <30b660a20804030732g44af2f7erf41a264104bba46@mail.gmail.com> <043f01c8974e$2456a300$0a01a8c0@HARNON> Subject: RE: Slow Unicode.org server yesterday (March 31) Date: Sun, 6 Apr 2008 06:14:49 +0200 Organization: Ordinateur Personnel Message-ID: <045a01c8979c$c1b053f0$0a01a8c0@HARNON> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198 In-Reply-To: <043f01c8974e$2456a300$0a01a8c0@HARNON> Thread-Index: AciVm5J6CSdFVvYRRsW0Bsta41LxfwBsAHlQABIGtpA= X-archive-position: 456 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: verdy_p@wanadoo.fr Precedence: bulk X-list: cldr-users I've also noted similar SQL issues (with missing joins and non selective indexes) in the SQL tables and queries present in the "org.unicode.cldr.web.Vetting" class. 
The costly join with CLDR_DATA, which is most probably missing a "distinct" selection clause or not using it appropriately, is in missingImpliedVotes. It reads very inefficiently as (formatted for easier understanding):

    select distinct
        CLDR_DATA.submitter,
        CLDR_DATA.base_xpath
    from CLDR_DATA
    where CLDR_DATA.submitter is not null
      and CLDR_DATA.locale = ?
      and not exists(select *
                     from CLDR_VET
                     where CLDR_DATA.locale = CLDR_VET.locale
                       and CLDR_DATA.base_xpath = CLDR_VET.base_xpath
                       and CLDR_DATA.submitter = CLDR_VET.submitter
                    )

This performs a full join of the two tables CLDR_DATA and CLDR_VET, even though there is a restriction on the locale in the CLDR_DATA table: the index on CLDR_DATA is not selective enough on the locale, and subqueries are not optimized by MySQL in the query execution plan.

I think it is best to just perform this query:

    select distinct
        CLDR_DATA.submitter,
        CLDR_DATA.base_xpath
    from CLDR_DATA
    where CLDR_DATA.submitter is not null
      and CLDR_DATA.locale = ?

and store the rows of the SQL ResultSet in a HashSet "x" using "x.add()", then execute another query listing the distinct rows that match the join:

    select distinct
        CLDR_DATA.submitter,
        CLDR_DATA.base_xpath
    from CLDR_VET,  /* smallest table first */
         CLDR_DATA  /* largest table second */
    where CLDR_VET.locale = ?
      /* and CLDR_VET.submitter is not null ; not needed due to the join: */
      and CLDR_DATA.locale = CLDR_VET.locale
      and CLDR_DATA.base_xpath = CLDR_VET.base_xpath
      and CLDR_DATA.submitter = CLDR_VET.submitter /* both are implicitly not null */

(This join is already in the best form for the query execution plan (QEP), even in the absence of table statistics for the query optimizer, provided that you are actually maintaining the index statistics so that their selectivity is correctly evaluated.)

And then remove the rows of this second ResultSet from the HashSet using "x.remove()". What remains in "x" is the set of (submitter, base_xpath) pairs with no matching row in CLDR_VET, which is what the original query returned.
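A minimal sketch of the client-side difference described above, in Java. It assumes that "submitter" and "base_xpath" are both integer ids and that the two ResultSets come from the two replacement queries; the class and method names are hypothetical and are not taken from the Survey Tool sources.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.AbstractMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: replaces the "not exists" subquery by a set difference in the client. */
final class ImpliedVoteDiff {

    /** Builds a (submitter, base_xpath) key; both columns are assumed to be integer ids. */
    static Map.Entry<Integer, Integer> key(int submitter, int baseXpath) {
        return new AbstractMap.SimpleImmutableEntry<Integer, Integer>(submitter, baseXpath);
    }

    /** Reads a two-column ResultSet (submitter, base_xpath) into a set of keys. */
    static Set<Map.Entry<Integer, Integer>> readKeys(ResultSet rs) throws SQLException {
        Set<Map.Entry<Integer, Integer>> keys = new HashSet<>();
        while (rs.next()) {
            keys.add(key(rs.getInt(1), rs.getInt(2)));
        }
        return keys;
    }

    /** Returns the keys of allSubmitted that have no counterpart in alreadyVoted. */
    static Set<Map.Entry<Integer, Integer>> missing(
            Set<Map.Entry<Integer, Integer>> allSubmitted,
            Set<Map.Entry<Integer, Integer>> alreadyVoted) {
        // Copy first so that the caller's set is left untouched.
        Set<Map.Entry<Integer, Integer>> x = new HashSet<>(allSubmitted);
        for (Map.Entry<Integer, Integer> voted : alreadyVoted) {
            x.remove(voted);
        }
        return x;
    }

    public static void main(String[] args) {
        // In-memory stand-ins for the rows returned by the two queries.
        Set<Map.Entry<Integer, Integer>> all = new HashSet<>();
        all.add(key(1, 10));
        all.add(key(1, 11));
        all.add(key(2, 10));
        Set<Map.Entry<Integer, Integer>> voted = new HashSet<>();
        voted.add(key(1, 10));
        // Two pairs remain: (1, 11) and (2, 10); HashSet iteration order is unspecified.
        System.out.println(missing(all, voted));
    }
}
```

With a live connection, readKeys() would be called once per query and the two results passed to missing(). Both result sets are limited to a single locale, so holding them in memory should be affordable, but that remains to be measured against the real table sizes.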
Note also that the (non-unique) index on CLDR_VET also uses the "locale" column as its first column (the second column is base_xpath); here too, it is not very selective.

Remember that MySQL is not Oracle or Sybase: its query optimizer is still not smart enough to find the correct QEP, but you can help it in the application.

This query, which uses a "where exists()" subquery, could be performed as a single query with a join:
- staleResult:

    select
        CLDR_RESULT.id, CLDR_RESULT.base_xpath
    from CLDR_RESULT
    where CLDR_RESULT.locale = ?
      and exists(select *
                 from CLDR_VET
                 where CLDR_VET.base_xpath = CLDR_RESULT.base_xpath
                   and CLDR_VET.locale = CLDR_RESULT.locale
                   and CLDR_VET.modtime > CLDR_RESULT.modtime
                )

which is equivalent (provided that the above does not expect duplicate rows), but faster, as an ordinary inner join with a distinct clause:

    select distinct
        CLDR_RESULT.id,
        CLDR_RESULT.base_xpath
    from CLDR_VET,
         CLDR_RESULT
    where CLDR_VET.locale = ?
      and CLDR_RESULT.base_xpath = CLDR_VET.base_xpath
      and CLDR_RESULT.locale = CLDR_VET.locale
      and CLDR_RESULT.modtime < CLDR_VET.modtime

However, this query is a bit less critical than those using the voluminous CLDR_DATA, as it does not involve data from multiple submitters (users), sources (external XML data files) and locales, but only from multiple locales (this is still a lot of data, because of the growing product of the number of locales and the number of localizable xpaths in CLDR).
The same could be done in other statements that use "where not exists()", which MySQL visibly cannot optimize correctly. (Sybase and Oracle optimizers can choose between several strategies: using a single join for the inner subquery and then computing the difference of hashed sets, or merging sorted sets or sorted indexes. MySQL is much more limited and must be guided, even at the price of returning two successive, larger ResultSets to the client, which then computes the difference itself.) These statements are:
- missingResults
- missingResults2

Finally, it looks like many isolated queries (returning fully indexed single rows, or no row at all) are performed within a loop. The checks could be extracted from the loop and performed directly by a join on the whole SQL collection in which they are searched, or by reading the tables completely into memory instead of querying the external SQL engine for each checked row of large collections.

Notably, all the queries that reference the CLDR_VET table (for the current submitter only, for queries that are bound to a precise value of it) and the CLDR_OUTPUT table using a precise locale could simply cache that data in memory: it would be the same as parsing and storing in memory the complete content of an LDML file for that locale, which is not excessive. In that case, a single SQL query to the engine retrieves these tables, and the rest can be checked directly from HashSets in Java without requerying MySQL again and again.
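The caching idea above could look roughly like the following Java sketch. It is a simplification under stated assumptions: the cache is keyed by locale only (not by submitter), base_xpath is assumed to be an integer id, and the loader function stands in for the single SQL query per locale. The class name and its methods are hypothetical, not part of the Survey Tool.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical per-locale cache: one bulk load, then membership tests in memory. */
final class LocaleVoteCache {
    private final Map<String, Set<Integer>> votedXpathsByLocale = new HashMap<>();
    // Stand-in for "run one SQL query and collect all base_xpath ids for this locale".
    private final Function<String, Set<Integer>> loader;

    LocaleVoteCache(Function<String, Set<Integer>> loader) {
        this.loader = loader;
    }

    /** Loads the locale on first use, then answers from the in-memory HashSet. */
    boolean hasVote(String locale, int baseXpath) {
        return votedXpathsByLocale.computeIfAbsent(locale, loader).contains(baseXpath);
    }

    /** Drops a locale so that the next lookup reloads it, e.g. after a new vote is stored. */
    void invalidate(String locale) {
        votedXpathsByLocale.remove(locale);
    }

    public static void main(String[] args) {
        // The lambda replaces the database here; it returns a fixed set for any locale.
        LocaleVoteCache cache = new LocaleVoteCache(
                locale -> new HashSet<Integer>(Arrays.asList(10, 11)));
        System.out.println(cache.hasVote("ca", 10)); // true
        System.out.println(cache.hasVote("ca", 12)); // false
    }
}
```

This version is not thread-safe; in a servlet container it would need synchronization or a concurrent map, and the invalidation policy would have to follow the places where votes are actually written.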
From verdy_p@wanadoo.fr Sun Apr 6 17:37:59 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Sun, 06 Apr 2008 17:37:59 -0500 (CDT) Received: from smtp23.orange.fr (smtp23.orange.fr [80.12.242.97]) by unicode.org (8.12.11/8.12.11) with ESMTP id m36Mbwbb025288; Sun, 6 Apr 2008 17:37:58 -0500 Received: from me-wanadoo.net (localhost [127.0.0.1]) by mwinf2354.orange.fr (SMTP Server) with ESMTP id BCF8370000B2; Mon, 7 Apr 2008 00:37:52 +0200 (CEST) Received: from HARNON (APoitiers-258-1-121-153.w90-50.abo.wanadoo.fr [90.50.96.153]) by mwinf2354.orange.fr (SMTP Server) with ESMTP id 0D71970000B1; Mon, 7 Apr 2008 00:37:52 +0200 (CEST) X-ME-UUID: 20080406223752551.0D71970000B1@mwinf2354.orange.fr Reply-To: From: "Philippe Verdy" To: , "'Mark Davis'" Cc: "'Rick McGowan'" , References: <200804011546.m31FkaaI001923@unicode.org> <035e01c8946a$5a626060$0a01a8c0@HARNON> <30b660a20804030732g44af2f7erf41a264104bba46@mail.gmail.com> <043f01c8974e$2456a300$0a01a8c0@HARNON> <045a01c8979c$c1b053f0$0a01a8c0@HARNON> Subject: RE: Slow Unicode.org server yesterday (March 31) Date: Mon, 7 Apr 2008 00:37:20 +0200 Organization: Ordinateur Personnel Message-ID: <046501c89836$c6de6a20$0a01a8c0@HARNON> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Mailer: Microsoft Office Outlook 11 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198 In-Reply-To: <045a01c8979c$c1b053f0$0a01a8c0@HARNON> Thread-Index: AciVm5J6CSdFVvYRRsW0Bsta41LxfwBsAHlQABIGtpAAJtX/wA== Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by unicode.org id m36Mbwbb025288 X-archive-position: 457 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: verdy_p@wanadoo.fr Precedence: bulk X-list: cldr-users Several suggestions: (1) SPLITTING CLDR_DATA Another optimization that I've tried in my local copy is to split the CLDR_DATA in two distinct tables with the same data structure except that: - 
one table, CLDR_USERDATA, contains the "submitter" column (which becomes NOT NULL and references CLDR_USER.id), for the data submitted by registered users, but no "source" column.
- the other, CLDR_SRCDATA, contains the "source" column (which becomes NOT NULL and references CLDR_SRC.id) but no "submitter" column (when an XML source file is used, it does not cast a vote on behalf of its author in the survey itself).

This could be useful for the stability of the database, because the "source" data does not change as often and is not changed directly from the web interface by submitters, but by administrative tools that integrate versioned XML files, plain text files or HTML data sources, which are parsed and converted to LDML format to link them with the appropriate XPaths.

This simplifies several SQL queries, avoids testing columns for null, and reduces the cardinality of requests. For some other uses (like computing the winning data item and the position of "blue stars" in the list of items), it may require a union of two requests. Unions are not very costly (their cardinality is just additive, instead of generating products of cardinalities as with the current data model), and even if the SQL engine does not support unions, they are easily supported in the Java client code using HashSets.

(2) "alt" ATTRIBUTES AND THEIR POSITION IN XPATHS

Also, I wonder whether the varchar(50) columns "type" (which generates the XML attribute "[@type=XXX]"), "alt_type" (which generates the XML attribute "[@alt=XXX]"), and "alt_proposed" (which generates the XML attribute "[@proposed=XXX]") are best placed in the CLDR_DATA table; my opinion is that they are better placed in the CLDR_XPATHS table, along with the other attribute values that are stored in the XPath for the parent elements.
Notably, the "type" column is most probably not part of the vetting process (if new types can be added in the Survey, that is most probably a new "proposed" item, for example when proposing new types of date format). Instead, the "type" column is used for enumerating currency codes, territory codes, or language codes, but these should not be added by users directly; they should be added first in the Root locale and made available through the shared "CLDR_XPATHS" table.

The same is true for the "alt" column (I think it is used for name variants such as the short names of Hong Kong S.A.R. and Macao S.A.R., whose distinction is not possible with the "type" attribute, which references the same ISO 3166-1 territory codes). The "alt" variants do not necessarily need data in all locales and may have locale-specific uses, but if they span many more distinct "type"s and several locales, they should become subelements with their own XPath (like the "type" variants for narrow/abbreviated/wide), after updating the LDML data model. If "alt" is used, it should be only for identifiable general-purpose variants in all locales (when values are missing, they should resolve to the default variant through the system of XPath aliases, which can be initialized in the data for the Root locale and then inherited by default in all locales).

(3) DATA DUMPS AND TESTING

When I tried to play with the code, I had no data dump of the tables, so I just used the available XML data sources. (OK, I don't want the content of the CLDR_USERS or CLDR_ORG tables, but it would at least be useful if you provided some metrics about how many rows there are in each table, so that I can generate enough random users and orgs to be significant when looking at the performance and behavior of the SQL code.)
One way to increase collaboration on the code would be to extract the dump and then anonymize the private fields: all user names randomized except the default administrator user; all passwords randomized (they may just be the same as the generated user name); emails replaced by "nobody@example.com", possibly with the numeric user id as a prefix; and all organization names randomized. To randomize the names, just use a generic prefix and append the numeric id. Then make those anonymized dumps available somewhere in the repository, in a public "/dump" subfolder of "/org/unicode/web/cldr/data". This extracted dump could be similar to the data dumps you've already integrated in the administrative menus, with just the anonymizing procedure added for sensitive private fields.

It would also be useful to have some information about how the properties files are filled in on the site. I've used the default values integrated in the Java code, then changed just a few settings to indicate the parameters needed to connect to the MySQL database, i.e. the JDBC parameters. I had to guess some other properties to make it work, by looking in the Java sources to see how they are actually used. This is much less critical.

> -----Original Message-----
> From: cldr-users-bounce@unicode.org [mailto:cldr-users-bounce@unicode.org] On Behalf Of Philippe Verdy
> Sent: Sunday, 6 April 2008 06:15
> To: verdy_p@wanadoo.fr; 'Mark Davis'
> Cc: 'Rick McGowan'; cldr-users@unicode.org
> Subject: RE: Slow Unicode.org server yesterday (March 31)
>
> I've also noted similar SQL issues (with missing joins and non selective indexes) in the SQL tables and queries present in the "org.unicode.cldr.web.Vetting" class.
> > > > > > From yury.tarasievich@gmail.com Tue Apr 8 06:53:46 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Tue, 08 Apr 2008 06:53:47 -0500 (CDT) Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.174]) by unicode.org (8.12.11/8.12.11) with ESMTP id m38BrgZ1031533 for ; Tue, 8 Apr 2008 06:53:46 -0500 Received: by ug-out-1314.google.com with SMTP id c2so727113ugf.27 for ; Tue, 08 Apr 2008 04:53:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:date:to:subject:content-type:mime-version:content-transfer-encoding:message-id:user-agent:from; bh=s/rGfGD8+VejzwIkCB/27hL8iSgNwLh9JOzRZjxfxb8=; b=Jsuf3nwiqz+o9/00rqMf8hWn56qSWV7vY7SV6MBgT43Rxhfee/oDnf5IMuuqA8vFHnsBKFzS7LZdhTqURPmimrnwOc6ubNL6eO79gKB4N4Uoh90pwBqkon9dvRdLVdF2DjhqNDBYvEjlu2y+UbiblyszhfTXhYLAhTRoC9u70sE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=date:to:subject:content-type:mime-version:content-transfer-encoding:message-id:user-agent:from; b=X6TkWDgbT/FHoLc4/QxaFtFttGiQLeQdqEr2EciX75zxEbWzEdKIE8o76BvxuXCHHtqJhML6euDfvSNWvMVVCyAt/tJONIa1whijllLO9FwGp08g6rPnyFQ0/Tpq14hyPSvglP0CnBT0/fwg8iVPSXhopZHZ4lXjTUVnpGf6Zh4= Received: by 10.67.20.19 with SMTP id x19mr2976598ugi.48.1207655621499; Tue, 08 Apr 2008 04:53:41 -0700 (PDT) Received: from mobile29.grsu.by ( [81.30.81.200]) by mx.google.com with ESMTPS id 13sm2925065ugb.0.2008.04.08.04.53.18 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 08 Apr 2008 04:53:25 -0700 (PDT) Date: Tue, 08 Apr 2008 14:53:49 +0300 To: cldr-users@unicode.org Subject: Is anybody controlling the lingual quality out there? 
Content-Type: text/plain; charset=utf-8 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Message-ID: User-Agent: Opera Mail/9.27 (Linux) From: Yury Tarasievich X-archive-position: 458 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: yury.tarasievich@gmail.com Precedence: bulk X-list: cldr-users

Is anybody at all controlling the lingual quality of what goes into locales?

Right now somebody has screwed up the Belarusian locale info completely, using the proposal for the orthography of the alternative (not academic) literary norm as a source. If made in good faith, that was not very competent. If made intentionally, well, it confirms anyway that real quality control is absent at CLDR, at least where smaller entities are concerned.

Please recall, we had this sort of situation before, at the 1.5.1 stage. Nothing has changed since then, and the setup is practically an invitation to screwups such as this.

I repeat my former proposition: to strike out the Belarusian locale completely, if the controlling organisation can't ensure the elementary quality of what it publishes. I personally don't wish to waste any more of my time disputing issues as elementary as this.
--- From yury.tarasievich@gmail.com Tue Apr 8 07:06:24 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Tue, 08 Apr 2008 07:06:24 -0500 (CDT) Received: from fg-out-1718.google.com (fg-out-1718.google.com [72.14.220.159]) by unicode.org (8.12.11/8.12.11) with ESMTP id m38C6NmR003756 for ; Tue, 8 Apr 2008 07:06:24 -0500 Received: by fg-out-1718.google.com with SMTP id 13so1683982fge.9 for ; Tue, 08 Apr 2008 05:06:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:date:to:subject:content-type:mime-version:references:content-transfer-encoding:message-id:in-reply-to:user-agent:from; bh=258P6WfBu75zAvVkxMA2wOh+MG2+cbB2uJE2OtPzP8g=; b=URfvaUKFJWEv0XWv2FLSt33jBM4xideDB4hh3csx0QeeeDkYk9wtyIKMPW4SK7Ve8fVd/pkWPsyarrzvYU6N2Ahk0vwCzZHNMvjTTv5hU5wrJmUqRqHSrvu+aDOs+7CBRokScycP8bk1wVHOn6v3wnI/RUNL608aKv1avCV2egk= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=date:to:subject:content-type:mime-version:references:content-transfer-encoding:message-id:in-reply-to:user-agent:from; b=uL4seQH2Aj0bjg2n52mq5M/fM5tvyy8yvCp3pGlovCgfzL4csQ08XsJGrimHzZ1j6/o++Oy+fZVJ7dWCXodHHhbfmqX2wLuoCSBmIxET1BcwYSaeY1N1uGEAw15yDFQe4qEfol60awwz2StJVs2P32ZU6RlVaADi9FyvvUoXrEM= Received: by 10.86.96.18 with SMTP id t18mr4483745fgb.13.1207656382824; Tue, 08 Apr 2008 05:06:22 -0700 (PDT) Received: from mobile29.grsu.by ( [81.30.81.200]) by mx.google.com with ESMTPS id z37sm14892424ikz.1.2008.04.08.05.06.15 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 08 Apr 2008 05:06:21 -0700 (PDT) Date: Tue, 08 Apr 2008 15:06:28 +0300 To: cldr-users@unicode.org Subject: Re: Is anybody controlling the lingual quality out there? 
Content-Type: text/plain; charset=utf-8 MIME-Version: 1.0 References: Content-Transfer-Encoding: 7bit Message-ID: In-Reply-To: User-Agent: Opera Mail/9.27 (Linux) From: Yury Tarasievich X-archive-position: 459 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: yury.tarasievich@gmail.com Precedence: bulk X-list: cldr-users

On Tue, 08 Apr 2008 14:53:49 +0300, Yury Tarasievich wrote:
> Is anybody at all controlling the lingual quality of what goes into locales?
...

Also, I would like to have an authoritative comment on the following snippet from a comment in the CLDR "be" forum, please:

> I personally believe that "umbrella" code should represent the set of rules which is used by most people/writers/users who write/read in the language. In case of Belarusian, it would be the set of rules "????????? ???????? ????????".

So, is the CLDR locale acting as some kind of Internet polling tool, or is it representing normative lingual information, after all?

To be clear, I'm NOT endorsing any other assertions in that forum post.

---
From eik@iki.fi Tue Apr 8 07:54:25 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Tue, 08 Apr 2008 07:54:25 -0500 (CDT) Received: from smtp6.pp.htv.fi (smtp6.pp.htv.fi [213.243.153.40]) by unicode.org (8.12.11/8.12.11) with ESMTP id m38CsORu027478 for ; Tue, 8 Apr 2008 07:54:25 -0500 Received: from inspiron (cs181253188.pp.htv.fi [82.181.253.188]) by smtp6.pp.htv.fi (Postfix) with ESMTP id 6436A5BC062; Tue, 8 Apr 2008 15:54:23 +0300 (EEST) Reply-To: From: "Erkki I. Kolehmainen" To: "'Yury Tarasievich'" , Subject: RE: Is anybody controlling the lingual quality out there?
Date: Tue, 8 Apr 2008 15:54:22 +0300 Message-ID: <000001c89977$aab78150$0200a8c0@inspiron> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook, Build 10.0.6838 Importance: Normal Thread-Index: AciZcEAMn+7FqstXRnKYbjqGnb5VIgABGAcg In-Reply-To: X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by unicode.org id m38CsORu027478 X-archive-position: 460 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: eik@iki.fi Precedence: bulk X-list: cldr-users I'd like to point out that there is no possibility that the CLDR TC members individually or as a group could have authoritative knowledge of the many languages in the data base. The committee has set up a mechanism for the experts in a given language and territory to agree what values should be used. The voting mechanism and the discussion forum (with the stated references) should lead to a consensus with appreciable quality, provided that the contributors are prepared to consider also other opinions than their own. If everyone just inputs his or her view of the right data and then walks away without participating in the ensuing debate, nobody can be proud of the end result. Sincerely, Erkki I. Kolehmainen Tilkankatu 12 A 3, FI-00300 Helsinki, Finland Puh. (09) 4368 2643, 0400 825 943; Tel. +358 9 4368 2643, +358 400 825 943 -----Alkuperäinen viesti----- Lähettäjä: cldr-users-bounce@unicode.org [mailto:cldr-users-bounce@unicode.org] Puolesta Yury Tarasievich Lähetetty: 8. huhtikuuta 2008 14:54 Vastaanottaja: cldr-users@unicode.org Aihe: Is anybody controlling the lingual quality out there? Is anybody at all controlling the lingual quality of what goes into locales? 
Right now somebody screwed up the Belarusian locale info completely, using the proposal for the orthography of the alternative (not academic) literary norm as a source. If made in good faith, that was not very competent. If made intentionally, well, anyway it confirms that real quality control is absent at CLDR, at least where smaller entities are concerned. Please recall, we had this sort of situation before, at the 1.5.1 stage. And so, nothing's changed since then, and the setup is practically an invitation to screwups such as this. I repeat my former proposition - to strike out the Belarusian locale completely, if the controlling organisation can't provide the elementary quality of what it publishes. I personally don't wish to waste my time anymore disputing issues as elementary as that. --- From dzo@bisharat.net Tue Apr 8 07:58:24 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Tue, 08 Apr 2008 07:58:24 -0500 (CDT) Received: from kabissa.org (113166.kabissa.org [72.32.199.201]) by unicode.org (8.12.11/8.12.11) with ESMTP id m38CwL8w028783 for ; Tue, 8 Apr 2008 07:58:24 -0500 Received: (qmail 26881 invoked from network); 8 Apr 2008 07:58:20 -0500 Received: from pool-71-127-61-145.washdc.east.verizon.net (HELO IBM92AA25595C4) (71.127.61.145) by 72.32.229.137 with SMTP; 8 Apr 2008 07:58:19 -0500 From: "Don Osborn" To: "'Yury Tarasievich'" , Cc: "'A12n tech support'" References: In-Reply-To: Subject: RE: Is anybody controlling the lingual quality out there? 
(CLDR) Date: Tue, 8 Apr 2008 08:58:15 -0400 Message-ID: <02ee01c89978$3654c6f0$a2fe54d0$@net> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" X-Mailer: Microsoft Office Outlook 12.0 Thread-Index: AciZb2JRVIpNIBHvSwiSdPLEcG0ePQABkx3Q Content-Language: en-us Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by unicode.org id m38CwL8w028783 X-archive-position: 461 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: dzo@bisharat.net Precedence: bulk X-list: cldr-users I can't answer your question regarding quality control of specific locales in the Common Locale Data Repository (CLDR). However, I would like to mention that this sort of issue will likely become more frequent as we get into languages with shorter written traditions, multiple dialects, multiple authorities setting standards for things like orthographies, changes in those standards, and sometimes differences between what people do and the formal standards (such has been noted for example with some languages in Nigeria due in part to lack of Unicode fonts). There is also the question of how well the 639-1&2 and 639-3 codes work for locales and localization for some languages / macrolanguages / language clusters. As we get into a period when locales will be generated for more African languages, I think these issues will all come to the fore. There will probably be a need for more proactive vetting than we've seen with locales up until now - i.e., an announcement and general call for comments may not suffice. Instead, experts in specific languages and from relevant countries may have to be sought out for their input. 
Don Osborn Bisharat.net > -----Original Message----- > From: cldr-users-bounce@unicode.org [mailto:cldr-users- > bounce@unicode.org] On Behalf Of Yury Tarasievich > Sent: Tuesday, April 08, 2008 7:54 AM > To: cldr-users@unicode.org > Subject: Is anybody controlling the lingual quality out there? > > Is anybody at all controlling the lingual quality of what goes into > locales? > ... From aaron@ijigg.com Tue Apr 8 15:22:19 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Tue, 08 Apr 2008 15:22:19 -0500 (CDT) Received: from fg-out-1718.google.com (fg-out-1718.google.com [72.14.220.154]) by unicode.org (8.12.11/8.12.11) with ESMTP id m38KMEH6030179 for ; Tue, 8 Apr 2008 15:22:18 -0500 Received: by fg-out-1718.google.com with SMTP id 13so1829398fge.9 for ; Tue, 08 Apr 2008 13:22:12 -0700 (PDT) Received: by 10.86.89.4 with SMTP id m4mr5379511fgb.14.1207686132573; Tue, 08 Apr 2008 13:22:12 -0700 (PDT) Received: by 10.86.53.15 with HTTP; Tue, 8 Apr 2008 13:22:12 -0700 (PDT) Message-ID: <756ec90c0804081322h1e156bd3xbfe3945d414b90b8@mail.gmail.com> Date: Tue, 8 Apr 2008 13:22:12 -0700 From: "Aaron Brick" To: cldr-users@unicode.org Subject: Re: Is anybody controlling the lingual quality out there? In-Reply-To: <000001c89977$aab78150$0200a8c0@inspiron> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_20846_5571492.1207686132524" References: <000001c89977$aab78150$0200a8c0@inspiron> X-archive-position: 462 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: aaron@ijigg.com Precedence: bulk X-list: cldr-users ------=_Part_20846_5571492.1207686132524 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline i rather agree with erkki. 
the discussion period is the reason there are only occasional official releases; the repository contents of a collaborative project at an arbitrary moment in time are relatively suspect, and indeed there to be edited and debated. aaron. On Tue, Apr 8, 2008 at 5:54 AM, Erkki I. Kolehmainen wrote: > I'd like to point out that there is no possibility that the CLDR TC > members > individually or as a group could have authoritative knowledge of the many > languages in the data base. The committee has set up a mechanism for the > experts in a given language and territory to agree what values should be > used. The voting mechanism and the discussion forum (with the stated > references) should lead to a consensus with appreciable quality, provided > that the contributors are prepared to consider also other opinions than > their own. If everyone just inputs his or her view of the right data and > then walks away without participating in the ensuing debate, nobody can be > proud of the end result. > > Sincerely, > > Erkki I. Kolehmainen > Tilkankatu 12 A 3, FI-00300 Helsinki, Finland > Puh. (09) 4368 2643, 0400 825 943; Tel. +358 9 4368 2643, +358 400 825 943 > > -----Alkuperäinen viesti----- > Lähettäjä: cldr-users-bounce@unicode.org > [mailto:cldr-users-bounce@unicode.org] Puolesta Yury Tarasievich > Lähetetty: 8. huhtikuuta 2008 14:54 > Vastaanottaja: cldr-users@unicode.org > Aihe: Is anybody controlling the lingual quality out there? > > > Is anybody at all controlling the lingual quality of what goes into > locales? > > Right now somebody screwed the Belarusian locale info completely, using > the > proposal for the orthography of the alternative (not academic) literary > norm > as a source. If made in good faith that was not very competent. If made > intentionally, well, anyway it confirms that real quality control is > absent > at CLDR, at least where smaller entities are concerned. 
> > Please recall, we had this sort of situation before, at the 1.5.1 stage. > And > so, nothing's changed from then, and the setup is practically an > invitation > to a screwups such as this. > > I repeat my former proposition - to strike out the Belarusian locale > completely, if the controlling organisation can't provide the elementary > quality of what it publishes. I personally don't wish to waste my time > anymore disputing the issues elementary as that. > > --- > > > >
------=_Part_20846_5571492.1207686132524-- From srl@icu-project.org Tue Apr 8 20:26:53 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Tue, 08 Apr 2008 20:26:54 -0500 (CDT) Received: from k2smtpout01-01.prod.mesa1.secureserver.net (k2smtpout01-02.prod.mesa1.secureserver.net [64.202.189.89]) by unicode.org (8.12.11/8.12.11) with SMTP id m391QrkN032556 for ; Tue, 8 Apr 2008 20:26:53 -0500 Received: (qmail 24797 invoked from network); 9 Apr 2008 01:26:47 -0000 Received: from unknown (HELO ssl.icu-project.org) (208.109.248.225) by k2smtpout01-01.prod.mesa1.secureserver.net (64.202.189.88) with ESMTP; 09 Apr 2008 01:26:47 -0000 Received: from [129.42.184.35] (helo=tintin-009043104101.sanjose.ibm.com) by ssl.icu-project.org with esmtpsa (TLSv1:AES256-SHA:256) (Exim 4.66) (envelope-from ) id 1JjOyU-0000fv-3w; Tue, 08 Apr 2008 18:19:42 -0700 Message-ID: <47FC1B4D.6020107@icu-project.org> Date: Tue, 08 Apr 2008 18:26:37 -0700 From: "Steven R. Loomis" User-Agent: Thunderbird 2.0.0.12 (Macintosh/20080213) MIME-Version: 1.0 To: Yury Tarasievich CC: cldr-users@unicode.org Subject: Re: Is anybody controlling the lingual quality out there? References: In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 463 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: srl@icu-project.org Precedence: bulk X-list: cldr-users Yury, Yury Tarasievich wrote: > Is anybody at all controlling the lingual quality of what goes into locales? > That is precisely the job of the vetting community, of which you are a member. Who else would be the 'anybody' which you propose? Also, nothing has gone 'into' locales for 1.6 yet, it is still in progress. Nothing prevents registered vetters from adding other proposals. 
> Right now somebody screwed the Belarusian locale info completely, The vetting period is not over, so I do not understand the finality of this statement. CLDR 1.6 is still in progress. > ... using the proposal for the orthography of the alternative (not academic) literary norm as a source. If that is what is happening, then that is the proposal. You can discuss this in the forum (which it seems you are doing). If there really should be two subcodes for Belarusian as mentioned, you will need to explain this so that it can be understood by the CLDR Technical Committee. You can use this forum or the feedback form link at the bottom of the Survey Tool. > If made in good faith that was not very competent. If made intentionally, well, anyway it confirms that real quality control is absent at CLDR, at least where smaller entities are concerned. > Again, the vetting process *is* the quality control. We aren't asking Belarusian experts to come up with some data, so that some mysterious other Belarusian experts can perform quality control. You did the right thing by discussing this in the Belarusian forum; please continue to do so, so that there is some agreement on what the data should be. > Please recall, we had this sort of situation before, at the 1.5.1 stage. And so, nothing's changed from then, and the setup is practically an invitation to a screwups such as this. > I went back and re-read the emails from you which I could find, circa January through July of last year. I think there were specific issues where imported Google data was given too much weight, and that was corrected, as you remember. You asked about academic review, but it is still not clear what you might propose. Submitting data which is marked as referencing authoritative sources would be good, as might having those with academic credentials review the data - via the Survey Tool. 
Is this an issue where there are alternate legitimate orthographies which should be considered separately, or should be marked as alternative variants? I see six vetters listed for Belarusian, perhaps the others will chime in on the forum. The other four have not participated in Belarusian yet this cycle. > I repeat my former proposition — to strike out the Belarusian locale completely, if the controlling organisation can't provide the elementary quality of what it publishes. I personally don't wish to waste my time anymore disputing the issues elementary as that. > How do you propose that the 'elementary quality' be provided? Are you really saying that no items in Belarusian are correct? In a quick check of a couple of items, at least one of them which was winning, you had voted for. Why should it be removed? -s From yury.tarasievich@gmail.com Thu Apr 10 01:17:59 2008 Received: with ECARTIS (v1.0.0; list cldr-users); Thu, 10 Apr 2008 01:18:00 -0500 (CDT) Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.171]) by unicode.org (8.12.11/8.12.11) with ESMTP id m3A6Hw83028276 for ; Thu, 10 Apr 2008 01:17:59 -0500 Received: by ug-out-1314.google.com with SMTP id c2so1239710ugf.27 for ; Wed, 09 Apr 2008 23:17:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:date:to:subject:content-type:mime-version:references:content-transfer-encoding:message-id:in-reply-to:user-agent:from; bh=+V3LjvNv5mv4qjSS4BjOlqubEwKY5v3yloy5oC4vY+M=; b=Hh7fxv6csjI60GYv9DpCkVnPvevlSqcG65wVMAePSUtKG8h9kcxSxUqfXd7a3dOLBo3ElEcXZtPU0kndxy/FmIXud3s0OKOsrDUDz/o7mK8E49qPbCGjfd0vTfIBJRQLJP+xjFDTMoa2ajUfv1epMw1h7YPfidlGV97Gn/vAfpA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=date:to:subject:content-type:mime-version:references:content-transfer-encoding:message-id:in-reply-to:user-agent:from; 
b=O6WDBraRUJq/Kh7FGPcwYGEf3CXdGJC6l3L65OMhnGL/OvEFGjoo9UcDXTDhnWnuI2LNlwFvQuTaYN7ivvkqXjLL5/PwAuzOt/9pcjRidpT1lSODQuR12zfym7sVjSXbyx681BCWkOLB0erVpfUIuwCDOzO7AS3MEX+mPZnP2Pw= Received: by 10.67.27.3 with SMTP id e3mr5663314ugj.22.1207808267109; Wed, 09 Apr 2008 23:17:47 -0700 (PDT) Received: from mobile29.grsu.by ( [81.30.81.200]) by mx.google.com with ESMTPS id 27sm14048394ugp.19.2008.04.09.23.17.30 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 09 Apr 2008 23:17:44 -0700 (PDT) Date: Thu, 10 Apr 2008 09:17:56 +0300 To: cldr-users@unicode.org Subject: Filed the issue #1679 on "Technical proposal on splitting the Belarusian dataset into two datasets, BE and BE-TARASK" Content-Type: text/plain; charset=utf-8 MIME-Version: 1.0 References: Content-Transfer-Encoding: 8bit Message-ID: In-Reply-To: User-Agent: Opera Mail/9.27 (Linux) From: Yury Tarasievich X-archive-position: 464 X-ecartis-version: Ecartis v1.0.0 Sender: cldr-users-bounce@unicode.org Errors-to: cldr-users-bounce@unicode.org X-original-sender: yury.tarasievich@gmail.com Precedence: bulk X-list: cldr-users Per the expert advice, I'm duplicating the issue #1679, which I just filed, concerning the "Technical proposal on splitting the Belarusian dataset into two datasets, BE and BE-TARASK". I'm attaching the text of the issue. *** Technical proposal on splitting the Belarusian dataset into two datasets, BE and BE-TARASK In the modern Belarusian language, as we know it in 2000s, there objectively exist two literary norms (cf. [Compendium 2003], [Klimaw 2004]). One is academic (normative, literary), existing in a relatively unchanged form for 75 years, taught at state school educational system. This norm is used for the official and state uses of Belarusian language. It is controlled by an Institute o