PRI

Public Review Issue #63: POSIX Data for CLDR

There is a new tool that creates POSIX locale data files from CLDR. It has been used to generate draft POSIX locale data files for public review. We encourage review of this data; any feedback can be filed at http://unicode.org/cldr/filing_bug_reports.html. (Note: the CLDR 1.3 freeze date has been extended to allow for feedback on this and other locale data.)

The draft files are available in http://unicode.org/cldr/data/common/posix/. Because POSIX locale data files are specific to charset, there are two kinds of files:

generated with the UTF-8 charset, such as http://unicode.org/cldr/data/common/posix/hi_IN.UTF-8.src
- These include all the locales
generated with other charsets, such as http://unicode.org/cldr/data/common/posix/de_DE.ISO8859-15.src
- These include just a few samples, for checking.

The main remaining issue at this point appears to be the repertoire of characters to be used for the UTF-8 locales. Currently the mechanism is to use the following heuristic:

start with the exemplar characters (main + auxiliary)
add the collation tailored characters (including the contractions and prefixes for LC_COLLATE),
add characters in the same script (for script values associated with sets of letters, and excluding Han),
add characters in the same block (excluding letters and unassigned characters),
add ASCII
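
The steps above can be sketched roughly as follows. This is only an illustration of the heuristic as described here, not the tool's actual code. The names build_repertoire, script_of, block_of and universe are hypothetical: Python's standard library has no script or block lookup, so the caller supplies both, along with the candidate characters to consider. The sketch also assumes that "same script" and "same block" are judged against the exemplar and collation characters, and that the script values not associated with letters are Common, Inherited and Unknown.

```python
import unicodedata

# Assumption: script values that do not denote a set of letters.
NON_LETTER_SCRIPTS = {"Common", "Inherited", "Unknown"}


def build_repertoire(exemplars, collation_strings, universe, script_of, block_of):
    """Return the set of characters chosen by the heuristic.

    exemplars         -- main + auxiliary exemplar strings
    collation_strings -- collation-tailored strings (may be contractions)
    universe          -- candidate characters to consider adding
    script_of         -- hypothetical lookup: character -> script name
    block_of          -- hypothetical lookup: character -> block name
    """
    # Steps 1 and 2: exemplars plus collation characters; contractions and
    # prefixes are split into their individual characters.
    seed = {ch for s in exemplars for ch in s}
    seed |= {ch for s in collation_strings for ch in s}

    scripts = {script_of(ch) for ch in seed} - NON_LETTER_SCRIPTS - {"Han"}
    blocks = {block_of(ch) for ch in seed}

    repertoire = set(seed)
    for ch in universe:
        # Step 3: same script, excluding Han and non-letter script values.
        if script_of(ch) in scripts:
            repertoire.add(ch)
        # Step 4: same block, excluding letters and unassigned characters.
        if block_of(ch) in blocks:
            category = unicodedata.category(ch)
            if not category.startswith("L") and category != "Cn":
                repertoire.add(ch)

    # Step 5: add ASCII.
    repertoire.update(chr(cp) for cp in range(0x80))
    return repertoire
```

In practice the two lookups would come from a library that exposes the Unicode Script and Block properties; the sketch leaves that choice to the caller.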

Feedback on this and other issues is welcome.

Notes:

The tools can override the above heuristic and force the main repertoire set and/or the collation repertoire set by specifying a UnicodeSet pattern on the command line:
- GeneratePOSIX -u [\\u0000-\\U10ffff] -x [\\u0000-\\U10ffff] -m de_DE -c UTF-8
There are known bugs in the draft; those will show up in http://www.jtcsv.com/cgibin/locale-bugs/posix.
The LC_CTYPE values are taken from Unicode.
Since the POSIX model can't represent all of CLDR, the tool needs to "downcast" to the closest version, e.g.
- d_fmt, t_fmt: using the medium format
- d_t_fmt: using the long format
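
As a rough illustration of this downcast, the choice of CLDR format length per POSIX keyword can be written as a small lookup table. The names DOWNCAST, SAMPLE_FORMATS and select_patterns are hypothetical, the sample patterns are made up for illustration rather than taken from CLDR, and the sketch assumes that d_t_fmt draws on both the long date and the long time format. Converting the selected CLDR patterns into POSIX conversion specifications is a separate step that is not shown.

```python
# POSIX LC_TIME keyword -> (CLDR format length, kinds of pattern used).
# The lengths follow the notes above; pairing d_t_fmt with both a date
# and a time pattern is an assumption of this sketch.
DOWNCAST = {
    "d_fmt": ("medium", ("date",)),
    "t_fmt": ("medium", ("time",)),
    "d_t_fmt": ("long", ("date", "time")),
}

# Illustrative patterns only; not actual CLDR data for any locale.
SAMPLE_FORMATS = {
    "date": {"medium": "dd.MM.yyyy", "long": "d. MMMM yyyy"},
    "time": {"medium": "HH:mm:ss", "long": "HH:mm:ss z"},
}


def select_patterns(keyword, cldr_formats):
    """Return the CLDR patterns that would feed the given POSIX keyword."""
    length, kinds = DOWNCAST[keyword]
    return [cldr_formats[kind][length] for kind in kinds]


print(select_patterns("d_fmt", SAMPLE_FORMATS))    # ['dd.MM.yyyy']
print(select_patterns("d_t_fmt", SAMPLE_FORMATS))  # ['d. MMMM yyyy', 'HH:mm:ss z']
```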
LC_MESSAGES data will be updated with new data from CLDR 1.3.