[Unicode] Common Locale Data Repository : Bug Tracking

CLDR Ticket #3009 (closed: fixed)

Opened 9 years ago

Last modified 8 years ago

POSIX Collation is not correct in CLDR POSIX Locale (1.9M1)

Reported by: suliu@…
Owned by: emmons
Component: posix
Review: mark


The CLDR POSIX locale data includes POSIX locale definitions in http://www.unicode.org/repos/cldr/trunk/posix/en_US_POSIX.UTF-8.src. The collation rules in that data do NOT follow the POSIX collation standard.
According to the Open Group POSIX locale specification (http://www.opengroup.org/onlinepubs/007908799/xbd/locale.html#tag_005_002), POSIX collation should be based on ASCII values.
Best Regards
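
For illustration, the ordering the reporter expects (collation by ASCII value) can be sketched in Python as a plain code point sort. This is only a minimal sketch of the expected result, not of how CLDR or ICU builds the collation; the function name `posix_sort` and the sample strings are made up for this example.

```python
def posix_sort(strings):
    """Sort strings by code point value.

    For pure ASCII input this is the same as ordering by ASCII value,
    which is the behavior the reporter expects from the POSIX locale.
    """
    return sorted(strings, key=lambda s: [ord(ch) for ch in s])


if __name__ == "__main__":
    # Hypothetical sample data: digits sort before uppercase letters,
    # uppercase before '_', and '_' before lowercase letters.
    words = ["apple", "Banana", "cherry", "_tmp", "10"]
    print(posix_sort(words))  # ['10', 'Banana', '_tmp', 'apple', 'cherry']
```

Under this ordering "Banana" sorts before "apple", which differs from the default CLDR root collation, where case is only a tertiary difference.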


Change History

comment:1 Changed 9 years ago by emmons

  • Owner changed from somebody to emmons
  • Status changed from new to assigned
  • Milestone set to 1.9RC

comment:2 Changed 8 years ago by emmons

  • Status changed from assigned to accepted
  • Review set to mark

comment:3 Changed 8 years ago by mark

The proposed solution won't work. That is, it orders everything after the space character (U+0020), but that will cause every character to be ignorable if alternate sorting is on.

It also needs a test to verify that the ordering matches http://www.unicode.org/repos/cldr/trunk/posix/en_US_POSIX.UTF-8.src. Otherwise, we have no assurance that there are no deviations. The simplest approach would probably be to write a small program on a POSIX platform that is known to implement that file correctly, and sort the characters <X01>..<X5FD2>, appending "", "A", "a", "Ä", "ä" to each character to check for differences at the different levels. Then check in that output file and write a simple ICU test verifying that the strings are in the same order in ICU with this collation order.
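
The test data described in this comment could be generated roughly as follows. This is a sketch under assumptions: it treats <X01>..<X5FD2> as the code points U+0001..U+5FD2, and the helper names `build_test_strings` and `first_mismatch` are hypothetical. Producing the reference order on a POSIX platform and the candidate order in ICU is left out, since both depend on the environment.

```python
SUFFIXES = ["", "A", "a", "Ä", "ä"]


def build_test_strings(first=0x01, last=0x5FD2, suffixes=SUFFIXES):
    """Return every character in [first, last] followed by each suffix.

    Assumption: <X01>..<X5FD2> corresponds to U+0001..U+5FD2.
    """
    return [chr(cp) + suffix
            for cp in range(first, last + 1)
            for suffix in suffixes]


def first_mismatch(reference_order, candidate_order):
    """Return (index, expected, actual) for the first difference, or None.

    Both lists are assumed to hold the same strings, so they have the
    same length and differ only in order.
    """
    for index, (expected, actual) in enumerate(zip(reference_order,
                                                   candidate_order)):
        if expected != actual:
            return index, expected, actual
    return None


if __name__ == "__main__":
    strings = build_test_strings()
    print(len(strings))  # 24530 characters * 5 suffixes = 122650
    # Placeholder comparison: a real test would load the order produced on
    # the POSIX platform and compare it with the order produced by ICU.
    print(first_mismatch(sorted(strings), sorted(strings)))  # None
```

Returning the first mismatch rather than a plain boolean makes a failing test point directly at the character where the two orderings diverge.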

comment:4 Changed 8 years ago by mark

  • Priority changed from assess to major
  • Review mark deleted

Sending back with comments.

comment:5 Changed 8 years ago by emmons

  • Review set to mark

comment:6 Changed 8 years ago by pedberg

  • Cc mark, emmons, umesh, pedberg added

This change breaks processing of this collation in genrb. I don't know if the problem is the changes to the collation file itself, or a problem in LDML2ICUConverter, or a problem in genrb. Here is what I see:

  1. The changes in common/collation/en_US_POSIX.xml are:
    -					<reset>&#x20;</reset>
    -					<pc>&#x21;&#x22;&#x23;&#x24;&#x25;&#x26;&#x27;&#x28;</pc>
    -					<pc>&#x29;&#x2a;&#x2b;&#x2c;&#x2d;&#x2e;&#x2f;</pc>
    +					<reset>A</reset>
    +					<pc>&#x20;&#x21;&#x22;&#x23;&#x24;&#x25;&#x26;&#x27;</pc>
    +					<pc>&#x28;&#x29;&#x2a;&#x2b;&#x2c;&#x2d;&#x2e;&#x2f;</pc>

The delta in the en_US_POSIX.txt produced by LDML2ICUConverter is as follows (the new line looks strange, perhaps the range logic in LDML2ICUConverter is confused by having A in two places in the rules?):

-            Sequence{"&'\u0020'<*'!'-'('')'-'/'0-9':'-'@'A-Z'['-'`'a-z'{'-'\u007F'"}
+            Sequence{"&A<*'\u0020'-'''('-'/'0-9':'-'@'A-Z'['-'`'a-z'{'-'\u007F'"}
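
The suspicion that A appears in two places can be checked with a small sketch. It assumes that the ranges in the two Sequence lines expand to the contiguous runs U+0021..U+007F (old) and U+0020..U+007F (new); the helper `anchor_positions` is hypothetical. It only shows that the reset anchor recurs inside the tailored sequence and does not establish that this is what genrb rejects.

```python
def anchor_positions(reset_anchor, tailored_chars):
    """Return the indexes at which the reset anchor occurs in the tailoring."""
    return [index for index, ch in enumerate(tailored_chars)
            if ch == reset_anchor]


# Assumed expansion of the ranges in the two Sequence lines.
OLD_TAILORED = [chr(cp) for cp in range(0x21, 0x80)]  # after reset U+0020
NEW_TAILORED = [chr(cp) for cp in range(0x20, 0x80)]  # after reset "A"

if __name__ == "__main__":
    # Old rule: the anchor (space) is not among the tailored characters.
    print(anchor_positions(" ", OLD_TAILORED))  # []
    # New rule: the anchor "A" is also tailored, via the A-Z range.
    print(anchor_positions("A", NEW_TAILORED))  # [33]
```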

And the error in genrb with that new txt file is:

DYLD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$DYLD_LIBRARY_PATH  ../bin/genrb -k -i ./out/build/icudt46l -s ./coll -d ./out/build/icudt46l/coll en_US_POSIX.txt
./coll/en_US_POSIX.txt:16: warning: %Collation could not be constructed from CollationElements - check context!
./coll/en_US_POSIX.txt:14: parse error. Stopped parsing resource with U_INVALID_FORMAT_ERROR
couldn't parse the file en_US_POSIX.txt. Error:U_INVALID_FORMAT_ERROR
make[1]: *** [out/build/icudt46l/coll/en_US_POSIX.res] Error 3

comment:7 Changed 8 years ago by mark

  • Status changed from accepted to closed
  • Resolution set to fixed
