Re: The Unicode Standard and ISO from Mark Davis ☕️ via Unicode on 2018-06-08 (Unicode Mail List Archive)

From: Mark Davis ☕️ via Unicode <unicode_at_unicode.org>
Date: Fri, 8 Jun 2018 13:40:21 +0200

Mark

On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
unicode_at_unicode.org> wrote:

> On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> Marcel Schneider via Unicode <unicode_at_unicode.org> wrote:
>
> > Thank you for confirming. All witnesses concur to invalidate the
> > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > After being invented in its actual form, sorting was standardized
> > simultaneously in ISO/IEC 14651 and in Unicode Collation Algorithm,
> > the latter including practice‐oriented extra features.
>
> The UCA contains features essential for respecting canonical
> equivalence. ICU works hard to avoid the extra effort involved,
> apparently even going to the extreme of implicitly declaring that
> Vietnamese is not a human language.

A bit over the top, eh?

> (Some contractions are not
> supported by ICU!)

I'm guessing you mean https://unicode.org/cldr/trac/ticket/10868, which
nicely outlines a proposal for dealing with a number of problems with
Vietnamese.
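
As an aside on the canonical-equivalence point in the quoted text: the sketch below is not UCA or ICU code. It uses only Python's standard unicodedata module to show why a comparison has to treat several code point sequences for one Vietnamese letter as the same string. The variable names and the choice of the letter U+1EC7 are illustrative assumptions.

```python
import unicodedata

# Five canonically equivalent spellings of LATIN SMALL LETTER E WITH
# CIRCUMFLEX AND DOT BELOW (U+1EC7), written with escapes for clarity.
forms = [
    "\u1ec7",          # fully precomposed
    "\u00ea\u0323",    # e-circumflex + combining dot below
    "\u1eb9\u0302",    # e-dot-below + combining circumflex
    "e\u0323\u0302",   # base letter + dot below + circumflex
    "e\u0302\u0323",   # base letter + circumflex + dot below
]

def nfd_key(s):
    """Sort key that compares strings by their NFD form (illustration only)."""
    return unicodedata.normalize("NFD", s)

raw_distinct = len(set(forms))                     # distinct as raw code points
normalized_distinct = len({nfd_key(s) for s in forms})  # distinct after NFD
```

A plain code point comparison sees five different strings here; comparing through nfd_key sees one. A real collator has to reach the same result, either by normalizing its input or by carrying enough extra mappings to make normalization unnecessary.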

We clearly don't support every sorting feature that various dictionaries
and agencies come up with. Sometimes it is because we can't (yet) see a
good way to do it:

   1. it might not be deterministic: many governmental standards or style
   sheets require "interesting" sorting, such as determining that "XI" is a
   Roman numeral (not the president of China) and sorting it as 11, or
   deciding when "St." is meant to be Street and when it is meant to be
   Saint (St. Stephen's St.)
   2. the prospective cost in memory, code complexity, or performance, or
   the time necessary to figure out how to handle complex requirements,
   doesn't seem
   to warrant adding it at this point. Now, if you or others are interested
   in proposing specific patches to address certain issues, then you can
   propose that. Best to make a proposal (ticket) before doing the work,
   because if the solution is very intricate, even the time necessary to
   evaluate the patch can be too much to fit into the schedule. For that
   reason, it is best to break up such tickets into small, tractable pieces.
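
To make the first point concrete, here is a minimal Python sketch (not ICU or CLDR code; every function name in it is hypothetical). It shows two equally mechanical sort keys that disagree on where "XI" belongs, because the string alone does not say whether it is a numeral or a name.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s):
    """Convert a string of Roman numeral letters to an integer."""
    total = 0
    # Pair each letter with its successor; a smaller value before a
    # larger one is subtracted (IX = 9), otherwise it is added.
    for ch, nxt in zip(s, s[1:] + " "):
        value = ROMAN[ch]
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total

def key_plain(s):
    """Treat every entry as ordinary text."""
    return s.casefold()

def key_roman(s):
    """Treat anything spelled only with Roman numeral letters as a number."""
    if re.fullmatch(r"[IVXLCDM]+", s):
        return (0, roman_to_int(s), "")
    return (1, 0, s.casefold())

names = ["XI", "Wu", "IX", "Xu"]
as_text = sorted(names, key=key_plain)     # ['IX', 'Wu', 'XI', 'Xu']
as_numbers = sorted(names, key=key_roman)  # ['IX', 'XI', 'Wu', 'Xu']
```

Neither result is wrong in itself. The second key would also read an ordinary word such as "MIX" as a number, so choosing between the two needs knowledge about the data that a general-purpose collator does not have.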

> The synchronisation is manifest in the DUCET
> collation, which seems to make the effort to ensure that some canonical
> equivalent will sort the same way under ISO/IEC 14651.
>
> > Since then,
> > these two standards are kept in synchrony uninterruptedly.
>
> But the consortium has formally dropped the commitment to DUCET in
> CLDR. Even when restricted to strings of assigned characters, the CLDR
> and ICU no longer make the effort to support the DUCET collation.
> Indeed, I'm not even sure that the DUCET is a tailoring of the root CLDR
> collation, even when restricted to assigned characters. Tailorings
> tend to have odd side effects; fortunately, they rarely if ever matter.
> CLDR root is a rewrite with modifications of DUCET; it has changes that
> are prohibited as 'tailorings'!
>

CLDR does make some tailorings to the DUCET to create its root collation,
notably adding special contractions of private use characters to allow for
tailoring support and indexes [
http://unicode.org/reports/tr35/tr35-collation.html#File_Format_FractionalUCA_txt
] plus the rearrangement of some characters (mostly punctuation and
symbols) to allow runtime parametric reordering of groups of characters (e.g.,
to put numbers after letters) [
http://unicode.org/reports/tr35/tr35-collation.html#grouping_classes_of_characters
].

   - If there are other changes that are not well documented, or if you
   think those features are causing problems in some way, please file a
   ticket.
   - If there is a particular change that you think is not conformant to
   UCA, please also file that.
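
For readers unfamiliar with the parametric reordering of character groups mentioned above, the idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how CLDR or ICU implement it: the three group names, group_of, and make_key are all hypothetical, and real collation keys have multiple levels that this ignores.

```python
def group_of(ch):
    """Assign a character to one of three coarse, made-up groups."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"

def make_key(group_order):
    """Build a sort key whose group ranking is chosen at run time."""
    rank = {group: i for i, group in enumerate(group_order)}

    def key(s):
        # Each character sorts first by the rank of its group,
        # then by its case-folded value within that group.
        return [(rank[group_of(ch)], ch.casefold()) for ch in s]

    return key

words = ["b2", "a", "42", "-x", "b"]

# Digits before letters.
digits_first = sorted(words, key=make_key(["other", "digit", "letter"]))
# Same data, same code, but numbers now sort after letters.
letters_first = sorted(words, key=make_key(["other", "letter", "digit"]))
```

With this input, digits_first is ['-x', '42', 'a', 'b', 'b2'] and letters_first is ['-x', 'a', 'b', 'b2', '42']. Only the group ranking passed in changes between the two calls. That kind of runtime switch works only if the characters of each group are kept together in the underlying table, which is the reason given above for the rearrangement.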

> Richard.
>
>
Received on Fri Jun 08 2018 - 06:41:01 CDT
