Re: PRI#203: UTS#10 (UCA) update : characters needed to avoid contractions or expansions

From: Mark Davis ☕ <>
Date: Wed, 31 Aug 2011 09:26:19 -0700

Thanks for bringing this up.

*— Il meglio è l’inimico del bene —*

On Tue, Aug 30, 2011 at 19:20, Philippe Verdy <> wrote:

> In the proposed update of UTS#10 (UCA), subject to the PRI #203 just
> posted, I note the following addition in section 3.3.2 (Contractions).
> "Characters of a contraction can be made to sort as separate
> characters with the insertion of any starter character. There are two
> characters, soft hyphen and U+034F COMBINING GRAPHEME JOINER that are
> particularly useful for this purpose. These can be used to separate
> contractions that would normally be weighted as units, such as Slovak
> ch or Danish aa. For more information, see Section 5.3 Use of
> Combining Grapheme Joiner."
> However, in a past discussion here on this Unicode mailing list, we
> discussed at length the fact that SOFT HYPHEN would not be
> appropriate to avoid contractions as it would also imply a break
> opportunity, which may be undesirable when the only intent was
> to prohibit things like contractions (for collation), or even
> ligatures.

The general character for that usage is the CGJ. In many languages the
soft-hyphen is also sufficient, but in languages (or circumstances) where it
isn't, the CGJ is the recommended character. Note the pointer to the
preexisting section on CGJ.

> Note also there's another collation-ignorable character for that,
> ZWNJ, which does not imply a break opportunity. But it is not clear if
> it implies that this also avoids the contraction, for collation
> purpose.

Because of the way the algorithm works, any starter character will break a
contraction, and thus any starter that is invisible could be used.
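To make that concrete, here is a minimal sketch of longest-match contraction
lookup. It is not the UCA algorithm itself: the table, the labels and the
function name collation_units are invented for illustration, there are no real
weights, and discontiguous matching is left out. It only shows why a starter
(canonical combining class 0) placed between "c" and "h" stops the "ch" entry
from matching.

```python
import unicodedata

CGJ = "\u034F"   # U+034F COMBINING GRAPHEME JOINER
SHY = "\u00AD"   # U+00AD SOFT HYPHEN
ZWNJ = "\u200C"  # U+200C ZERO WIDTH NON-JOINER

# Hypothetical toy table: each key maps to a label standing in for a
# collation element. "ch" is a contraction, as in Slovak.
TOY_TABLE = {"a": "A", "c": "C", "h": "H", "t": "T", "ch": "CH"}


def collation_units(text, table=TOY_TABLE):
    """Greedy longest-match lookup; unmapped characters are ignored."""
    units = []
    max_len = max(len(key) for key in table)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                units.append(table[chunk])
                i += size
                break
        else:
            # Not in the table (for example CGJ): emit nothing, but the
            # character has still interrupted the "c" + "h" sequence.
            i += 1
    return units


# All three invisible characters are starters (combining class 0).
starter_classes = [unicodedata.combining(ch) for ch in (CGJ, SHY, ZWNJ)]

print(collation_units("chata"))              # ['CH', 'A', 'T', 'A']
print(collation_units("c" + CGJ + "hata"))   # ['C', 'H', 'A', 'T', 'A']
```

In this toy model SHY and ZWNJ break the match in exactly the same way; the
difference between them lies in their side effects elsewhere, not in the
lookup.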

However, the purpose of characters other than CGJ is not contraction
breaking, and they should generally be avoided because of side effects (for
example, the purpose of the ZWNJ is ligature control). SHY is a bit
special, since it has 'traditionally' (in implementations) been used for
contraction breaking.

> What do you think about this added paragraph? Is it complete enough?
> Shouldn't other uses be shown?
> Note that contractions are normally not part of the DUCET, only part
> of advanced tailorings for specific languages or even just
> orthographies for specific dialects or bibliographic conventions.

Untrue. There are over 700 contractions in DUCET.
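For anyone who wants to check this against a copy of the DUCET data file, the
following is a rough sketch. It assumes the allkeys.txt layout of one entry per
line, with the code points to the left of ";" and comments after "#", and it
treats any entry with more than one code point on the left as a contraction.
The function name and the sample lines (including their weights) are made up
for illustration and are not taken from the real file.

```python
def count_contractions(lines):
    """Count entries whose left-hand side has more than one code point."""
    count = 0
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Skip blank lines and directives such as "@version".
        if not line or line.startswith("@"):
            continue
        code_points, _, _ = line.partition(";")
        if len(code_points.split()) > 1:
            count += 1
    return count


# Made-up sample in the assumed layout; the weights are not real DUCET values.
SAMPLE = """\
@version 0.0.0
# a comment line
0061 ; [.0001.0020.0002] # single code point
0063 ; [.0002.0020.0002] # single code point
0063 0068 ; [.0003.0020.0002] # two code points: counted as a contraction
""".splitlines()

print(count_contractions(SAMPLE))  # 1
```

To run it on the real data, pass an open file object for a local copy of
allkeys.txt instead of SAMPLE.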

> But given that the CLDR does not use directly the DUCET in its "root"
> locale, but also tailors it a bit (using a few expansions), so that
> the pure DUCET-only collation (with only the default weights, no
> contractions, no expansions) requires also some tailoring rules
> compared to the collation implied in the "root" locale, this may
> affect the CLDR "root" locale as well, which could in fine define some
> contractions by default (I just wonder how such contraction found in
> an inherited locale can be undone in a subsequent tailoring rule for a
> sublocale).

Intentionally, changes are often first implemented in CLDR before being
incorporated into DUCET. The goal is to have only necessary differences in
the root locale for CLDR. The latest changes in DUCET allow for removal of
most root locale tailorings in CLDR.

There is a tailoring mechanism in CLDR for disabling root (DUCET)
contractions.

> Another interesting question is: how can we encode in texts the fact
> that a character usually considered as a ligature in a language (that
> collates it as separate letters, even if the ligature is orthographic
> and not just typographic), should still be collated as only one letter
> ? In other words, are there some controls (or variant selection, or
> other means) which would have the effect of disabling the default
> expansions performed in a correctly tailored collation (for example,
> in a French collator, is there a way to disable the expansion of
> occurrences of "æ" into "ae")?

The CLDR tailoring syntax allows DUCET expansions to be suppressed or
changed for a particular locale.

There is no mechanism in UCA to change expansions on a per-code-point basis,
e.g. to have, in the same string "Cæsium Kværner", the first æ expand to 'ae'
but the second sort after 'Z', as in Norwegian.
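A small sketch of why that is so: the mapping from characters to weights
belongs to the collator (the table), not to the individual occurrence in the
text. The two tables below are hypothetical and use invented primary weights
(a=1 ... z=26), not DUCET or CLDR values. One expands æ to the weights of
"a" + "e"; the other gives æ a single weight after z. Whichever table is
chosen applies to every æ in the string.

```python
import string

# Invented primary weights: a=1 ... z=26 (not real DUCET values).
BASE = {ch: (i,) for i, ch in enumerate(string.ascii_lowercase, start=1)}


def make_table(ae_weights):
    """Return a copy of BASE with the given weights assigned to 'æ'."""
    table = dict(BASE)
    table["æ"] = ae_weights
    return table


# Expansion: æ gets the weights of "a" followed by "e".
EXPANDING = make_table(BASE["a"] + BASE["e"])
# Norwegian-like: æ is a separate letter sorting after z.
NORWEGIAN_LIKE = make_table((27,))


def sort_key(text, table):
    """Primary-level key; characters missing from the table are ignored."""
    key = []
    for ch in text.lower():
        key.extend(table.get(ch, ()))
    return tuple(key)


text = "Cæsium Kværner"
# With the expanding table, both æ behave like "ae".
same_as_ae = sort_key(text, EXPANDING) == sort_key("Caesium Kvaerner", EXPANDING)
# With the Norwegian-like table, both æ sort after z.
after_z = sort_key("Kværner", NORWEGIAN_LIKE) > sort_key("Kvzrner", NORWEGIAN_LIKE)
print(same_as_ae, after_z)  # True True
```

Getting the mixed behaviour would mean switching tables partway through the
string, which is exactly the per-occurrence control that is missing.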

> Final note : when a complex specification document is modified within
> an existing section, but nothing is changed in the section titles, the
> TOC does not emphasize the fact that a section has been modified. We
> still need to look through the whole text to see which sections have been
> modified. In this proposed PRI, if we just look at the TOC, we could
> think that only one section was added. Shouldn't there be some
> editorial symbol (or additional annotations such as "(modified)") to
> designate explicitly in the TOC which sections contain modifications,
> to immediately go to the section that we are interested in, or see more
> easily if there are correlations or dependencies between those
> modifications ?

You should always look at the modifications section to see what has changed.

We are now using a uniform naming for the proposed versions of documents,
and we use the same anchor for the modifications, so you can always see them
with a URL of the following format:

What would be good is to supply links to sections in the bullets where
possible. Some of the TRs do that, but we are not consistent.

> -- Philippe.
Received on Wed Aug 31 2011 - 11:30:02 CDT
