PRI#203: UTS#10 (UCA) update : characters needed to avoid contractions or expansions

From: Philippe Verdy <>
Date: Wed, 31 Aug 2011 04:20:29 +0200

In the proposed update of UTS#10 (UCA), subject to the PRI #203 just
posted, I note the following addition in section 3.3.2 (Contractions).

"Characters of a contraction can be made to sort as separate
characters with the insertion of any starter character. There are two
characters, soft hyphen and U+034F COMBINING GRAPHEME JOINER that are
particularly useful for this purpose. These can be used to separate
contractions that would normally be weighted as units, such as Slovak
ch or Danish aa. For more information, see Section 5.3 Use of
Combining Grapheme Joiner."
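
To make the mechanism in the quoted paragraph concrete, here is a minimal sketch of how an inserted ignorable character can block a contraction. It is a toy longest-match weighting, not the UCA algorithm itself: the weight values, the `PRIMARY`, `CONTRACTIONS` and `IGNORABLE` tables and the `sort_key` function are all hypothetical names invented for illustration, and treating both characters as completely ignorable is an assumption.

```python
# Toy illustration only: a hypothetical Slovak-like tailoring where "ch"
# is one collation unit sorting after "h". This is NOT the UCA algorithm.
CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER
SHY = "\u00ad"  # U+00AD SOFT HYPHEN

# Hypothetical primary weights for a handful of letters.
PRIMARY = {"a": 1, "c": 3, "d": 4, "h": 8, "i": 10}
# Hypothetical contraction: "ch" weighted as a unit, just after "h".
CONTRACTIONS = {"ch": 9}
# Assumed to map to completely ignorable collation elements.
IGNORABLE = {CGJ, SHY}


def sort_key(text):
    """Return primary weights, matching two-letter contractions first."""
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:
            # The two letters are adjacent, so they are weighted as a unit.
            key.append(CONTRACTIONS[pair])
            i += 2
        elif text[i] in IGNORABLE:
            # Adds no weight; its presence already prevented the pair match.
            i += 1
        else:
            key.append(PRIMARY[text[i]])
            i += 1
    return key


words = ["cha", "hi", "c" + CGJ + "ha", "da"]
ordered = sorted(words, key=sort_key)
```

In this sketch "cha" is weighted as the unit "ch" followed by "a", while the same letters with U+034F between "c" and "h" are weighted as three separate letters and therefore sort before "da".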

However, in a past discussion here on this Unicode mailing list, we
discussed at length the fact that SOFT HYPHEN would not be
appropriate to avoid contractions as it would also imply a break
opportunity, which may be undesirable when the only intended thing was
to prohibit things like contractions (for collation), or even

Note also that there is another collation-ignorable character for that,
ZWNJ, which does not imply a break opportunity. But it is not clear
whether it also prevents the contraction, for collation
What do you think about this added paragraph? Is it complete enough,
or should other uses be shown as well?

Note that contractions are normally not part of the DUCET, only part
of advanced tailorings for specific languages or even just
orthographies for specific dialects or bibliographic conventions.

However, CLDR does not use the DUCET directly in its "root" locale: it
tailors it slightly (using a few expansions), so that a pure DUCET-only
collation (with only the default weights, no contractions, no
expansions) itself requires some tailoring rules relative to the
collation implied by the "root" locale. This may therefore affect the
CLDR "root" locale as well, which could ultimately define some
contractions by default (I just wonder how such a contraction found in
an inherited locale can be undone in a subsequent tailoring rule for a

Another interesting question is: how can we encode in text the fact
that a character usually considered a ligature in a language (which
collates it as separate letters, even if the ligature is orthographic
and not just typographic) should still be collated as a single letter?
In other words, are there controls (or variant selection, or other
means) that would disable the default expansions performed in a
correctly tailored collation? For example, in a French collator, is
there a way to disable the expansion of occurrences of "æ" into "ae"?
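
For readers less familiar with expansions, here is a minimal sketch of what is being asked about, assuming the ligature in question is U+00E6 (æ). The `EXPANSIONS` table and the `primary_key` function are hypothetical and only handle the letters a to z plus the two ligatures; a real tailored collator works on collation elements with several weight levels.

```python
# Hypothetical expansion table for a French-like tailoring (illustrative).
EXPANSIONS = {
    "\u00e6": "ae",  # LATIN SMALL LETTER AE
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE
}


def primary_key(text):
    """Expand ligatures, then weight each letter by alphabet position."""
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in text.lower())
    return [ord(ch) - ord("a") + 1 for ch in expanded]


# With the expansion in place, both spellings get the same primary key.
same = primary_key("c\u00e6cum") == primary_key("caecum")
```

In this toy the expansion applies to every occurrence of the ligature, with no way for the text itself to opt out, which is exactly the situation the question above is about.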

Final note: when a complex specification document is modified within
an existing section but the section titles are unchanged, the TOC
does not show that the section has been modified. We still need to
read through the whole text to see which sections have been
modified. In this proposed PRI, if we just look at the TOC, we could
think that only one section was added. Shouldn't there be some
editorial symbol (or an additional annotation such as "(modified)") to
designate explicitly in the TOC which sections contain modifications,
so that we can go immediately to the section we are interested in, or
see more easily whether there are correlations or dependencies between
those modifications?

-- Philippe.
Received on Tue Aug 30 2011 - 21:26:03 CDT
