Re: Possible 'Normal' Nested Contractions for Collation

From: Markus Scherer <>
Date: Tue, 8 May 2012 09:05:49 -0700

On Tue, May 8, 2012 at 5:16 AM, Wordingham, Richard (UK) <> wrote:

> I think I may have an example. Back in 2005, Michael Everson proposed the
> encoding of TIBETAN LETTER KHHA and TIBETAN LETTER GHHA, for use in
> the Balti language. I cannot find any challenge to the concept that they
> are regarded as letters by their users. The reason they were not encoded
> is that encoding as <U+0F41 TIBETAN LETTER KHA, U+0F39 TIBETAN MARK TSA
> -PHRU> and <U+0F42 TIBETAN LETTER GA, U+0F39> already provides the
> functionality, as Andrew West pointed out in
> . Now, if writing Balti
> (or even Arabic in transcription) uses long vowels, and these proposed
> characters are full-blown letters in Balti, one will want to sort KHHII as
> KHHA followed by II, although the sequence normalises as <U+0F41, U+0F71,
> U+0F72, U+0F39>.
> I appreciate that this example is uncertain. It seems that the Tibetan
> language treats 'new letters' made with TSA -PHRU as being old letter
> plus diacritic and therefore sorting within the base letters, in which case
> it seems that misassociating it on the vowel does not actually adversely
> affect collation.
> Please reply via the Unicode or Unicore list.
> Richard.

Replying to the unicode list.

The context is a discussion of whether it is necessary in the UCA
(collation) spec to support interleaved contractions: Contractions a+b and
x+y where NFD(abxy)=axby and ccc(x)<ccc(b).
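
To make that condition concrete, here is a minimal Python sketch using the
standard unicodedata module. The helper name is_interleaved and the sample
characters are illustrative assumptions: the sequence is synthetic and is
not claimed to be real spelling in any language.

```python
import unicodedata

def is_interleaved(a, b, x, y):
    """True if NFD(a+b+x+y) == a+x+b+y, i.e. x is reordered in front of b.

    Assumes each argument is a single code point with no canonical
    decomposition of its own (hypothetical helper, for illustration only).
    """
    return unicodedata.normalize("NFD", a + b + x + y) == a + x + b + y

# Synthetic sample: b and y have ccc 230, x has ccc 220.
a, b, x, y = "a", "\u0308", "\u0323", "\u0301"

print(unicodedata.combining(x) < unicodedata.combining(b))  # ccc(x) < ccc(b)?
print(is_interleaved(a, b, x, y))                           # NFD(abxy) == axby?
```

Canonical reordering is a stable sort by combining class, so x moves ahead
of b only when ccc(x) < ccc(b), and y stays behind b only when
ccc(y) >= ccc(b).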

The general question is whether any properly spelled text in any language
involves such "axby" sequences in the Unicode encoding model.

This Balti example appears relevant only if the Balti (or Tibetan) encoding
model in Unicode would add further pairs of combining marks like <U+0F71,
U+0F72>, without any intervening base letter (zero combining class).
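
The reordering in the Balti example can be checked with a short Python
sketch. It assumes only the standard unicodedata module; the variable names
khha and ii are illustrative labels for the sequences quoted above.

```python
import unicodedata

khha = "\u0f41\u0f39"   # TIBETAN LETTER KHA + TIBETAN MARK TSA -PHRU
ii = "\u0f71\u0f72"     # the two combining vowel signs from the quoted message

nfd = unicodedata.normalize("NFD", khha + ii)

# Code points of the normalized sequence, with their combining classes.
for ch in nfd:
    print(f"U+{ord(ch):04X}  ccc={unicodedata.combining(ch)}")
```

Both vowel signs have a lower combining class than U+0F39, so they move
ahead of it together, giving the order <U+0F41, U+0F71, U+0F72, U+0F39>
quoted above rather than an "axby" order.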


Google Internationalization Engineering
Received on Tue May 08 2012 - 11:11:29 CDT
