Re: UCA and Russian letter Ё from Leo Broukhis on 2012-12-21 (Unicode Mail List Archive)

From: Leo Broukhis <leob_at_mailcom.com>
Date: Fri, 21 Dec 2012 08:32:55 -0800

[Philippe tells me that his message that I'm quoting could have been
rejected by the mailing list as spam; my answer is below.]

On Fri, Dec 21, 2012 at 5:13 AM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
> This is an interesting case. A solution would be to be able to define a
> distinct collation element for "^ë", where "^" means "beginning of a word"
> (even if there's no character encoded there). That element would be such
> that:
>
> e << ë < ^ë
>
> But this requires a prior definition of word boundaries to recognize the "^"
> as an additional collation element by itself (usable distinctly only in
> context, and ignored when it occurs anywhere else, meaning that all weights
> assigned to "^" alone would be null.)
>
> So "^ë" would become valid as a collation element, but "т^ё" makes no sense
> if there's no possible word boundary between "т" and "ё".
>
> This would work with the UCA algorithm, which does not really mandate what
> is a "collation element" (not only in terms of encoding as characters), or
> any syntax to support it.
>
> This mechanism of incorporating word boundaries in UCA would be an
> interesting extension for section 6.9 (Handling Collation Graphemes) of
> UTS#10 (but for now there's no support for it in LDML with a defined syntax
> allowing the insertion of boundaries or other contextual conditions).

Would it also mean that using a CGJ at the beginning of a word will
cause a ё at the beginning of a word to be treated as a mid-word one?
Is "space, CGJ" a well-formed character sequence?
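
To make the question concrete, here is a minimal Python sketch of the
proposed ordering e << ё < ^ё. It is a toy two-level key, not UCA or any
real library's API: the weight table, the YO_INITIAL_PRIMARY constant, and
the sort_key function are hypothetical names invented for illustration. It
also assumes, without any specification backing it, that a CGJ would block
the word-initial element the same way it blocks an ordinary contraction.

```python
CGJ = "\u034F"  # U+034F COMBINING GRAPHEME JOINER

# Hypothetical primary weights: even numbers, leaving a gap after each letter.
ALPHABET = "абвгдежзийклмнопрстуфхцчшщъыьэюя"
PRIMARY = {ch: 2 * (i + 1) for i, ch in enumerate(ALPHABET)}

# "^ё": primary weight just above "е", so it sorts after every "е..." word
# and before every "ж..." word.
YO_INITIAL_PRIMARY = PRIMARY["е"] + 1


def sort_key(text):
    """Return a (primaries, secondaries) key for a toy Russian collation."""
    primaries, secondaries = [], []
    at_word_start = True
    for ch in text.lower():
        if ch == CGJ:
            # Assumption: CGJ gets no weight but cancels the "^" context.
            at_word_start = False
            continue
        if ch.isspace():
            # Whitespace gets the lowest primary and opens a new word.
            primaries.append(0)
            secondaries.append(0)
            at_word_start = True
            continue
        if ch == "ё":
            if at_word_start:
                # "^ё" is its own element with a primary difference.
                primaries.append(YO_INITIAL_PRIMARY)
                secondaries.append(0)
            else:
                # Mid-word "ё" differs from "е" only at the secondary level.
                primaries.append(PRIMARY["е"])
                secondaries.append(1)
        else:
            primaries.append(PRIMARY.get(ch, 0))
            secondaries.append(0)
        at_word_start = False
    return (primaries, secondaries)


words = ["жук", "мех", "ёж", "мёд", "ель", "мел", "еда"]
print(sorted(words, key=sort_key))
print(sorted(["ель", CGJ + "ёж", "еда"], key=sort_key))
```

Under these assumptions, "ёж" sorts after "ель" and before "жук", while
CGJ + "ёж" falls back to the mid-word weights and sorts between "еда" and
"ель", which is the behaviour I am asking about.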

Leo
Received on Fri Dec 21 2012 - 10:36:33 CST
