Re: UCA and Russian letters YE/YO from Philippe Verdy on 2012-12-31 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 31 Dec 2012 12:01:50 +0100

My opinion is that you should not need ANY new character for this, and
the UCA algorithm does not need to be updated either: preprocessing
can include a step that generates boundary conditions, and these
boundary conditions (or others) can be part of what a "collation
element" is (not just a sequence of characters; it could even be a
regular expression or a similar pattern matching that boundary
context).
For now there is no support in LDML for supplying additional context
to collation elements. If this were needed (I suspect it is needed for
Asian languages that have no explicit word separators, or for
languages that have initial mutations), only LDML would have to be
extended to support such a syntax; UCA itself would continue to work
unchanged, the way it is specified.
That's why we spoke about "markup": the source text is transformed
into some preprocessed rich-text syntax or format containing these
precomputed boundaries. This syntax (if based on a plain-text format
like XML) would of course reserve some characters for the syntax (e.g.
the less-than sign and the ampersand), while still allowing literal
signs to be encoded, for example with character entities in XML.
But it is more probable that even an XML format would not be the best
suited: an internal representation of the parsed plain text is enough.
Any regular expression engine that matches occurrences in plain text
already generates contexts on the fly, and these remain usable for
performing substitutions.
So we should not suggest ANY characters for such markup. Only the
language of the markup itself will reserve a few syntactic characters,
and it will offer an alternate way to represent them literally.
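
To make the idea concrete, here is a minimal sketch in Python of a
preprocessing step that folds a word-boundary condition into the
collation element itself, using the regex engine's own context
matching. Everything in it is an assumption made for illustration:
the names (collation_elements, sort_key), the toy two-level weights,
and the ordering policy (word-initial YO gets its own primary weight
right after YE; any other YO shares YE's primary weight and differs
only at the secondary level). It is not UCA and not LDML; it only
shows that no character has to be reserved for the boundary.

```python
import re
import unicodedata

# Toy weight table (assumption): Russian alphabet without YO, even primary
# weights so that an extra weight can be slotted in right after YE.
ALPHABET = "абвгдежзийклмнопрстуфхцчшщъыьэюя"
PRIMARY = {ch: 2 * i for i, ch in enumerate(ALPHABET)}

# The first alternative matches YO only at a word start; the engine computes
# the \b context on the fly. Every other character falls through to "other".
TOKEN = re.compile(r"(?P<initial_yo>\bё)|(?P<other>.)", re.DOTALL)


def collation_elements(text):
    """Split text into (context, character) pairs.

    The boundary is kept as a Python value ("BOUNDARY" or None),
    not as a reserved markup character in the text.
    """
    text = unicodedata.normalize("NFC", text).lower()
    elements = []
    for match in TOKEN.finditer(text):
        if match.lastgroup == "initial_yo":
            elements.append(("BOUNDARY", "ё"))
        else:
            elements.append((None, match.group()))
    return elements


def sort_key(text):
    """Build a toy (primary, secondary) key from the collation elements."""
    primary, secondary = [], []
    for context, ch in collation_elements(text):
        if ch == "ё" and context == "BOUNDARY":
            # Hypothetical policy: word-initial YO is a letter of its own.
            primary.append(PRIMARY["е"] + 1)
            secondary.append(0)
        elif ch == "ё":
            # Elsewhere YO is YE at the primary level, distinct at secondary.
            primary.append(PRIMARY["е"])
            secondary.append(1)
        else:
            # Characters outside the toy table get an arbitrary high weight.
            primary.append(PRIMARY.get(ch, 1000 + ord(ch)))
            secondary.append(0)
    return tuple(primary), tuple(secondary)


words = ["ёлка", "ель", "ежи", "мёд", "мел", "мед"]
print(sorted(words, key=sort_key))
```

With these toy weights, "ёлка" lands after "ель", while "мёд" sorts
immediately after "мед" and before "мел". The boundary never appears
as a character in the text being compared.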

Just consider the LDML compact syntax for collation rules: the
less-than symbol is assigned a role, as is the space, but even
in this case we can represent a less-than symbol or a SPACE as literal
characters for use within a collation element, rather than for defining
the order relations between collation elements in that language.

For LDML in XML syntax, there is already no need to define any new
character: XML itself offers a proper way to represent boundary
conditions as new elements inserted within a collation element.

2012/12/31 Leo Broukhis <leob_at_mailcom.com>
>
> On Wed, Dec 26, 2012 at 11:18 AM, Whistler, Ken <ken.whistler_at_sap.com> wrote:
> > Leo asked:
> >
> >> My question was narrower: assuming that the strings being compared are
> >> words, could it be supported without any markup?
> >
> > ... where "it" refers to conditional weighting based on the (identified) word boundary. And the answer to that is no, unless the word boundary was explicitly indicated with some kind of a markup character, and then the sequence of that markup character plus the target character of interest (in this case Russian Yo) was given a tailored contraction in the weight table which weighted it differently from any Russian Yo not in that particular contraction sequence.
>
> I see your point: if something can be trivially emulated with a markup
> character, there is no need to augment the algorithm (emulating
> backward accents with markup is possible but much more cumbersome).
>
> What characters should be used for such markup, if need be?
Received on Mon Dec 31 2012 - 05:07:57 CST
