Re: The Unicode Standard and ISO from Philippe Verdy via Unicode on 2018-06-08 (Unicode Mail List Archive)

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Fri, 8 Jun 2018 20:45:26 +0200

2018-06-08 19:41 GMT+02:00 Richard Wordingham via Unicode <
unicode_at_unicode.org>:

> On Fri, 8 Jun 2018 13:40:21 +0200
> Mark Davis ☕️ <mark_at_macchiato.com> wrote:
>
> > Mark
> >
> > On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
> > unicode_at_unicode.org> wrote:
> >
> > > On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> > > Marcel Schneider via Unicode <unicode_at_unicode.org> wrote:
> > >
> > > > Thank you for confirming. All witnesses concur to invalidate the
> > > > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > > > After being invented in its actual form, sorting was standardized
> > > > simultaneously in ISO/IEC 14651 and in Unicode Collation
> > > > Algorithm, the latter including practice‐oriented extra
> > > > features.
> > >
> > > The UCA contains features essential for respecting canonical
> > > equivalence. ICU works hard to avoid the extra effort involved,
> > > apparently even going to the extreme of implicitly declaring that
> > > Vietnamese is not a human language.
>
> > A bit over the top, eh?
>
> Then remove the "no known language" from the bug list, or declare that
> you don't know SE Asian languages.
>
> The root problem is that the UCA cannot handle syllable by syllable
> comparisons; if the UCA could handle that, the correct collation of
> unambiguous true Lao would become simple. The CLDR algorithm provides
> just enough memory to make Lao collation possible; however, ICU isn't
> fast enough to load a collation from customisation - it takes hours!
> One could probably do better if one added suffix contractions, but
> adding that capability might be nightmare.

The way tailoring is designed in CLDR, using only data consumed by a generic
algorithm and no custom algorithm, is not the only way to collate Lao. You
can perfectly well add new custom algorithm primitives that use new
collation data rules and that are inserted as "hooks" in the UCA (which
provides several points at which this is possible, but the UCA itself just
makes these hooks act as "no-ops").
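
To illustrate the idea of a hook that defaults to a no-op, here is a minimal
sketch. It is not the UCA and not the ICU API: the class HookedCollator, the
hook signature, and the toy weight table are all assumptions made up for this
example. The toy hook merges "c" + "h" into one unit, something ordinary
tailoring data can already express; it only stands in for a more involved
script-specific step.

```python
from typing import Callable, Dict, List

# Hypothetical hook type: receives the list of collation units and
# returns a (possibly rewritten) list of units.
Hook = Callable[[List[str]], List[str]]


def no_op(units: List[str]) -> List[str]:
    """Default hook: leave the units untouched."""
    return units


class HookedCollator:
    """Toy single-level collator with one customisation point (an assumption,
    not a model of any real library)."""

    def __init__(self, weights: Dict[str, int], hook: Hook = no_op) -> None:
        self.weights = weights
        self.hook = hook

    def sort_key(self, text: str) -> List[int]:
        # Split into single characters, then let the hook regroup them.
        units = self.hook(list(text))
        # Units missing from the table sort after all listed units.
        return [self.weights.get(u, 0x10000 + ord(u[0])) for u in units]


def merge_ch(units: List[str]) -> List[str]:
    """Toy custom primitive: treat 'c' followed by 'h' as one unit 'ch'."""
    out: List[str] = []
    i = 0
    while i < len(units):
        if units[i] == "c" and i + 1 < len(units) and units[i + 1] == "h":
            out.append("ch")
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


# Toy weights: a=0, b=2, c=4, ... with 'ch' placed between 'c' and 'd'.
WEIGHTS = {ch: 2 * i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
WEIGHTS["ch"] = WEIGHTS["c"] + 1

default_collator = HookedCollator(WEIGHTS)           # hook is a no-op
custom_collator = HookedCollator(WEIGHTS, merge_ch)  # hook regroups units

words = ["cosa", "chico"]
print(sorted(words, key=default_collator.sort_key))
print(sorted(words, key=custom_collator.sort_key))
```

With the no-op hook the generic behaviour is unchanged; only a collator built
with a custom hook sees the regrouped units.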

You can be much faster if you create a specific library for Lao that would
still be able to process the basic collation rules and then make more
advanced inferences based on larger cluster boundaries than just those
considered in the standard basic UCA. It is therefore perfectly possible to
extend it to cover more complex Lao syllables and various specific quirks
(such as hyphenation in the middle of clusters, as seen in some Indic scripts
using left matras).
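
As a rough sketch of what comparing on larger boundaries could look like, the
following compares strings syllable by syllable instead of unit by unit. It
does not implement Lao syllable segmentation: the segmenter here is a
placeholder that splits on a "|" marker, and the names flat_key, split_on_bar
and syllable_key are hypothetical. A real library would plug in its own
segmenter and its own base collation key.

```python
from typing import Callable, List, Tuple


def flat_key(text: str) -> Tuple[int, ...]:
    """Stand-in for a basic collation key: one weight per character."""
    return tuple(ord(ch) for ch in text)


def split_on_bar(text: str) -> List[str]:
    """Placeholder segmenter (assumption): syllables are pre-marked with '|'."""
    return text.split("|")


def syllable_key(
    text: str,
    segment: Callable[[str], List[str]] = split_on_bar,
    base_key: Callable[[str], Tuple[int, ...]] = flat_key,
) -> Tuple[Tuple[int, ...], ...]:
    """Key that compares whole syllables first, each keyed by the base rules."""
    return tuple(base_key(syllable) for syllable in segment(text))


words = ["ab|c", "a|bd"]

# Ignoring syllable boundaries: "abc" sorts before "abd".
print(sorted(words, key=lambda w: flat_key(w.replace("|", ""))))

# Respecting syllable boundaries: the syllable "a" sorts before "ab".
print(sorted(words, key=syllable_key))
```

The two orderings differ, which is the point: the same basic weights are
reused, but the comparison is driven by the larger cluster boundaries.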

Not everything has to be specified by the UCA itself, notably when it is
specific to a script (or sometimes only to a single locale, i.e. a specific
combination of a script, language, orthographic convention, and stylistic
convention for some kinds of documents or presentations).
Received on Fri Jun 08 2018 - 13:46:05 CDT
