Re: DUCET and supplementary foldings (was: Looking for transcription or transliteration standards latin->arabic)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jul 13 2004 - 01:23:36 CDT

Next message: Peter Kirk: "Re: Importance of diacritics"

Previous message: Alain LaBonté: "RE: Changing UCA primar[l]y weights (bad idea)"
In reply to: Asmus Freytag: "Re: Looking for transcription or transliteration standards latin->arabic"
Next in thread: Mark Davis: "Re: Looking for transcription or transliteration standards latin->arabic"

From: "Asmus Freytag" <asmusf@ix.netcom.com>
> I have a certain sympathy for the idea of designing UCA so that the
> untailored *default* works for such kind of multilingual usage. However,
> the other use of the DUCET is to be the most convenient base for applying
> all tailorings. I have a certain sympathy for the position that claims that
> there are important, but perhaps specialized or not economically powerful
> classes of users that will not likely have access to a tailored UCA for
> their language or writing system.
>
> If that is really the case, i.e. appreciable numbers of smaller languages
> would be able to survive without tailoring, then the alternative to fixing
> the DUCET could be a separate publication of a common base tailoring for
> multilingual data access. (A base tailoring would be applied before further
> tailoring for a specific language).

I much appreciate this analysis. The DUCET effectively has two intended
uses, and their purposes conflict. If it is used as a base collation from
which a language-specific collation can be built with just a few rules, then
the other common use, multilanguage searching, is not easy to build on top
of it.

Maybe we could think about designing a new standard collation tailoring
table that could be used as an alternative to the DUCET, but targeting
multilanguage searches.

Such a tailoring would include more folding than the DUCET, moving the
folded differences to a higher (less significant) weight level. It could be
given a name (MUCET, for Multilanguage Unicode Collation Elements Table?)
and be supported as well.
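
To make the folding idea concrete, here is a minimal sketch of two
collation tables that differ only in the level at which one distinction is
recorded. Everything in it is an assumption for illustration: the weights
are toy values (not taken from the real DUCET), the table names BASE and
FOLDED are hypothetical, and only two levels are modelled.

```python
# Toy weights only: none of these values come from the real DUCET.
# Each character maps to (primary, secondary).
BASE = {"o": (10, 1), "ø": (11, 1), "p": (12, 1)}    # "ø" differs at level 1
FOLDED = {"o": (10, 1), "ø": (10, 2), "p": (12, 1)}  # "ø" differs at level 2


def sort_key(text, table, levels=2):
    """Build a multilevel key: all primaries first, then all secondaries."""
    elements = [table[ch] for ch in text]
    return tuple(tuple(e[lvl] for e in elements) for lvl in range(levels))


# A level-1 search distinguishes the two words with BASE ...
base_match = sort_key("pøp", BASE, levels=1) == sort_key("pop", BASE, levels=1)
# ... but treats them as equal with FOLDED.
folded_match = sort_key("pøp", FOLDED, levels=1) == sort_key("pop", FOLDED, levels=1)
# The full-strength FOLDED key still keeps the two words apart.
still_ordered = sort_key("pop", FOLDED) < sort_key("pøp", FOLDED)
```

With the folded table a search limited to level 1 finds both spellings,
while a full comparison still orders them deterministically.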

The DUCET is now quite stable and there's no need to change it: it is well
known and certainly used in many applications that depend on it (RDBMS
engines notably). But a MUCET would certainly be useful, including for
users, who would no longer need to search for multiple words in a
multilanguage database or simply on the web. In addition, nothing prevents
sorting the matching entries by relevance, using the DUCET as a secondary
collation order.
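
The two-step use (match with the folded table, then order the hits with a
finer collation) could look roughly like the sketch below. It is only an
approximation under stated assumptions: fold() stands in for a level-1
MUCET key by stripping combining marks and case, fine_key() is a
placeholder for a real DUCET sort key (it just uses the NFD code points),
and "relevance" is reduced to "exact matches first". The function names
are hypothetical.

```python
import unicodedata


def fold(text):
    """Stand-in for a level-1 folded key: drop combining marks and case."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch)).casefold()


def fine_key(text):
    """Placeholder for a full DUCET sort key (here: NFD code point order)."""
    return unicodedata.normalize("NFD", text)


def search(query, entries):
    """Match on the folded key, then rank exact matches first."""
    exact = unicodedata.normalize("NFC", query)
    hits = [e for e in entries if fold(e) == fold(query)]
    return sorted(hits, key=lambda e: (unicodedata.normalize("NFC", e) != exact,
                                       fine_key(e)))


entries = ["resume", "résumé", "Resume", "rosé", "RÉSUMÉ"]
results = search("résumé", entries)
```

A real implementation would replace both helper functions with keys
generated from the respective collation tables.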

After all, a collation elements table works much like a custom
decomposition table: it creates additional strings whose encoding is not
portable, because it depends on the weight values. Using custom
decompositions is often much simpler than implementing a multilevel
collation, since it can reuse the existing algorithms already implemented
for the NFD and NFKD decompositions. In this view, some extra decompositions
are needed, using non-standard characters for some elements (for example,
decomposing the letter AE into its component letters plus an extra custom
control that only carries weight at a higher collation level, so that it is
used for the full collation order but can be ignored for searches limited to
level 1 or 2).
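
As a rough illustration of the custom-decomposition view, the sketch below
extends NFD with one extra mapping table. The marker character (a
private-use code point, U+E000), the CUSTOM table and the function names
are all assumptions made for this example, not part of any standard.

```python
import unicodedata

# Hypothetical custom control: a private-use code point that records
# "this pair came from a ligature letter". It matters only for full ordering.
MARK = "\uE000"

# Hypothetical extra decompositions that NFD/NFKD do not provide.
CUSTOM = {"æ": "ae" + MARK, "Æ": "AE" + MARK}


def decompose(text):
    """Apply NFD, then the extra custom decompositions."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(CUSTOM.get(ch, ch) for ch in nfd)


def search_form(text):
    """Low-level search key: ignore the marker, combining marks and case."""
    return "".join(ch for ch in decompose(text)
                   if ch != MARK and not unicodedata.combining(ch)).casefold()


same_for_search = search_form("Æther") == search_form("aether")
differ_in_full = decompose("æther") != decompose("aether")
```

Here the two spellings are equal for searching, yet the decomposed strings
still differ by the marker, which a full collation could use as a
tie-breaker.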

