RE: FW: New version of TR29:

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Aug 20 2002 - 11:05:37 EDT


Philipp Reichmuth wrote:
> MC> "O'zbek" would not split, because the apostrophe is not
> MC> followed by "a", "e", "i", "o", "u" or "y".
>
> "G'iyosaddin" would (sorry for the silly word, it's the middle name of
> a medieval poet, but it's the first thing that came into my mind, and
> "g'" is not such a rare combination in Uzbek that this is the only
> case).
>
> You can't sensibly base a general-purpose word splitting
> algorithm on the French and Italian definition of "vowel".

It depends on what you mean by "sensibly".

To me, privileging French and Italian over Uzbek is a sensible choice. Text
written in English, or Swedish, or Chinese is more likely to contain French
quotations than Uzbek quotations.

> It is probably impossible to do that without looking at the language
> of your encoded string.

Definitely so. I thought that this was not in dispute, as DUTR#29 itself
clearly states that the general algorithm needs to be *tailored* for each
language.

An Uzbek tailoring will contain special rules for "G'" and "O'", which will
override the general mechanism.

But outside the cases covered by the Uzbek-specific tailored rules, Uzbek
text also needs to follow the default rules, so that terms or quotations
from other languages are handled correctly in the *majority* of cases...
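
To make the layering concrete, here is a rough sketch of what I have in mind.
It is not taken from DUTR#29: the function name split_word, the parameter
no_break_after, and the choice to keep the apostrophe attached to the
left-hand piece are all assumptions made for illustration only.

```python
# Illustrative sketch only: names and details are hypothetical,
# not part of DUTR#29 or of any real library.

VOWELS = set("aeiouy")          # the vowel set discussed in this thread
APOSTROPHES = {"'", "\u2019"}   # ASCII apostrophe and right single quote


def split_word(word, no_break_after=frozenset()):
    """Split `word` after each apostrophe that is followed by a vowel.

    `no_break_after` is the tailoring hook: a set of lowercase letters
    after which an apostrophe never causes a split (assumed to be
    {"g", "o"} for Uzbek). With an empty set, only the default rule runs.
    """
    parts = []
    start = 0
    for i, ch in enumerate(word):
        if ch not in APOSTROPHES or i + 1 >= len(word):
            continue
        # Tailored rule first: it overrides the default mechanism.
        if i > 0 and word[i - 1].lower() in no_break_after:
            continue
        # Default rule: break only when a vowel follows the apostrophe.
        if word[i + 1].lower() in VOWELS:
            parts.append(word[start:i + 1])
            start = i + 1
    parts.append(word[start:])
    return parts


UZBEK = frozenset({"g", "o"})   # assumed Uzbek tailoring for "G'" and "O'"

print(split_word("l'amico"))             # default rule, Italian elision
print(split_word("G'iyosaddin"))         # default rule wrongly splits
print(split_word("G'iyosaddin", UZBEK))  # tailoring keeps the word whole
print(split_word("l'amico", UZBEK))      # default still applies in Uzbek text
```

The last call is the point of the paragraph above: even with the Uzbek
tailoring switched on, an Italian quotation falls through to the default
rule and is still split.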

_ Marco


