Re: FW: New version of TR29:

From: Philipp Reichmuth (uzsv2k@uni-bonn.de)
Date: Tue Aug 20 2002 - 09:26:35 EDT


>> Just to give another example: Uzbek in Latin script uses "o'" and "g'"
>> as opposed to "o" and "g", such as in the language designation
>> "O'zbek" where "o'" stands for the sound designated in Cyrillic script
>> by U+040E and "g'" is equivalent to U+0493.

MC> "O'zbek" would not split, because the apostrophe is not followed by "a",
MC> "e", "i", "o", "u" or "y".

"G'iyosaddin" would, though. (Sorry for the silly word; it's the middle
name of a medieval poet and the first thing that came to mind, but "g'"
is common enough in Uzbek that this is far from the only case.) You
can't sensibly base a general-purpose word-splitting algorithm on the
French and Italian definition of "vowel".

It is probably impossible to decide where an apostrophe splits a word
without looking at the language of the encoded string.
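
To make the objection concrete, here is a minimal sketch of the rule as
MC paraphrases it above: split after an apostrophe whenever the next
letter is "a", "e", "i", "o", "u" or "y". This is only my reading of
that one sentence, not the TR29 algorithm itself, and the names VOWELS
and split_at_apostrophe are made up for illustration.

```python
# Hypothetical rule taken from the quoted sentence, not from TR29 itself:
# break after an apostrophe when the following letter is one of these.
VOWELS = set("aeiouy")


def split_at_apostrophe(word):
    """Return the pieces of `word` under the vowel-after-apostrophe rule."""
    parts = []
    start = 0
    for i, ch in enumerate(word):
        followed_by_vowel = i + 1 < len(word) and word[i + 1].lower() in VOWELS
        if ch == "'" and followed_by_vowel:
            # Keep the apostrophe with the left-hand piece, as in French "l'".
            parts.append(word[start:i + 1])
            start = i + 1
    parts.append(word[start:])
    return parts


for w in ["l'avion", "O'zbek", "G'iyosaddin"]:
    print(w, split_at_apostrophe(w))
```

Under this rule "l'avion" splits as intended and "O'zbek" stays whole,
but "G'iyosaddin" is cut into "G'" and "iyosaddin", even though "g'" is
a single letter in Uzbek Latin orthography.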

  Philipp mailto:uzsv2k@uni-bonn.de
___________________
With searching comes loss / and the presence of absence / The data, not found


