FW: New version of TR29:

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Aug 20 2002 - 05:03:24 EDT

[Resending a message I sent during the Unicode List downtime]

-----Original Message-----
From: Marco Cimarosti
Sent: Monday, August 19, 2002 12:55 PM
To: 'Philipp Reichmuth'
Cc: Unicode@unicode.org
Subject: RE: New version of TR29:

Philipp Reichmuth wrote:
> MC> Consonants [j] and [w] have the special status of "semivowels" in
> MC> Romance languages, which means that they often behave as vowels
> MC> do, including in the rules for elision.
> One has to differentiate between phonemes and graphemes. Unicode, of
> course, operates on the grapheme level, and thus you simply can't be
> certain what a "y" actually stands for (vowel or semivowel)

As I understand it, the purpose of TR#29 is just segmenting text for the
sake of functions like moving the cursor to the next/previous "word". This
can be done reasonably well without any real linguistic analysis of the text.

If the string "l'yaourt" appears in the text, we are quite sure that "l'"
and "yaourt" are two different segments (two "words"). Whether the "y"
represents [j] or [i] is a matter for some different kind of analysis.

> MC> But, of course, I am aware that there are edge cases that will not
> MC> be captured in the general case. I have named one of these edge
> MC> cases (the Breton trigraph "c'h"), but it's not difficult to come
> MC> up with more -- e.g., when the apostrophe is used as a diacritic
> MC> applied to consonants (such as the Wade-Giles romanization of
> MC> Chinese "K'ang-hsi").
> Just to give another example: Uzbek in Latin script uses "o'" and "g'"
> as opposed to "o" and "g", such as in the language designation
> "O'zbek" where "o'" stands for the sound designated in Cyrillic script
> by U+040E and "g'" is equivalent to U+0493.

"O'zbek" would not split, because the apostrophe is not followed by "a",
"e", "i", "o", "u" or "y".
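
To make the rule concrete, here is a minimal Python sketch of it. This is only
an illustration under my own assumptions, not text from TR#29: the names
ELISION_CAUSING_LETTERS, APOSTROPHES and split_at_elision are made up, and
treating U+2019 like the ASCII apostrophe and ignoring case are guesses.

```python
# Letters that trigger a break when they directly follow an apostrophe.
# The set comes from the discussion; matching case-insensitively is an assumption.
ELISION_CAUSING_LETTERS = "aeiouy"

# Characters treated as an apostrophe (assumption: ASCII ' and U+2019).
APOSTROPHES = "'\u2019"


def split_at_elision(token):
    """Split a token after each apostrophe that is followed by one of the letters."""
    pieces = []
    start = 0
    for i in range(1, len(token)):
        follows_apostrophe = token[i - 1] in APOSTROPHES
        if follows_apostrophe and token[i].lower() in ELISION_CAUSING_LETTERS:
            pieces.append(token[start:i])  # keep the apostrophe with the left part
            start = i
    pieces.append(token[start:])
    return pieces


print(split_at_elision("l'yaourt"))   # ["l'", 'yaourt']
print(split_at_elision("O'zbek"))     # ["O'zbek"]
print(split_at_elision("c'h"))        # ["c'h"]
print(split_at_elision("K'ang-hsi"))  # ["K'", 'ang-hsi']
```

The last call shows one of the edge cases mentioned above: the Wade-Giles
apostrophe is followed by "a", so this simple rule splits it wrongly.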

> MC> BTW, notice that I didn't include precomposed accented letters
> MC> because I understand UTR#29 works on NFD normalized text.
> Does NFD in this instance mean to include U+0080..00FF, i.e. the
> former Latin-1 upper block? It would be of interest to us Germans :-)

AFAIK, in NFD (Normalization Form D, i.e. canonical decomposition), only the
non-decomposable characters are preserved in range U+0080..U+00FF. All
letters with diacritics ("ä", "ö", "ü", etc.) are decomposed.
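
This is easy to check with Python's standard unicodedata module. The sketch
below is just an illustration; the helper names nfd and code_points are mine,
and the sample characters (three German umlauted letters, sharp s, and the
division sign) are picked only as examples from the U+0080..U+00FF range.

```python
import unicodedata


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


# U+00E4, U+00F6, U+00FC are letters with a diaeresis; U+00DF (sharp s)
# and U+00F7 (division sign) have no canonical decomposition.
for ch in ["\u00e4", "\u00f6", "\u00fc", "\u00df", "\u00f7"]:
    print(code_points(ch), "->", code_points(nfd(ch)))

# U+00E4 -> U+0061 U+0308
# U+00F6 -> U+006F U+0308
# U+00FC -> U+0075 U+0308
# U+00DF -> U+00DF
# U+00F7 -> U+00F7
```

So after NFD the base letter is a plain "a", "o" or "u" followed by U+0308,
which is why a rule written in terms of unaccented letters can still match.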

> MC> However, "ItalianFrenchVowel" doesn't include Esperanto, Occitan
> MC> and many Italian and French dialects.
> "RomanceVowel"? (Not a lot better.)

Indeed... Perhaps I should avoid any interference with phonetic terminology.
E.g. a rather meaningless label like "ElisionCausingLetter" could do.

_ Marco
