RE: New version of TR29:

From: Marco Cimarosti
Date: Wed Aug 14 2002 - 13:52:21 EDT

[[ I feel very ashamed to confess that I have found more errors in the
rules. :-(((
My humblest apologies: version 3 is attached. I have re-read it a few times
and found no more problems. ]]

Philipp Reichmuth wrote:
> Hello Marco,
> Your definition of "LatinVowel" is problematic. Is "Y" only a vowel in
> French? In a word such as "yeux", it certainly is a consonant. Could
> this lead to problems?

I don't think so, but I will wait for the opinion of French speakers.

What I can see is that things like "l'yaourt" [lja'ur] are normal in French
spelling, and are sometimes found in Italian as well ("l'yoghurt").

The consonants [j] and [w] have the special status of "semivowels" in Romance
languages, which means that they often behave as vowels do, including in the
rules for elision.

In fact, I wondered whether "J" and "W" should also be included, to catch
some old Italian usages like "l'Jugoslavia" or "l'whisky".

> Defining such classes has the problem that they easily appear too
> general. The mere name "LatinVowel" looks too much like this class was
> supposed to contain all vowels of the Latin script regardless of
> language, but these wouldn't obviously be limited to your selection.
> You have to make this really clear. It is *so* tempting to assume that
> these are all the possible vowels that somebody is probably going to
> do it and base some completely non-apostrophe-related algorithm on it,
> just because he can easily extract this information from some Unicode
> data.

I assumed that only those few vowels would be used in French or Italian in
combination with elision. That's why I have excluded things like Dutch "IJ"
or IPA vowels. Including these vowels would bring no benefit to French or
Italian, and would risk colliding with some unanticipated usage in some other
language.
But, of course, I am aware that there are edge cases that will not be
captured in the general case. I have named one of these edge cases (the
Breton trigraph "c'h"), but it's not difficult to come up with more -- e.g.,
when the apostrophe is used as a diacritic applied to consonants (such as
the Wade-Giles romanization of Chinese "K'ang-hsi").
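
A minimal sketch of the kind of rule under discussion, to make the edge cases
concrete. It assumes the rule is simply "break after an apostrophe that is
followed by a letter of the vowel class"; the names ELISION_VOWELS,
APOSTROPHES and split_after_apostrophe are hypothetical, and the vowel set is
an illustrative guess, not the contents of the attached rules.

```python
# Assumption: an illustrative vowel class (including "Y"), not the real one.
ELISION_VOWELS = set("aeiouyAEIOUY")
# ASCII apostrophe and RIGHT SINGLE QUOTATION MARK (U+2019).
APOSTROPHES = {"'", "\u2019"}


def split_after_apostrophe(word):
    """Split `word` after each apostrophe that is followed by a vowel."""
    pieces = []
    start = 0
    for i in range(1, len(word)):
        if word[i - 1] in APOSTROPHES and word[i] in ELISION_VOWELS:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


print(split_after_apostrophe("l'yaourt"))  # ["l'", 'yaourt']
print(split_after_apostrophe("c'hoari"))   # ["c'hoari"]
print(split_after_apostrophe("K'ang"))     # ["K'", 'ang']
```

Under this simplified rule the French elision is split as intended and the
apostrophe followed by "h" is left alone, but the Wade-Giles "K'ang" is split
as well, which is the sort of case a general rule cannot tell apart from an
elision.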

This is also true of (and accounted for in) the current definition in the
UTR, but I found that the ubiquitous French and Italian "l'", "d'", etc.
cannot be seen as "edge cases".

BTW, notice that I didn't include precomposed accented letters because I
understand UTR#29 works on NFD-normalized text.
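
A small sketch of why base letters are enough once the text is in NFD: a
precomposed accented vowel decomposes into the plain vowel followed by a
combining mark, so a test against unaccented letters still matches. The names
BASE_VOWELS and starts_with_base_vowel are hypothetical, and the vowel set is
again only an illustrative assumption.

```python
import unicodedata

# Assumption: an illustrative set of unaccented vowels, not the real class.
BASE_VOWELS = set("aeiouyAEIOUY")


def starts_with_base_vowel(text):
    """Return True if the NFD form of `text` begins with a base vowel."""
    decomposed = unicodedata.normalize("NFD", text)
    return bool(decomposed) and decomposed[0] in BASE_VOWELS


# U+00E9 (precomposed e with acute) becomes "e" + U+0301 in NFD.
print(unicodedata.normalize("NFD", "\u00e9") == "e\u0301")  # True
print(starts_with_base_vowel("\u00e9t\u00e9"))               # True
print(starts_with_base_vowel("h\u00f4tel"))                  # False
```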

> Better name them something less potentially misleading like
> ItalianFrenchVowel, if you need this character class - it also better
> reflects the purpose of the thing.

That is fine. I just wanted to suggest a possibility, not to substitute for
the UTC's work in defining the precise wording of their documents.

However, "ItalianFrenchVowel" doesn't include Esperanto, Occitan and many
Italian and French dialects.

_ Marco
