FW: New version of TR29:

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Aug 20 2002 - 05:04:25 EDT


[Resending a message I sent during the Unicode List downtime]

-----Original Message-----
From: Marco Cimarosti
Sent: Monday, August 19, 2002 7:03 PM
To: 'Eric Muller'; unicode@unicode.org
Subject: RE: New version of TR29:

Eric Muller wrote:
> > > Your definition of "LatinVowel" is problematic. Is "Y" only a vowel in
> > > French? In a word such as "yeux", it certainly is a consonant. Could
> > > this lead to problems?
> >
> > I don't think so, but I wait for the opinion of French speakers.
> >
> > What I can see is that things like "l'yaourt" [lja'ur] are normal in
> > French spelling, and sometimes are to be found also in Italian
> > ("l'yoghurt" ['ljogurt]).
>
> "y" is either a vowel or a semi-consonant.

Terminology varies here, especially across different languages. I tend to
avoid "semiconsonant" because, in some contexts, it has a very different
meaning. E.g., some Italian grammars use the term for [r] and [l].

> When a semi-consonant, an
> initial "y" does not cause elision, so "le yaourt".

But I notice that "l'yaourt" is also used. However, this is not an issue: a
string like French "le yaourt" clearly doesn't pose any segmentation
problem, as the two words are separated by a space.

The issue is with strings like "l'yaourt", "l'yeuse" or "l'ypérite" (using
your examples): if you don't include "y" in the "LatinVowels" class, they
would all be considered as single words.
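
A minimal sketch of that point, assuming a simplified rule in which a break is
placed after an apostrophe only when the next letter belongs to a given vowel
class. The function name `segment` and the two vowel strings are hypothetical
and only illustrate the effect of including "y"; they are not the TR#29
property definitions.

```python
def segment(word, vowels):
    """Split `word` after each apostrophe that is followed by a letter in `vowels`."""
    parts, start = [], 0
    for i in range(1, len(word)):
        # Break between word[i-1] and word[i] when an apostrophe precedes a "vowel".
        if word[i - 1] == "'" and word[i].lower() in vowels:
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


WITHOUT_Y = "aeiou"   # assumed vowel class that leaves "y" out
WITH_Y = "aeiouy"     # assumed vowel class that includes "y"

for w in ["l'yaourt", "l'yeuse", "l'ypérite"]:
    print(w, segment(w, WITHOUT_Y), segment(w, WITH_Y))
```

With "y" left out, each string comes back as one segment; with "y" included,
the article "l'" is split off from the noun.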

However, for the purposes of TR#29, it is not particularly useful to know
whether a "y" is a vowel or a semivowel, as long as it is preceded by "d'",
"l'", etc.

> Of course, there are
> exceptions: "yeuse" (oak), "yèble" (?) and "yeux" (eyes). The usage is
> both ways for "yole" (skiff). There are a few words starting with a
> vowel "y": "y" (there), "ypérite" (mustard gas), "ytterbium" (?),
> "yttrium" (?). Finally, there is elision before most proper nouns
> starting with "Y": "Yonne" (a river), "York", etc.
>
> That being said, here are a few problematic cases for your proposal:
>
> "prud'homme" (a member of an industrial tribunal) is a single word, as
> are his relatives "prud'homal", and "prud'homie".

Neither the current definition nor my proposed modification claims success in
100% of cases.

The issue is making the error window as narrow as possible. My assumption is
that elided forms such as "c'", "d'", "j'", "l'", "n'", "qu'", "s'", "t'"
or "v'" are far more frequent than edge cases like "prud'homme".

E.g., "l'" (= "la", French "le", Italian "lo"), "d'" (= French "de", Italian
"di") are among the top-ranking occurrences in any French or Italian text.

> Grevisse ("Le bon usage", "the" authority on French usage) gives five
> verbs which are considered a single word: "entr'aimer (s')",
> "entr'apercevoir", "entr'appeler (s')", "entr'avertir (s')",
> "entr'égorger (s')"; Le Petit Robert (1988, a well respected dictionary)
> gives only the second one.
>
> There is elision before the names of the consonants f, h, l, m, n, r, s,
> x: "admissible à l'X" (accepted at X = École Polytechnique), "devant
> l'n" (before the n).
>
> "grand'mère" is definitely one word for me, but "grand'rue",
> "grand'chose" are not so clear. All are archaic forms and Le Petit
> Robert does not list any of those (modern: "grand-mère", "rue
> principale", "grand chose").
>
> Then there is spoken French: "j'suis allé m'promener" for "je suis allé
> me promener" (I went for a walk). There are many such cases of elision
> before a consonant.
>
> This spoken French is of course very close to many dialects, or even
> close languages (e.g. Picard, spoken in the North of France).

All these edge cases can only be captured by a more sophisticated
*language-dependent* algorithm.

In a simple, generic, language-independent definition, you have to sacrifice
some cases. The problem is deciding which cases we are willing to sacrifice. I
currently see three scenarios:

1) The current English-biased handling (apostrophes never break):
        Correct: "cromlec'h", "don't", "K'ang-hsi", "O'zbek", "prud'homme";
        Incorrect: "d'homme", "j'aime", "j'suis", "l'Angleterre", "l'X";

2) A possible French-biased handling (apostrophes always break):
        Correct: "d'homme", "j'aime", "j'suis", "l'Angleterre", "l'X";
        Incorrect: "cromlec'h", "don't", "K'ang-hsi", "O'zbek", "prud'homme";

3) My proposed English/French-biased modification (apostrophes break
heuristically):
        Correct: "d'homme", "don't", "j'aime", "l'Angleterre", "O'zbek";
        Incorrect: "cromlec'h", "j'suis", "K'ang-hsi", "l'X", "prud'homme";

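The three scenarios can be compared with a rough sketch. It assumes that the
heuristic in scenario 3 amounts to "break after an apostrophe when the next
letter is a Latin vowel, 'y' or 'h' (ignoring accents)". That reading is
consistent with the examples listed but is an assumption, not the wording of
the proposal or of TR#29. The names `split_word`, `LATIN_VOWELS` and the
policy labels are hypothetical.

```python
import unicodedata

# Assumed stand-in for the "LatinVowels" class: base vowels plus "y" and "h".
LATIN_VOWELS = set("aeiouyh")


def base_letter(ch):
    """Lower-case `ch` and drop any diacritics (e.g. 'É' -> 'e')."""
    return unicodedata.normalize("NFD", ch.lower())[0]


def split_word(token, policy):
    """Split `token` after apostrophes according to `policy`.

    policy is "never" (scenario 1), "always" (scenario 2)
    or "heuristic" (scenario 3, as assumed above).
    """
    pieces, start = [], 0
    for i, ch in enumerate(token):
        # Only an apostrophe followed by another character can carry a break.
        if ch != "'" or i + 1 >= len(token):
            continue
        if policy == "never":
            do_break = False
        elif policy == "always":
            do_break = True
        else:
            do_break = base_letter(token[i + 1]) in LATIN_VOWELS
        if do_break:
            pieces.append(token[start:i + 1])
            start = i + 1
    pieces.append(token[start:])
    return pieces


SAMPLES = ["cromlec'h", "d'homme", "don't", "j'aime", "j'suis",
           "K'ang-hsi", "l'Angleterre", "l'X", "O'zbek", "prud'homme"]

for word in SAMPLES:
    print(word, [split_word(word, p) for p in ("never", "always", "heuristic")])
```

Whether a given split counts as "correct" still has to be judged against the
language of the text, which is exactly what a language-independent rule cannot
know.
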
> Did we mention that one never breaks a line after an apostrophe that
> represents elision?

There is a separate technical report (UAX#14) for line breaks.

As far as I understand, TR#29 only defines boundaries between elements like
"graphemes", "words" and "sentences". I guess these boundaries would be used
by functions such as "Select letter/word/sentence at cursor", "Count
letters/words/sentences in document", etc.

_ Marco


