Re: Character folding in text editors from Philippe Verdy on 2016-02-20 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 21 Feb 2016 00:19:19 +0100

It should also be noted that the kind of "folding" described/desired by
Elias will likely not meet his expectations, even when using collation data
in CLDR tailored per language.

Notably, this data, even if it is used at its weakest strength (the primary
collation level only, discarding other differences at higher strength
levels), will most often not collate many digrams/trigrams that are
frequently used in the locale for which the data is designed. The reason
for that is that most of these digrams/trigrams (used in the orthography to
note a single phoneme) are highly context-dependent and could in fact cover
several distinct phonemes.

E.g. "on" in French is a digram for the nasal o. There are also mute
letters (consonants) following it in the same phoneme. But if the
consonant is followed by a vowel, then there's a possible syllable break
between "on" and the following consonant. However that vowel may also be
mute (if it is a final "e"), in which case there's a single syllable. If
the digram "on" is followed by a vowel, it is no longer a digram and
there's a syllable break between "o" and "n", but if "on" is followed by a
mute vowel (final "e"), that syllable break disappears, but the digram "on"
is still two distinct phonemes. "on" may also be followed by another "n"
and a vowel (possibly mute), in which case "on" is never a single phoneme.
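
As a rough illustration of why a context-free mapping cannot work, here is a
minimal Python sketch of only the rules stated above for "on". The function
name `classify_on`, the labels, and the vowel list are hypothetical choices
made for this example; it is not a real syllabifier and it knows nothing
about exceptions.

```python
from typing import List

# Assumption: a small, incomplete set of French vowel letters for illustration.
# A set is used so that the empty string (end of word) is never "in" it.
VOWELS = set("aeiouyàâéèêëîïôùûü")

def classify_on(word: str) -> List[str]:
    """Label each "on" in `word` as "nasal" (one phoneme) or "split" (o + n).

    Only the rules from the paragraph above are applied:
    - "on" followed by a vowel (even a mute final "e") -> two phonemes
    - "on" followed by another "n"                      -> two phonemes
    - otherwise (consonant or end of word)              -> nasal o
    """
    w = word.lower()
    labels = []
    i = w.find("on")
    while i != -1:
        nxt = w[i + 2:i + 3]  # next letter, or "" at the end of the word
        if nxt == "n" or nxt in VOWELS:
            labels.append("split")
        else:
            labels.append("nasal")
        i = w.find("on", i + 2)
    return labels

for word in ("bon", "monde", "bonne", "sonore", "téléphone", "monsieur"):
    print(word, classify_on(word))
```

Note that "monsieur" comes out as "nasal" here, although its "on" is not
pronounced as a nasal o: even this tiny rule set already hits an exception.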

There are similar issues with other digrams/trigrams in French such as
"ein", "aint", "un". Some distinct difficulties with "gu", "ge" and "qu".
And more difficulties with "ch" (also in English and other languages).
Different difficulties with "ai"...

Determining which digrams/trigrams are a single phoneme requires parsing
words for syllable breaks. But there are many exceptions (notably because
languages are borrowing lots of words from other languages with their
original orthography, and the pronunciation is only slightly altered).

There exist some algorithms that try to use those weak "equivalences", based
on the apparent orthography, to infer some basic phonetics from it.
This is used for performing approximate searches in arbitrary plain text,
even in cases where there may be some orthographic typos in it. Look for
example at the SOUNDEX function (you'll first need to detect word breaks
for some implementations).
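
For reference, a minimal Python sketch of the classic (English-oriented)
Soundex code, plus a naive word-break step for searching plain text. The
names `soundex` and `find_similar` are made up for this example, and the
ASCII-only handling (letters outside A-Z are simply dropped) is a
simplifying assumption, not how a language-aware implementation would work.

```python
import re

# Letter groups of the classic Soundex code; vowels, H, W and Y have no digit.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}

def soundex(word: str) -> str:
    """Return the 4-character Soundex code of `word` ("" if it has no A-Z letter)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets `prev`, so a repeated code counts again
        if len(result) == 4:
            break
    return (result + "000")[:4]

def find_similar(text: str, query: str) -> list:
    """Split `text` into ASCII words and keep those whose code matches `query`."""
    target = soundex(query)
    return [w for w in re.findall(r"[A-Za-z]+", text) if soundex(w) == target]

print(soundex("Robert"), soundex("Rupert"))            # R163 R163
print(find_similar("Dupond et Dupont sont la", "Dupont"))  # ['Dupond', 'Dupont']
```

The match is purely orthographic: it says nothing about whether the two
words are actually pronounced alike in a given language.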

Trying to use dictionary data for determining the syllable breaks may be
useful, but you need a lot of data (and all dictionaries are incomplete).
For disambiguating some cases, you'll in fact need to determine the actual
phonetics by using a phonetic dictionary (data resources for that are
difficult to find; even serious linguistic dictionaries only include a
part of the phonetics, and ignore the variants for derived orthographic
forms).

2016-02-20 22:43 GMT+01:00 Doug Ewell <doug_at_ewellic.org>:

> Eli Zaretskii wrote:
>
>> What about language-independent character-folding: where in the
>> Unicode database is the data for that?
>
> The OP kind of alluded to that: there is no such thing really as
> language-independent character folding.
>
> About the closest approximation you can get using Unicode data alone (not
> CLDR) is to normalize to NFD, then ignore the combining diacritics. But
> that still doesn't work for a character like ø, which doesn't decompose to
> o + anything, and more importantly, it still won't meet expectations
> because of the n/ñ and o/ö/ø language-dependency problems.
>
> As Mark and Philippe said, the real solution is to use CLDR, because that
> is where language-dependent information like this lives.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>
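
The NFD-based approximation described in the quoted message can be sketched
in Python with the standard `unicodedata` module. The function name
`strip_diacritics` is just a label chosen for this example; this is the
language-independent approximation only, with the limitations noted in the
quote.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Normalize to NFD, then drop every character with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("ñ"))  # n
print(strip_diacritics("ö"))  # o
print(strip_diacritics("ø"))  # ø (no canonical decomposition, so it is unchanged)
```

Folding ñ to n and ö to o unconditionally is exactly what some languages do
not want, which is why the quote points to CLDR for the language-dependent
part.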
Received on Sat Feb 20 2016 - 17:20:47 CST
