From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Mar 06 2004 - 10:59:38 EST
From: "Peter Kirk" <peterkirk@qaya.org>
> > Sindhi *does* have a distinction between two "kaf" characters, as it
> > writes unaspirated /k/ and aspirated /kh/ with distinct characters
> > (not using a digraph for /kh/, as Urdu does). The plain /k/ is written
> > with a "swash kaf" form (see U+06AA), while /kh/ is written with a
> > "keheh" form (U+06A9). So there is a clear need for a plain-text
> > distinction between two "kaf" letters in Sindhi.
>
>
> Now I know that "swash kaf" is not the same as kaf, so the situation is
> not as simple as I had remembered. But the point remains that letters
> which seem to be graphical variants in one language may in fact be
> distinct letters in another language. That is one good reason to avoid
> hasty unification of characters. I don't think it applies to your
> various th ligatures, but then there could well be a dictionary out
> there somewhere which uses one of your supposedly equivalent ligatures
> for the voiced th and another one for the unvoiced th.

Don't we have a similar (but reversed) issue with the "oe" ligature in French,
where it is considered a glyph variant of the two letters "o" and "e" (one
normally mandated by French typographic rules, which are sometimes also
considered orthographic), whereas other languages consider "oe" a separate,
distinct letter?

Unicode chose not to unify "oe" with the existing two letters, even though it is
the normal presentation form in French when these two letters are written side
by side. The exceptions are an accent on the e (for example "coéquipier") and a
few words like "coexister", "coefficient", and "coercitif", which historically
were written with a "tréma" (diaeresis) on the e to avoid this ligature. Modern
French has removed nearly all such trémas except in words like "Noël", simply
because the ligature is now only used to ligate "o" with "eu", where the "o" is
assimilated into the "eu", which is just pronounced longer, as in "coeur",
"boeuf" or "choeur". So the tréma remains in "Noël", "Joël" and "Boël", which
now tend to be written in some places (notably in proper names) with a grave
accent or with a separating silent "h".

Same problem with the "ae" ligature in French, which is the normal form for "a"
plus "e" without an accent, and is just pronounced as a long "é" in the Roman
Latin pronunciation (where the "a" is assimilated into the following "e",
pronounced "é", when a Roman Latin word is read in French).
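
To make the "not unified" point concrete, here is a small Python sketch using
only the standard unicodedata module. It shows that compatibility normalization
(NFKD) leaves U+0153 and U+00E6 untouched, while a pure presentation ligature
such as U+FB01 is folded back into its two letters. The helper names
compat_form and report are made up for this illustration.

```python
import unicodedata

def compat_form(ch: str) -> str:
    """Return the NFKD (compatibility-decomposed) form of a single character."""
    return unicodedata.normalize("NFKD", ch)

def report(ch: str) -> str:
    """One-line summary: code point, character name, and NFKD result."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}: NFKD -> {compat_form(ch)!r}"

# U+0153 (oe) and U+00E6 (ae) have no decomposition mapping;
# U+FB01 (fi ligature) has a compatibility decomposition to "f" + "i".
for ch in ("\u0153", "\u00e6", "\ufb01"):
    print(report(ch))
```

So normalization alone will not make "oe" and the ligature compare equal in
French text; any such equivalence has to come from a higher-level rule.
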

One difficulty with disunifying characters is that it creates an additional
semantic or orthographic distinction which does not exist in one language but
may exist in others. So disunification creates cases where additional collation
rules are needed to correctly handle a language in which the Unicode
distinction is not significant (here French "oe", but tomorrow perhaps a
"swash kaf" variant of "kaf" in non-Sindhi languages).

Are there similar issues in scripts that contain a lot of letter-form variants
(notably Arabic), which may be considered equivalent in most cases but distinct
in others? I know about the case of Devanagari, and how it was solved by
explicitly encoding format characters to control the letter form in grapheme
clusters.
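
For readers unfamiliar with the Devanagari case, the following Python sketch
lists the code points involved, assuming the format characters meant here are
ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER. The letters KA and SSA are an
arbitrary example pair, and the labels and the describe helper are made up for
this illustration. What actually gets displayed depends on the font and the
rendering engine, so the code only inspects the encoded sequences.

```python
import unicodedata

KA, VIRAMA, SSA = "\u0915", "\u094d", "\u0937"
ZWJ, ZWNJ = "\u200d", "\u200c"

# Same letters, three plain-text sequences asking for different letter forms.
SEQUENCES = {
    "conjunct (default)": KA + VIRAMA + SSA,
    "half form requested": KA + VIRAMA + ZWJ + SSA,
    "explicit virama requested": KA + VIRAMA + ZWNJ + SSA,
}

def describe(text):
    """List each code point in text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for label, seq in SEQUENCES.items():
    print(label)
    for line in describe(seq):
        print("   ", line)
```

Both joiners have general category Cf (format), so the letter-form choice is
carried in plain text without encoding a separate letter for each variant.
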