Re: graphemes from Philippe Verdy on 2016-09-28 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 28 Sep 2016 12:41:07 +0200

2016-09-28 10:24 GMT+02:00 Christoph Päper <christoph.paeper_at_crissov.de>:

>
> My oldest quote is from Heller 1980, but I think it was introduced earlier
> (maybe by Gelb). McLaughlin 1963 proposes “graphoneme”. The terms are not
> very common, probably because everyone just uses their definition of
> “grapheme”.
>
> > *grapheme*
> > term intended to designate a unit of a writing system, parallel to
> > phoneme and morpheme,
> > but in practice used as a synonym for letter, diacritic, character (2),
> > or sign (2)
>

IMHO, the term grapheme only applies (traditionally) to the written
**form**, i.e. the **graphic** item which can be clearly separated from
others (even if there's some joining). So a grapheme may as well represent
several logical letters (as they are spelled orally). Some ligatures are
mandatory in the written form of a script, and the grapheme represents the
set of graphical variations that will be read the same in a language (in
fact what Unicode may also designate as "confusable characters").

So the grapheme for A does not really differentiate the Latin, Greek and
Cyrillic versions, even if, when analyzing them in a linguistic context,
these letters are read differently ("a" vs. "alpha", which is in fact not
really a distinction of the script but of the linguistic tradition of how
alphabets are spelled in the vocal language), and the graphemes do not
have any case pairings, which are part of the semantics of the script as used
for the orthography of a given language. But in the vocal language the case
distinctions are almost never relevant. The written form adds some
distinctions while still carrying the initial semantics of the language. This
makes scripts (or more exactly writing systems) more complex to map within
a unified universal encoding.
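
To make the contrast with the encoding concrete, here is a minimal Python sketch (my own illustration, using only the standard `unicodedata` module; the helper name `describe` is hypothetical). It shows that the three look-alike capitals are three separate abstract characters in Unicode, each with its own name and its own case pairing:

```python
import unicodedata

# Three visually near-identical capitals from three scripts:
# Latin A, Greek Alpha, Cyrillic A.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]

def describe(ch):
    """Return (code point, character name, code point of the lowercase partner)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        f"U+{ord(ch.lower()):04X}",
    )

for ch in LOOKALIKES:
    print(describe(ch))
# ('U+0041', 'LATIN CAPITAL LETTER A', 'U+0061')
# ('U+0391', 'GREEK CAPITAL LETTER ALPHA', 'U+03B1')
# ('U+0410', 'CYRILLIC CAPITAL LETTER A', 'U+0430')
```

The name and the case mapping are properties attached to the encoded character, not to the shared graphic form.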

Graphemes are then weaker definitions of what Unicode encodes as abstract
characters (to map on them additional properties that are not relevant at
the grapheme level but useful to parse the semantic of a complete text).

The abstract characters in Unicode do not distinguish some letter forms
even if traditionally the scripts and their associated writing systems for
a language make clear distinctions: a "Fraktur Latin" letter A is a
distinct "grapheme" from the modern cursive letter A even if they map to
the same Unicode abstract character (as a result of unification), but the
grapheme for the modern cursive letter A is the same between Latin,
Cyrillic and Greek scripts.

There are however significant differences when handling diacritics (e.g.
the diaeresis in German works very differently as an umlaut in the Fraktur
script than in the current modern script and really acts as a plain
distinct letter: the grapheme differences are exposed in this case even if
the Unicode-encoded letters unify them; and even logically, when spelling
them vocally, there's a clear difference between the diaeresis as used in
French or English and the umlaut used in German and several other
Central European languages).
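
The unification on the encoding side can be checked with a small Python sketch (again my own illustration with the standard `unicodedata` module; `marks` is a hypothetical helper name). After canonical decomposition, the French/English diaeresis and the German umlaut turn out to be the same combining character:

```python
import unicodedata

def marks(word):
    """Names of the combining marks found after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]

print(marks("na\u00efve"))  # "naïve": diaeresis  -> ['COMBINING DIAERESIS']
print(marks("f\u00fcr"))    # "für": umlaut       -> ['COMBINING DIAERESIS']
```

Nothing in the encoded text records which of the two functions the mark has; that knowledge belongs to the language and its writing system.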

So I think that the term "grapheme" cannot be formally defined in Unicode:
it does not match anything that is encoded. What is encoded is the
possibility to represent "grapheme clusters" (the set of graphical forms
which are minimally distinguished but not minimally separated in a specific
language) and map them with a sequence of Unicode-encoded "abstract
characters" (whose individual identity does not match exactly the
traditional graphemes, and are also detached from the perceived
distinctions of writing systems in a specific language).
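
As a rough illustration of a cluster being a sequence of abstract characters, here is a deliberately simplified Python sketch. It is NOT the segmentation algorithm of UAX #29: it assumes only that a character whose General_Category starts with "M" (a mark) attaches to the preceding character, and it ignores Hangul syllables, emoji sequences, CR LF and everything else the real rules cover. The function name `approx_clusters` is hypothetical.

```python
import unicodedata

def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplifying assumption: any character in a mark category (Mn, Mc, Me)
    is appended to the previous cluster; everything else starts a new one.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

sample = "u\u0308ber"                 # "über" with u + COMBINING DIAERESIS
print(len(sample))                    # 5 abstract characters
print(len(approx_clusters(sample)))   # 4 clusters
```

The first cluster here consists of two encoded characters, neither of which is by itself the unit a German reader would recognize.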

Unicode cannot then define formally what a "grapheme" is. It can only give a
definition of "grapheme clusters", but it is mostly based on its own
definitions of properties (which are also not sufficient to carry all
distinctions for any given language in its writing systems). "Grapheme
clusters" in Unicode are also not required to have a significant graphic
form: they purely exist at the semantic level directly from their encoding and
can be used to generate other renderings (e.g. it can be rendered vocally,
or used to derive some other semantics, such as values of numbers, word
breaking...) or to infer some grammatical/orthographic rules to compose or
generate other texts.
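
One example of semantics derived purely from the encoding, with no graphic form involved, is the numeric value property. A minimal Python sketch (my own illustration with the standard `unicodedata` module; `numeric_values` is a hypothetical helper name):

```python
import unicodedata

def numeric_values(text):
    """Map each character to its Numeric_Value property, or None if it has none."""
    return [unicodedata.numeric(ch, None) for ch in text]

# DIGIT SEVEN, ARABIC-INDIC DIGIT THREE, VULGAR FRACTION ONE HALF,
# ROMAN NUMERAL TWELVE, LATIN CAPITAL LETTER A
print(numeric_values("7\u0663\u00bd\u216bA"))
# [7.0, 3.0, 0.5, 12.0, None]
```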

In summary, there's NO "grapheme" (in isolation) in Unicode and I think it
should not be defined: it would break expectations on languages, and the
universal repertoire does not encode specific languages, nor even any
specific writing system (the scripts in Unicode are NOT writing systems,
which are always dependent on the language using them, and also dependent
on the epoch and geographic area of use, for their working
rules/conventions).

So the "grapheme" *may* be used (contextually) as a letter, a diacritic, a
sign, or even a ligature (the ligature is not just contextual when it is
mandated by the writing system and adds some semantic distinctions,
depending on whether it is used or not; it's not just a question of "user
preferences" or "font styles"), or any combination of these, up to the
complete combination of what Unicode calls a "grapheme cluster" (the only
thing really encodable with one or more abstract characters).
Received on Wed Sep 28 2016 - 05:42:07 CDT
