Re: "textels" from Philippe Verdy on 2016-09-15 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Fri, 16 Sep 2016 05:41:23 +0200

2016-09-15 21:56 GMT+02:00 Janusz S. Bień <jsbien_at_mimuw.edu.pl>:

> On Thu, Sep 15 2016 at 21:27 CEST, eliz_at_gnu.org writes:
>
> [...]
>
> > Isn't "grapheme cluster" the definition you are looking for?
>
> I don't think so.
>
> However:
>
> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>

Your definition of textels is also language dependent, as you are reading
it from a Polish point of view.
However, you are confusing "graphemes" with "grapheme clusters" here.

Your (Polish) textels are in fact the same as the (Polish) grapheme
clusters.

Unicode also defines "default grapheme clusters", which are "grapheme
clusters" not tailored for a particular language. A "default grapheme
cluster" is the minimum unbreakable unit that can be seen as a valid
"grapheme cluster" in most languages (or at least in most languages using
the same base script, if the script is used in that language; in other
scripts, it just provides a minimum compatibility level to allow insertion
of foreign texts in a multilingual document).

The grapheme clusters can then be used to parse text and apply various
processes such as

  - normalization: grapheme clusters are not broken by it and can be
compared for canonical equivalence (but you can compare smaller units
using only the combining class property by breaking text on characters with
CC=0 and handling the special algorithmic case of modern Hangul syllables;
see the Unicode standard about normalization)
  - BiDi layout
  - line breaking
  - word breaking
  - most standard text transforms (such as case folding)
  - transliteration
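
As a rough illustration of the parenthetical in the normalization item
above, here is a minimal Python sketch that breaks text on characters with
CC=0 and reorders the marks of each run by combining class. It assumes the
input is already fully decomposed and it leaves out the algorithmic Hangul
case; the function names are made up for this example.

```python
import unicodedata

def starter_runs(text):
    """Split text into runs, starting a new run at each character with CC=0."""
    runs = []
    for ch in text:
        if unicodedata.combining(ch) == 0 or not runs:
            runs.append(ch)      # a starter (or the very first character)
        else:
            runs[-1] += ch       # a combining mark joins the current run
    return runs

def reorder_run(run):
    """Stable sort by combining class; the starter (CC=0) stays in front."""
    return "".join(sorted(run, key=unicodedata.combining))

# Same letters, with the acute (CC=230) and ogonek (CC=202) typed in both orders.
a = "n\u0301" + "a\u0301\u0328"
b = "n\u0301" + "a\u0328\u0301"
runs_a = [reorder_run(r) for r in starter_runs(a)]
runs_b = [reorder_run(r) for r in starter_runs(b)]
print(runs_a == runs_b)
```

Comparing run by run like this only needs the combining class property,
which is why the units can be smaller than grapheme clusters.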

Rendering text, however, often requires larger units, as successive grapheme
clusters (if not split by a line break or by BiDi reordering) will interact
visually to create more complex layouts (notably in Indic scripts), glued
together by some controls (notably joining controls); they are also
made more complex in some cases where combining classes alone cannot
properly represent these interactions.

Additionally, in a few cases, the visual order is used for encoding text
instead of the standard model using the logical order: this was done to
preserve roundtrip compatibility between Unicode and widely used legacy
encodings (notably for the Thai script). However, this has a known caveat
(which already existed before Unicode) for some algorithms such as word
breaking (implementations need a lookup dictionary, but in Thai this
dictionary is not very large) and line breaking (if we don't want to break
words or break in the middle of syllables). The default grapheme
clusters, however, will correctly break the text to allow Thai text (encoded
in visual order) to be rendered correctly.
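
To make the visual versus logical order point concrete, here is a small
Python sketch that prints the stored code points of one Thai and one
Devanagari syllable. The two sample strings are my own choice for
illustration; only the standard unicodedata module is used.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, name) for each stored character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]

# Thai: the vowel sign is written to the left and is also stored first.
thai = "\u0e40\u0e01"
# Devanagari: the vowel sign is drawn to the left but stored after the consonant.
devanagari = "\u0915\u093f"

for sample in (thai, devanagari):
    for entry in describe(sample):
        print(*entry)
    print()
```

In the Thai sample the preposed vowel is an ordinary letter (category Lo)
stored before its consonant, while in the Devanagari sample the vowel sign
is a spacing mark (category Mc) stored after it.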

In summary, the concept of "grapheme clusters" must be read and understood
in the Unicode standard only as Unicode terminology used to describe all
the other algorithms described in the standard. They are not bound to a
particular language unless this language is explicitly specified along with
the term; in that case we are no longer handling the "default grapheme
clusters" rules but the additional rules that tailor the basic rules used
to define the default grapheme clusters.

The "extended grapheme clusters" are used in contexts requiring more complex
algorithms that need to group several grapheme clusters in an ordered
sequence. These algorithms require some text buffering, and parsing from a
random position in text may require looking backward on larger lengths to
determine the context. Parsing text sequentially also requires keeping some
additional context variables. Plain text searches based on "extended
grapheme clusters" are also much more challenging than searches on "default
grapheme clusters".

For these reasons, the "extended grapheme clusters" are not defined in
"default grapheme clusters" but will be needed for matching user
expectations in particular languages or scripts. You normally don't need
any "extended grapheme clusters" in Polish, except in multilingual
documents that are embedding some non-Latin scripts, or some technical
notations.

> 2. Textel "ń" means both U+0144 and <U+006E,U+0301>, so it is a notion
> on a higher abstraction level then a grapheme cluster.
>
> Moreover I don't want to call <U+006E,U+0301> (LATIN SMALL LETTER N,
> COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2
> reasons:
>
> 1. there is nothing extended in it
>

This <U+006E,U+0301> combination is first a "grapheme cluster", before
being also an "extended grapheme cluster" in Unicode terminology.
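
For what it is worth, the relation between the two encodings of "ń" can be
checked with a few lines of Python. This is only a sketch using the
standard unicodedata module; the variable names are mine.

```python
import unicodedata

precomposed = "\u0144"     # LATIN SMALL LETTER N WITH ACUTE
decomposed = "n\u0301"     # LATIN SMALL LETTER N + COMBINING ACUTE ACCENT

# As code point sequences the two strings differ...
same_code_points = (precomposed == decomposed)

# ...but normalization maps each form onto the other (canonical equivalence).
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
nfd_of_precomposed = unicodedata.normalize("NFD", precomposed)

print(same_code_points)
print(nfc_of_decomposed == precomposed)
print(nfd_of_precomposed == decomposed)
```

Both forms are a single base letter with its accent, so either way the
result is one grapheme cluster.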

The term "extended" comes from an extension added not for the case of
combining characters encoded after base characters (or combined with them in
a canonically equivalent string), but for other extensions, notably for
complex syllabic constructs:

Every "grapheme cluster" may also be an "extended grapheme cluster", but
the reverse is NOT true.

You have to read the standard about the various kinds of text breaking
processes.

> 2. U+0301 is not a grapheme according to Polish linguistics terminology
>

Polish linguistics uses its own Polish term, not "grapheme", which in the
standard is defined in English and serves as the basis for the other
definitions needed for parsing texts in various languages.

In Unicode U+0301 would be a grapheme, but if used in isolation it would
not form a complete grapheme cluster, but a defective grapheme cluster, as
it lacks the base with which it should be associated and which should be
encoded before it (that base cannot be a non-character or a control, even
if these are blockers against reordering for normalization processes and
canonical equivalences, and cannot be another combining character).
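
A very rough Python sketch of that last check follows. It only tests
whether a sequence begins with a combining mark (general category M*) and
so has no base before it; it does not cover the control and non-character
cases mentioned above, and the function name is hypothetical.

```python
import unicodedata

def starts_without_base(sequence):
    """True if the first character is a combining mark, i.e. no base precedes it."""
    return bool(sequence) and unicodedata.category(sequence[0]).startswith("M")

print(starts_without_base("\u0301"))    # isolated COMBINING ACUTE ACCENT
print(starts_without_base("n\u0301"))   # the accent preceded by its base "n"
```
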
Received on Thu Sep 15 2016 - 22:42:19 CDT
