Re: "textels" from Janusz S. Bień on 2016-09-16 (Unicode Mail List Archive)

From: Janusz S. Bień <jsbien_at_mimuw.edu.pl>
Date: Fri, 16 Sep 2016 15:52:26 +0200

On Thu, Sep 15 2016 at 21:56 CEST, jsbien_at_mimuw.edu.pl writes:

[...]

> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>
> 2. Textel "ń" means both U+0144 and <U+006E,U+0301>, so it is a notion
> on a higher abstraction level than a grapheme cluster.

In other words, textels are equivalence classes of some set of Unicode
character strings under an equivalence relation which is at the moment
open to discussion, but which is very close to the official Unicode
canonical equivalence (when working on a corpus of historical Polish we
noticed some cases where the standard Unicode equivalence was not
convenient).
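
To make the idea concrete, here is a minimal sketch in Python. It
assumes plain canonical equivalence as the relation (which, as said
above, is only close to what we need) and uses the NFC form as the
representative of each class. The name textel_key is hypothetical,
introduced only for this illustration.

```python
import unicodedata

def textel_key(s: str) -> str:
    """Return one representative (the NFC form) of the canonical-equivalence
    class of s. Hypothetical helper; assumes plain canonical equivalence."""
    return unicodedata.normalize("NFC", s)

precomposed = "\u0144"   # U+0144 LATIN SMALL LETTER N WITH ACUTE
decomposed = "n\u0301"   # <U+006E, U+0301>

# As code point sequences the two strings differ ...
print(precomposed == decomposed)                          # False
# ... but they fall into the same equivalence class.
print(textel_key(precomposed) == textel_key(decomposed))  # True
```

Any other fixed normalization form (e.g. NFD) would serve equally well
as the representative; only the partition into classes matters.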

[...]

On Thu, Sep 15 2016 at 21:27 CEST, leoboiko_at_namakajiri.net writes:
> Isn't the Swift "character" and the "textel" merely the same thing as
> what Unicode already named "grapheme clusters"?

As for the Swift "character", perhaps someone fluent in Swift will answer
the question?

> (Well, technically UAX
> #29[1] defines them as "user-perceived characters", but then says
> grapheme clusters approximate user-perceived characters
> algorithmically).
>
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html

Thank you very much for the link. Let me quote the relevant fragment:

--8<---------------cut here---------------start------------->8---

Extended Grapheme Clusters

Every instance of Swift’s Character type represents a single extended
grapheme cluster. An extended grapheme cluster is a sequence of one or
more Unicode scalars that (when combined) produce a single
human-readable character.

Here’s an example. The letter é can be represented as the single Unicode
scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same
letter can also be represented as a pair of scalars—a standard letter e
(LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE
ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically
applied to the scalar that precedes it, turning an e into an é when it
is rendered by a Unicode-aware text-rendering system.

In both cases, the letter é is represented as a single Swift Character
value that represents an extended grapheme cluster. In the first case,
the cluster contains a single scalar; in the second case, it is a
cluster of two scalars:

[...]

*Two String values (or two Character values) are considered equal if
their extended grapheme clusters are canonically equivalent.*

--8<---------------cut here---------------end--------------->8---

To me this means that Swift's characters are equivalence classes of the
set of extended grapheme clusters under the canonical equivalence
relation.
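
A rough sketch of this reading, in Python rather than Swift. Note the
assumptions: approx_clusters is a hypothetical helper that merely
attaches characters with a non-zero canonical combining class to the
preceding character; it is NOT the UAX #29 segmentation algorithm and
ignores, for example, Hangul jamo and ZWJ sequences. Each cluster is
then replaced by its NFD form as the class representative.

```python
import unicodedata

def approx_clusters(s: str) -> list[str]:
    """Crude stand-in for grapheme cluster segmentation: a character
    followed by any characters with non-zero combining class.
    Assumption for illustration only, not UAX #29."""
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach combining mark to previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def character_classes(s: str) -> list[str]:
    """Map each approximate cluster to its NFD form, i.e. to a
    representative of its canonical-equivalence class."""
    return [unicodedata.normalize("NFD", c) for c in approx_clusters(s)]

a = "caf\u00e9"     # é as the single scalar U+00E9
b = "cafe\u0301"    # e followed by U+0301 COMBINING ACUTE ACCENT

print(len(a), len(b))                                  # 4 5 (code points)
print(len(character_classes(a)), len(character_classes(b)))  # 4 4
print(character_classes(a) == character_classes(b))    # True
```

The two strings differ as sequences of scalars, and even as sequences
of clusters, but agree once each cluster is replaced by its class.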

> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

I don't see a notion of such equivalence classes there.

On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy_at_gmail.com writes:

[...]

> In the new Swift programming language, which is white-hot in the Apple
> community, Apple is moving toward a model of a transparent, generic
> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
> but in which a “character” contains however many code points it needs
> (“e” with a stacked macron, acute accent, and dieresis is
> algorithmically one “character” in Swift). Moreover,
> e-with-an-acute-accent and e followed by a combining acute accent, for
> example, compare as equal. At present, the underlying code is still
> UTF-16LE.

If you insist that Swift's "characters" are just grapheme clusters, then
you add a different, although related, meaning to the term "grapheme
cluster". I think the notion deserves a term of its own.

Best regards

Janusz

-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien@uw.edu.pl, jsbien@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Received on Fri Sep 16 2016 - 08:52:52 CDT
