Re: "A Programmer's Introduction to Unicode"

From: Janusz S. Bień <>
Date: Sun, 12 Mar 2017 07:04:56 +0100

On Fri, Mar 10 2017 at 19:55 CET, writes:
> I recently wrote
> , which sort of addresses the whole hangup programmers have with
> treating code points as "characters".


This is just another confirmation that the present Unicode terminology
is confusing. Let me recall below a fragment of an old thread about

Best regards


On Thu, Sep 15 2016 at 21:12 CEST, writes:
> On Thu, Sep 15 2016 at 16:36 CEST, writes:
> [...]
>> In the new Swift programming language, which is white-hot in the Apple
>> community, Apple is moving toward a model of a transparent, generic
>> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
>> but in which a “character” contains however many code points it needs
>> (“e” with a stacked macron, acute accent, and dieresis is
>> algorithmically one “character” in Swift). Moreover,
>> e-with-an-acute-accent and e followed by a combining acute accent, for
>> example, compare as equal. At present, the underlying code is still
>> UTF-16LE.
> For several years I have used the name "textel" (text element, in Polish
> "tekstel") for such objects. I do it mostly orally in my presentations
> for my students, but I have also used it in writing, e.g. in
>, unfortunately without a proper
> definition. A rudimentary definition was provided for me only in my
> recent paper in Polish: It states simply
> (on p. 69) "an elementary text element independently of its Unicode
> representation" (meaning in particular composed vs precomposed). I still
> hope to formulate sooner or later a more satisfactory definition :-)
> I think Swift confirms that such a notion is really needed.
> Best regards
> Janusz

On Wed, Sep 21 2016 at 6:44 CEST, writes:
> On Tue, Sep 20 2016 at 18:09 CEST, writes:
>> Janusz Bień wrote:
>>> For me it means that Swift's characters are equivalence classes of the
>>> set of extended grapheme clusters by canonical equivalence relation.
>> I still hope we can come to some conclusion on the correct Unicode name
>> for this concept. I don't think non-Unicode interpretations of terms
>> like "grapheme" are grounds for throwing out "grapheme cluster,"
> I agree.
>> but I can see that the equivalence class itself is lacking a name.
> I'm glad.
>> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
>> are identical entities, only that the language compares them as equal.
> I'm fully aware of this.
> Best regards
> Janusz
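
The "equivalence class" reading quoted above can be sketched as
follows, again in Python with the standard unicodedata module. This is
only an illustration under stated assumptions: each input string is
assumed to be a single extended grapheme cluster already (no
segmentation is done), the NFD form is used as an arbitrary key for
each class, and the function name canonical_classes is hypothetical.

```python
import unicodedata
from collections import defaultdict


def canonical_classes(sequences):
    """Hypothetical helper: group code point sequences into classes
    of canonically equivalent members, keyed by their NFD form."""
    classes = defaultdict(list)
    for seq in sequences:
        classes[unicodedata.normalize("NFD", seq)].append(seq)
    return list(classes.values())


samples = [
    "\u00e9",          # <00E9>
    "e\u0301",         # <0065 0301>
    "e",               # <0065>
    "\u0113\u0301",    # <0113 0301>, e-with-macron plus combining acute
    "e\u0304\u0301",   # <0065 0304 0301>
]

for members in canonical_classes(samples):
    # Show each class as lists of code points, to make clear that the
    # members remain distinct sequences even though they are equivalent.
    print([[f"{ord(c):04X}" for c in m] for m in members])
```

The members of a class stay distinct as code point sequences; only the
grouping treats them as one object, which matches the remark that
<00E9> and <0065 0301> compare as equal without being identical.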

Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
Received on Sun Mar 12 2017 - 00:05:31 CST
