Re: "A Programmer's Introduction to Unicode"

From: Manish Goregaokar <manish_at_mozilla.com>
Date: Sun, 12 Mar 2017 11:43:22 -0700

> This is just another confirmation that the present Unicode terminology
> is confusing.

I find this to be a symptom of how we teach "characters" in
programming: most folks are taught that characters, bytes, and code
points are all the same thing, especially because many languages try
to make that the case. The name "grapheme cluster" could be improved
upon, but it's not the primary source of this confusion.
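
To make the distinction concrete, here is a minimal sketch in Python
(not taken from the blog post; the helper names canonically_equal and
approx_cluster_count are made up for illustration). It uses only the
standard unicodedata module, so the cluster count is a crude
approximation that merely skips combining marks; real segmentation
follows UAX #29 and needs a dedicated library.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """True if a and b are canonically equivalent (compared after NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def approx_cluster_count(s: str) -> int:
    """Rough stand-in for grapheme clusters: count the code points that are
    not combining marks. An assumption for illustration, not UAX #29."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


precomposed = "\u00e9"            # e-with-acute as a single code point
decomposed = "e\u0301"            # e followed by a combining acute accent
stacked = "e\u0304\u0301\u0308"   # e + macron + acute + dieresis

samples = [
    ("precomposed", precomposed),
    ("decomposed", decomposed),
    ("stacked", stacked),
]
for label, s in samples:
    print(
        label,
        "UTF-8 bytes=%d" % len(s.encode("utf-8")),
        "UTF-16 units=%d" % (len(s.encode("utf-16-le")) // 2),
        "code points=%d" % len(s),
        "approx clusters=%d" % approx_cluster_count(s),
    )

print(precomposed == decomposed)                   # False: different code points
print(canonically_equal(precomposed, decomposed))  # True: equal after NFC
```

The two spellings of "é" differ in bytes, code units, and code points,
yet each counts as one cluster here, and they only compare as equal
once both are normalized.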
-Manish

On Sat, Mar 11, 2017 at 10:04 PM, Janusz S. Bień <jsbien_at_mimuw.edu.pl> wrote:
> On Fri, Mar 10 2017 at 19:55 CET, manish_at_mozilla.com writes:
>> I recently wrote
>> http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>> , which sort of addresses the whole hangup programmers have with
>> treating code points as "characters".
>
> [...]
>
> This is just another confirmation that the present Unicode terminology
> is confusing. Let me recall below a fragment of an old thread about
> "textels".
>
> Best regards
>
> Janusz
>
>
> On Thu, Sep 15 2016 at 21:12 CEST, jsbien_at_mimuw.edu.pl writes:
>> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy_at_gmail.com writes:
>>
>> [...]
>>
>>> In the new Swift programming language, which is white-hot in the Apple
>>> community, Apple is moving toward a model of a transparent, generic
>>> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
>>> but in which a “character” contains however many code points it needs
>>> (“e” with a stacked macron, acute accent, and dieresis is
>>> algorithmically one “character” in Swift). Moreover,
>>> e-with-an-acute-accent and e followed by a combining acute accent, for
>>> example, compare as equal. At present, the underlying code is still
>>> UTF-16LE.
>>
>> For several years I have used the name "textel" (text element, in
>> Polish "tekstel") for such objects. I do it mostly orally in my
>> presentations for my students, but I have also used it in writing,
>> e.g. in http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
>> definition. I provided a rudimentary definition only in my
>> recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply
>> (on p. 69) "an elementary text element independently of its Unicode
>> representation" (meaning in particular composed vs precomposed). I still
>> hope to formulate sooner or later a more satisfactory definition :-)
>>
>> I think Swift confirms that such a notion is really needed.
>>
>> Best regards
>>
>> Janusz
>
> On Wed, Sep 21 2016 at 6:44 CEST, jsbien_at_mimuw.edu.pl writes:
>> On Tue, Sep 20 2016 at 18:09 CEST, doug_at_ewellic.org writes:
>>> Janusz Bień wrote:
>>>
>>>> For me it means that Swift's characters are equivalence classes of the
>>>> set of extended grapheme clusters by canonical equivalence relation.
>>>
>>> I still hope we can come to some conclusion on the correct Unicode name
>>> for this concept. I don't think non-Unicode interpretations of terms
>>> like "grapheme" are grounds for throwing out "grapheme cluster,"
>>
>> I agree.
>>
>>> but I can see that the equivalence class itself is lacking a name.
>>
>> I'm glad.
>>
>>>
>>> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
>>> are identical entities, only that the language compares them as equal.
>>
>> I'm fully aware of this.
>>
>> Best regards
>>
>> Janusz
>
>
> --
> Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
> Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
> jsbien@uw.edu.pl, jsbien@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
>
Received on Sun Mar 12 2017 - 13:44:31 CDT
