RE: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Dec 09 2003 - 13:14:03 EST

> Hmm. Now here's some C++ source code (syntax colored as
> Philippe suggests, to imply that the text editor understands
> C++ at least well enough to color it)
>
> int n = wcslen(L"café");
>
> (That's int n = wcslen(L"café"); for those without HTML email)
>
> The L prefix on a string literal makes it a wide-character
> string, and wcslen() is simply a wide-character version of
> strlen(). (There is no guarantee that "wide character" means
> "Unicode character", but let's just assume that it does, for
> the moment).

Even assuming that you can assume that "wide characters" are Unicode, you
have not yet assumed which UTF they are in. (Don't assume I am
deliberately making puns :-)

The only thing that the C(++) standards say about type "wchar_t" is that it
is not smaller than type "char", so a "wide character" could well be a byte,
and a "wide character string" could well be UTF-8, or even ASCII.
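
As a rough illustration of that point, the sketch below reports how wide
"wchar_t" actually is on whatever implementation compiles it. The function
names wchar_bits and wchar_holds_any_code_point are invented for this
example; CHAR_BIT and WCHAR_MAX are standard macros.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* wchar_t, size_t */
#include <wchar.h>   /* WCHAR_MAX */

/* Hypothetical helper: number of bits occupied by one wchar_t here. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: nonzero if one wchar_t can hold every Unicode
   code point (up to U+10FFFF), zero if it cannot (e.g. 16-bit wchar_t,
   where a UTF-16-like encoding would need surrogate pairs). */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

The results differ between platforms, so nothing portable can be concluded
from them beyond what the implementation in front of you happens to do.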

> So, should n equal four or five?

Why not six?

If, in our C(++) compiler, type "wchar_t" is an alias for "char", and "wide
character strings" are encoded in UTF-8, and the "é" is decomposed, then n
will be equal to 6.
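
The arithmetic can be checked with a small sketch. To keep the result
independent of how the source file is saved, the string is spelled out here
as explicit code units instead of a literal "é". All identifiers are made up
for the illustration, and the wide arrays stand in for one code unit per
code point.

```c
#include <stddef.h>  /* size_t */
#include <string.h>  /* strlen */
#include <wchar.h>   /* wcslen */

/* "café" as explicit UTF-8 bytes */
static const char cafe_nfc_utf8[] = "caf\xC3\xA9";   /* precomposed U+00E9 */
static const char cafe_nfd_utf8[] = "cafe\xCC\x81";  /* e + combining U+0301 */

/* "café" as one wide code unit per code point */
static const wchar_t cafe_nfc_wide[] = { L'c', L'a', L'f', 0x00E9, 0 };
static const wchar_t cafe_nfd_wide[] = { L'c', L'a', L'f', L'e', 0x0301, 0 };

/* Length in bytes when the string is UTF-8; decomposed selects NFD. */
size_t cafe_len_utf8(int decomposed)
{
    return strlen(decomposed ? cafe_nfd_utf8 : cafe_nfc_utf8);
}

/* Length in wide code units; decomposed selects NFD. */
size_t cafe_len_wide(int decomposed)
{
    return wcslen(decomposed ? cafe_nfd_wide : cafe_nfc_wide);
}
```

Here cafe_len_wide gives 4 (composed) or 5 (decomposed), and cafe_len_utf8
gives 5 (composed) or 6 (decomposed), which is where the "six" comes from.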

> The answer would appear to depend on whether or not the
> source file was saved in NFC or NFD format.

The answer is:

int n = wcslen(L"café");

That's why you go to the trouble of calling the "wcslen" library function
rather than hard-coding a value such as:

int n = 4; // the length of string "café"

> There is more to consider than just how and whether a text
> editor normalizes.

Whatever the editor does, what if the *compiler* then normalizes it?

The source file and the compiled object file are not necessarily in the same
encoding and/or normalization.

A given compiler could accept a range of input encodings (perhaps declared
with a command-line parameter) and convert them all into a single internal
representation in the compiled object file (e.g., Unicode expressed in a
particular UTF and with a particular normalization).

That's why library functions such as "strlen" or "wcslen" exist. You don't
need to care what these functions return in a particular compiler or
environment, as long as the following code is guaranteed to work:

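        #include <stdlib.h>  /* malloc */
        #include <wchar.h>   /* wcslen, wcscpy */

        const wchar_t * myText = L"café";
        /* + 1 for the terminating null wide character */
        wchar_t * myBuffer = malloc(sizeof(wchar_t) * (wcslen(myText) + 1));
        if (myBuffer != NULL)
        {
                wcscpy(myBuffer, myText);
        }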

> If a text editor is capable of dealing with Unicode text,
> perhaps it should also be able to explicitly DISPLAY the
> actual composition form of every glyph.

Again, this is neither possible nor desirable, because a text editor is not
supposed to know how the compiler (or its runtime libraries) will transform
string literals.

> The question I posed in the previous paragraph should
> ideally be obvious by sight - if you see four characters,
> there are four characters; if you see five characters, there
> are five characters.

Provided that you can define what a "character" is... After a few years of
reading this mailing list, I haven't seen a single acceptable definition of
"character".

Moreover, I have come to the impression that having such a definition is
totally irrelevant:

- as an end user, I am interested in a higher-level kind of object (let's
call them "graphemes", i.e. those things I see on the screen and can
interact with using my mouse);

- as a programmer, I am interested in a lower-level kind of object (let's
call them "encoding units", i.e. those things that I count when I have to
allocate memory for a string, or the like).
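
The gap between the two levels can be sketched in code. The function below
is a hypothetical helper, not a library API: it assumes valid,
null-terminated UTF-8 input, and it approximates "graphemes" by skipping
only the Combining Diacritical Marks block (U+0300..U+036F), which is a
deliberate simplification of real grapheme segmentation.

```c
#include <stddef.h>  /* size_t */

/* Hypothetical summary of one UTF-8 string at three levels. */
struct text_counts {
    size_t code_units;        /* bytes: what you allocate memory for */
    size_t code_points;       /* decoded Unicode values */
    size_t approx_graphemes;  /* code points outside U+0300..U+036F */
};

struct text_counts count_text(const char *s)
{
    struct text_counts c = { 0, 0, 0 };
    const unsigned char *p = (const unsigned char *)s;

    while (*p != 0) {
        unsigned long cp;
        size_t len, i = 1;

        /* The lead byte gives the sequence length and the top bits. */
        if (*p < 0x80)      { cp = *p;        len = 1; }
        else if (*p < 0xE0) { cp = *p & 0x1F; len = 2; }
        else if (*p < 0xF0) { cp = *p & 0x0F; len = 3; }
        else                { cp = *p & 0x07; len = 4; }

        /* Fold in continuation bytes; stop early at anything else, so a
           truncated sequence never reads past the terminating null. */
        while (i < len && (p[i] & 0xC0) == 0x80) {
            cp = (cp << 6) | (p[i] & 0x3F);
            i++;
        }

        c.code_units += i;
        c.code_points += 1;
        if (cp < 0x0300 || cp > 0x036F)
            c.approx_graphemes += 1;
        p += i;
    }
    return c;
}
```

For decomposed "café" given as "cafe\xCC\x81", this yields 6 encoding units,
5 code points and 4 approximate graphemes; for the precomposed form
"caf\xC3\xA9" it yields 5, 4 and 4. Only the last figure matches what the
end user sees in both cases.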

The term "character" is in a sort of conceptual limbo which makes it pretty
useless for everybody, IMHO.

_ Marco
