RE: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Dec 09 2003 - 13:14:03 EST

    > Hmm. Now here's some C++ source code (syntax colored as
    > Philippe suggests, to imply that the text editor understands
    > C++ at least well enough to color it)
    >
    > int n = wcslen(L"café");
    >
    > (That's int n = wcslen(L"café"); for those without HTML email)
    >
    > The L prefix on a string literal makes it a wide-character
    > string, and wcslen() is simply a wide-character version of
    > strlen(). (There is no guarantee that "wide character" means
    > "Unicode character", but let's just assume that it does, for
    > the moment).

    Even assuming that you can assume that "wide characters" are Unicode, you
    have not yet assumed which UTF they are in. (Don't assume I am
    deliberately making calembours :-)

    The only thing that the C(++) standards say about type "wchar_t" is that it
    is not smaller than type "char", so a "wide character" could well be a byte,
    and a "wide character string" could well be UTF-8, or even ASCII.
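
    A minimal C sketch for checking what a given implementation actually uses
    for "wchar_t". The helper name wide_char_bits is hypothetical, and the
    result is implementation-defined, so no particular value is assumed here:

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* wchar_t, size_t */

/* Hypothetical helper: width of wchar_t in bits on this implementation.
   The value differs between compilers and platforms. */
size_t wide_char_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

    Printing the result of wide_char_bits() under different compilers is a
    quick way to see that the width of a "wide character" is not fixed.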

    > So, should n equal four or five?

    Why not six?

    If, in our C(++) compiler, type "wchar_t" is an alias for "char", and "wide
    character strings" are encoded in UTF-8, and the "é" is decomposed, then n
    will be equal to 6.
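
    The count can be checked by spelling out the UTF-8 bytes explicitly, so
    that neither the editor nor the compiler gets a chance to renormalize the
    literal. This is only a sketch: the names cafe_nfc, cafe_nfd and
    code_units are made up for illustration, and plain "char" strings stand in
    for the hypothetical byte-sized "wchar_t":

```c
#include <string.h>  /* strlen, size_t */

/* "café" in UTF-8 with a precomposed é (U+00E9 = C3 A9), i.e. NFC. */
const char cafe_nfc[] = "caf\xC3\xA9";

/* "café" in UTF-8 with e + COMBINING ACUTE ACCENT (U+0301 = CC 81), i.e. NFD. */
const char cafe_nfd[] = "cafe\xCC\x81";

/* Number of 8-bit code units before the terminating NUL. */
size_t code_units(const char *s)
{
    return strlen(s);
}
```

    Here code_units(cafe_nfc) is 5 and code_units(cafe_nfd) is 6, although
    both strings display as the same four-letter word.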

    > The answer would appear to depend on whether or not the
    > source file was saved in NFC or NFD format.

    The answer is:

            int n = wcslen(L"café");

    That's why you take the trouble to call the "wcslen" library function rather
    than assuming a hard-coded value such as:

            int n = 4; // the length of string "café"

    > There is more to consider than just how and whether a text
    > editor normalizes.

    Whatever the editor does, what if then the *compiler* normalizes it?

    The source file and the compiled object file are not necessarily in the same
    encoding and/or normalization.

    A certain compiler could accept a certain range of input encodings (maybe
    declared with a command-line parameter) and convert them all to a certain
    internal representation in the compiled object file (e.g., Unicode expressed
    in a particular UTF and with a particular normalization).

    That's why library functions such as "strlen" or "wcslen" exist. You don't
    need to worry about what these functions will return in a particular
    compiler or environment, as long as the following code is guaranteed to work:

            const wchar_t * myText = L"café";
            wchar_t * myBuffer = malloc(sizeof(wchar_t) * (wcslen(myText) + 1));
            if (myBuffer != NULL)
            {
                    wcscpy(myBuffer, myText);
            }
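
    For completeness, the same fragment can be wrapped in a function with the
    headers it needs. This is a sketch in C (in C++ the result of malloc would
    need a cast), and the name copy_wide_string is hypothetical:

```c
#include <stdlib.h>  /* malloc */
#include <wchar.h>   /* wchar_t, wcslen, wcscpy */

/* Hypothetical helper: returns a heap-allocated copy of text, or NULL if
   the allocation fails. The caller is responsible for calling free(). */
wchar_t *copy_wide_string(const wchar_t *text)
{
    /* Length in wchar_t units, whatever they happen to be here. */
    size_t units = wcslen(text);

    /* One extra unit for the terminating null wide character. */
    wchar_t *buffer = malloc(sizeof(wchar_t) * (units + 1));
    if (buffer != NULL)
    {
        wcscpy(buffer, text);
    }
    return buffer;
}
```

    The function never needs to know whether the literal ended up as four,
    five or six units: it only relies on "wcslen" and "wcscpy" agreeing with
    each other.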

    > If a text editor is capable of dealing with Unicode text,
    > perhaps it should also be able to explicitly DISPLAY the
    > actual composition form of every glyph.

    Again, this is neither possible nor desirable, because a text editor is not
    supposed to know how the compiler (or its runtime libraries) will transform
    string literals.

    > The question I posed in the previous paragraph should
    > ideally be obvious by sight - if you see four characters,
    > there are four characters; if you see five characters, there
    > are five characters.

    Provided that you can define what a "character" is... After a few years
    reading this mailing list, I haven't seen a single acceptable definition of
    "character".

    Moreover, I have formed the impression that having such a definition is
    totally irrelevant:

    - as an end user, I am interested in a higher-level kind of object (let's
    call them "graphemes", i.e. those things I see on the screen and can
    interact with using my mouse);

    - as a programmer, I am interested in a lower-level kind of object (let's
    call them "encoding units", i.e. those things that I count when I have to
    allocate memory for a string, or the like).
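
    The two kinds of counting can be sketched in C as follows. All the names
    are hypothetical, and the grapheme rule is deliberately crude: it treats
    only the Combining Diacritical Marks block (U+0300..U+036F) as attaching to
    the previous code point, whereas real segmentation follows the much richer
    rules of UAX #29:

```c
#include <stddef.h>  /* size_t */
#include <stdint.h>  /* uint32_t */

/* Crude assumption: only U+0300..U+036F count as combining marks. */
int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* What the end user counts: code points that start a visible unit,
   under the crude rule above. */
size_t approx_graphemes(const uint32_t *cps, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
    {
        if (!is_combining_mark(cps[i]))
            count++;
    }
    return count;
}

/* What the programmer counts: UTF-8 code units (bytes) needed to store
   the same code points, excluding any terminator. */
size_t utf8_units(const uint32_t *cps, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
    {
        if (cps[i] < 0x80)
            total += 1;
        else if (cps[i] < 0x800)
            total += 2;
        else if (cps[i] < 0x10000)
            total += 3;
        else
            total += 4;
    }
    return total;
}
```

    For "café" as { c, a, f, U+00E9 } and as { c, a, f, e, U+0301 }, the
    grapheme count is 4 in both cases, while the UTF-8 unit count is 5 and 6
    respectively.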

    The term "character" is in a sort of conceptual limbo which makes it pretty
    useless for everybody, IMHO.

    _ Marco


