Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Dec 09 2003 - 12:12:59 EST

    On 09/12/2003 07:00, Arcane Jill wrote:

    >
    > Hmm. Now here's some C++ source code (syntax colored as Philippe
    > suggests, to imply that the text editor understands C++ at least well
    > enough to color it)
    >
    > int n = wcslen(L"café");
    >
    > (That's int n = wcslen(L"café"); for those without HTML email)
    >
    > The L prefix on a string literal makes it a wide-character string, and
    > wcslen() is simply a wide-character version of strlen(). (There is no
    > guarantee that "wide character" means "Unicode character", but let's
    > just assume that it does, for the moment).
    >
    > So, should n equal four or five? The answer would appear to depend on
    > whether or not the source file was saved in NFC or NFD format.
    >
    No, surely not. If the wcslen() function is fully Unicode conformant, it
    should give the same output whatever the canonically equivalent form of
    its input. That more or less implies that it should normalise its input.
    (One can imagine a second parameter specifying whether NFC or NFD is
    required.) This makes the issue one not for the text editor but for the
    programming language or its string handling library.
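
    To make the point concrete, here is a rough sketch. The standard
    wcslen() does not normalise anything; it simply counts wchar_t units, so
    the two spellings really do give different answers. The helper
    count_base_chars() is hypothetical, not a library function, and it is
    not real normalisation: it only skips the Combining Diacritical Marks
    block (U+0300..U+036F), which happens to be enough for this one example.
    The escapes are used so that the result does not depend on how the
    source file was saved.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Precomposed (NFC) spelling: ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const wchar_t kCafeNfc[] = L"caf\u00e9";
// Decomposed (NFD) spelling: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const wchar_t kCafeNfd[] = L"cafe\u0301";

// Hypothetical helper (an assumption for illustration only): counts the wide
// characters that are NOT in the Combining Diacritical Marks block
// U+0300..U+036F. This is a toy stand-in for a normalisation-aware length,
// not a conformant implementation.
std::size_t count_base_chars(const wchar_t* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    for (; *s != L'\0'; ++s) {
        if (*s < 0x0300 || *s > 0x036F) {
            ++n;  // not a combining mark from that block, so count it
        }
    }
    return n;
}
```

    With these definitions, std::wcslen(kCafeNfc) is 4 and
    std::wcslen(kCafeNfd) is 5, while count_base_chars() gives 4 for both.
    A real library would need proper normalisation data rather than a single
    hard-coded range.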

    > There is more to consider than just how and whether a text editor
    > normalizes. If a text editor is capable of dealing with Unicode text,
    > perhaps it should also be able to explicitly DISPLAY the actual
    > composition form of every glyph. The question I posed in the previous
    > paragraph should ideally be obvious by sight - if you see four
    > characters, there are four characters; if you see five characters,
    > there are five characters. This implies that such a text editor should
    > display NFD text as separate glyphs for each character.
    >
    > On the other hand, such a text editor must also acknowledge that "é"
    > and "e + U+0301" are actually equivalent. The /intention/ of canonical
    > equivalence is that the glyphs should display the same - otherwise
    > we'd need precomposed versions of, well, everything. So in other
    > contexts, it should display them the same.
    >
    The Unicode standard does allow for special display modes in which the
    exact underlying string, including control characters, is made visible.
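
    As a minimal sketch of what such a mode might show, the hypothetical
    helper below (dump_code_points() is my own name, not part of any editor
    or library) writes out each wide character as U+XXXX. It assumes, as
    Jill did, that a wide character holds one Unicode character, which is
    true for the BMP characters used here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (an assumption for illustration only): renders every
// wide character of s as "U+XXXX", separated by single spaces, so that the
// underlying sequence is visible regardless of how the glyphs are drawn.
std::string dump_code_points(const wchar_t* s) {
    assert(s != nullptr);
    std::string out;
    char buf[24];
    for (std::size_t i = 0; s[i] != L'\0'; ++i) {
        // Widen to unsigned long so the format works whatever size wchar_t is.
        std::snprintf(buf, sizeof buf, "U+%04lX",
                      static_cast<unsigned long>(s[i]));
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

    For L"caf\u00e9" this gives "U+0063 U+0061 U+0066 U+00E9", and for
    L"cafe\u0301" it gives "U+0063 U+0061 U+0066 U+0065 U+0301", so the
    four-versus-five question can be answered by sight.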

    > Yuk. That's a lot to think about for anyone considering writing a
    > programmers' text editor with /serious/ Unicode support.
    > Jill
    >
    >
    Simply allow the text editor to save as either NFC or NFD, and let the
    programming language sort out the rest.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

