Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 11 2003 - 17:58:39 EST

    On 11/12/2003 10:16, Mark Davis wrote:

    >>Mark, don't patronise me. I'm not talking about levels of enlightenment.
    >>I'm not talking about levels in the sense you just used when you
    >>mentioned "higher-level issues". I'm talking about the well-known
    >>concept of levels or layers of programming and of communication protocols.
    >>
    >>
    >
    >My apologies; I had intended a lighter tone, not patronization.
    >
    >
    >
    Apology accepted. I should have recognised the "enlightenment" of the tone.

    > ...
    >
    >One could, of course, design a programming language that always indexed and
    >counted by some other entity, say, default grapheme clusters. Such a language
    >would be unable to deal with pieces that didn't constitute a complete
    >cluster, or would have to deal with issues such as that the number of entities
    >in the concatenation of two strings is not the same as the sum of the
    >numbers of entities in each of the strings, so indexing gets pretty
    >tricky. I don't know of any programming language that has tried to do this, and
    >I don't think it would be of particular value -- and would be exceedingly slow.
    >
    >
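    The concatenation point in the quoted paragraph can be illustrated with a
    rough Python sketch. The function name clusters() and its rules are
    assumptions made for this illustration: it treats a cluster as a base
    character plus any following combining marks, and CR LF as one cluster.
    It is not a full implementation of the default grapheme cluster rules
    (Hangul jamo sequences, for instance, are ignored).

```python
import unicodedata


def clusters(text):
    """Split text into approximate default grapheme clusters.

    Simplified rules (assumptions of this sketch, not the full standard):
    - CR followed by LF forms one cluster;
    - a combining mark (general category M*) extends the previous cluster,
      unless that cluster ends in CR or LF.
    """
    out = []
    for ch in text:
        prev = out[-1] if out else ""
        joins_crlf = prev == "\r" and ch == "\n"
        is_mark = unicodedata.category(ch).startswith("M")
        extends = bool(prev) and prev[-1] not in "\r\n" and is_mark
        if joins_crlf or extends:
            out[-1] += ch      # grow the current cluster
        else:
            out.append(ch)     # start a new cluster
    return out


def count(text):
    """Number of approximate clusters in text."""
    return len(clusters(text))


# A partial cluster: a lone combining acute joins the preceding "e".
partial_sum = count("e") + count("\u0301")
partial_joined = count("e" + "\u0301")

# Two complete clusters that still merge: CR and LF.
crlf_sum = count("\r") + count("\n")
crlf_joined = count("\r" + "\n")
```

    Under these simplified rules both pairs count as 1 + 1 separately but as
    a single cluster once joined. The CR LF case shows that forbidding
    partial clusters does not by itself make counts additive.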
    This is, I suppose, what I was thinking of. I see the problem if partial
    clusters are permitted, but they could be forbidden from this type. Is
    there ever a case where a concatenation of n default grapheme clusters
    (DGCs) and m DGCs is not equal to (n+m) DGCs? If so there is a small
    problem, but one which is surmountable if it is made clear that
    concatenation does not always imply addition of string lengths. I do
    think this would be a useful thing to do, and Benjamin, who seems to
    agree, suggests that .NET does it at least to some extent. I am sure
    that some tricks could be found to simplify the indexing if necessary,
    e.g. using PUA or non-character code points, indexed into a special
    table, to replace DGCs which cannot be represented as a single
    character. (There are plenty of non-characters available, as you need
    to use UTF-32 here to avoid exactly the same problems with surrogates.)
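
    A minimal Python sketch of the table trick described above, under stated
    assumptions: the names ClusterTable, ClusterString and FIRST_TABLE_VALUE
    are hypothetical, values beyond U+10FFFF stand in for the "PUA or
    non-character code points", and the cluster splitter is a simplified
    approximation (base plus combining marks, CR LF as one unit) rather than
    the full default grapheme cluster rules.

```python
import unicodedata

# Assumption of this sketch: 32-bit units leave room above the Unicode code
# space, so values from 0x110000 upward are used as indexes into the table.
FIRST_TABLE_VALUE = 0x110000


def split_clusters(text):
    """Approximate cluster splitter: base + combining marks, CR LF as one."""
    out = []
    for ch in text:
        prev = out[-1] if out else ""
        joins_crlf = prev == "\r" and ch == "\n"
        is_mark = unicodedata.category(ch).startswith("M")
        if joins_crlf or (prev and prev[-1] not in "\r\n" and is_mark):
            out[-1] += ch
        else:
            out.append(ch)
    return out


class ClusterTable:
    """Special table mapping multi-code-point clusters to integer values."""

    def __init__(self):
        self.by_index = []   # table index -> cluster text
        self.by_text = {}    # cluster text -> table index

    def intern(self, cluster):
        """Return the unit value for a cluster, adding it if it is new."""
        if cluster not in self.by_text:
            self.by_text[cluster] = len(self.by_index)
            self.by_index.append(cluster)
        return FIRST_TABLE_VALUE + self.by_text[cluster]

    def lookup(self, unit):
        """Return the cluster text for a unit value produced by intern()."""
        return self.by_index[unit - FIRST_TABLE_VALUE]


class ClusterString:
    """String stored as one fixed-width integer per approximate cluster."""

    def __init__(self, text, table):
        self.table = table
        self.units = [
            ord(c) if len(c) == 1 else table.intern(c)
            for c in split_clusters(text)
        ]

    def __len__(self):
        return len(self.units)

    def __getitem__(self, i):
        unit = self.units[i]
        if unit < FIRST_TABLE_VALUE:
            return chr(unit)              # ordinary single code point
        return self.table.lookup(unit)    # cluster held in the table


table = ClusterTable()
sample = ClusterString("e\u0301a\r\nq\u0323\u0307", table)
```

    Here len(sample) counts clusters and sample[i] is a plain list lookup,
    so indexing needs no scan of the text. The cost is the table itself,
    which must be shared by every string that is compared or concatenated,
    and concatenation would still have to re-examine the boundary between
    the two operands.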

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

