Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Dec 11 2003 - 20:16:59 EST


    Sure. "a" alone is a valid default grapheme cluster. Combining dieresis alone is
    also a perfectly valid default grapheme cluster. That makes two clusters when the
    strings are separate, but only one when they are concatenated (in the right
    order). This is similar (though not identical) to the case of words: "large
    fisher" contains two words; so does "man is", but when concatenated they form
    only three words ("large fisherman is").
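
    The counts above can be checked with a minimal sketch. It assumes that the
    JDK's java.text.BreakIterator character instance is a reasonable stand-in for
    default grapheme clusters (its exact rules vary between JDK versions); the
    class and method names are hypothetical and exist only for illustration.

```java
import java.text.BreakIterator;

// Hypothetical helper class, for illustration only.
class ClusterCount {

    // Counts the segments that the JDK character BreakIterator reports in text.
    // Assumption: these segments approximate default grapheme clusters.
    static int countClusters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String base = "a";
        String dieresis = "\u0308"; // U+0308 COMBINING DIAERESIS

        int separate = countClusters(base) + countClusters(dieresis);
        int joined = countClusters(base + dieresis);

        System.out.println("separate: " + separate);
        System.out.println("joined:   " + joined);
    }
}
```

    The point of the sketch is only that the count for the concatenation need not
    equal the sum of the counts for the parts.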

    Note that combining dieresis is *very* different from the case of surrogate
    code points. D800 has no sensible independent existence; combining dieresis
    certainly does.

    That's one reason why we in ICU / Java chose an iteration interface for entities
    like this. You can find a boundary, or you can find the next or previous one (or
    nth). There is no general guarantee that concatenation will preserve the
    number -- or relative placement -- of boundaries. That's because the determination of
    boundaries is context-dependent. The same goes for line boundaries; look at
    UAX #14.
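
    A sketch of what such an iteration interface looks like in practice, using the
    JDK's java.text.BreakIterator (isBoundary, following, preceding). The wrapper
    class and its method names are hypothetical, and the character instance is
    assumed to approximate default grapheme clusters.

```java
import java.text.BreakIterator;

// Hypothetical wrapper, for illustration only.
class BoundaryProbe {

    private static BreakIterator iteratorFor(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        return it;
    }

    // Is there a cluster boundary at this offset (in UTF-16 code units)?
    static boolean isClusterBoundary(String text, int offset) {
        return iteratorFor(text).isBoundary(offset);
    }

    // First boundary strictly after offset, or BreakIterator.DONE if none.
    static int nextBoundary(String text, int offset) {
        return iteratorFor(text).following(offset);
    }

    // Last boundary strictly before offset, or BreakIterator.DONE if none.
    static int previousBoundary(String text, int offset) {
        return iteratorFor(text).preceding(offset);
    }

    public static void main(String[] args) {
        String left = "xa";
        String right = "\u0308y"; // starts with U+0308 COMBINING DIAERESIS

        // The end of "xa" is a boundary when the string stands alone...
        System.out.println(isClusterBoundary(left, 2));
        // ...but the same offset is not a boundary once "\u0308y" follows.
        System.out.println(isClusterBoundary(left + right, 2));
        System.out.println(nextBoundary(left + right, 1));
        System.out.println(previousBoundary(left + right, 3));
    }
}
```

    Because every answer is computed against the full text, the same offset can
    be a boundary in one string and not in a longer string that contains it.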

    You could conceivably restrict your dream programming language to only
    'complete' default grapheme clusters, defined as those where the addition of
    previous characters would never change that boundary, but in practice I don't
    think your dream language would be particularly useful at, well, actual
    programming.

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄

    ----- Original Message -----
    From: "Peter Kirk" <peterkirk@qaya.org>
    To: "Mark Davis" <mark.davis@jtcsv.com>
    Cc: "Kenneth Whistler" <kenw@sybase.com>; <unicode@unicode.org>
    Sent: Thu, 2003 Dec 11 14:58
    Subject: Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

    > On 11/12/2003 10:16, Mark Davis wrote:
    >
    > >>Mark, don't patronise me. I'm not talking about levels of enlightenment.
    > >>I'm not talking about levels in the sense you just used when you
    > >>mentioned "higher-level issues". I'm talking about the well-known
    > >>concept of levels or layers of programming and of communication protocols.
    > >>
    > >>
    > >
    > >My apologies; I had intended a lighter tone, not patronization.
    > >
    > >
    > >
    > Apology accepted. I should have recognised the "enlightenment" of the tone.
    >
    > > ...
    > >
    > >One could, of course, design a programming language that always indexed and
    > >counted by some other entity, say, default grapheme clusters. Such a language
    > >would be unable to deal with pieces that didn't constitute a complete
    > >cluster, or would have to deal with issues such as that the number of entities
    > >in the concatenation of two strings is not the same as the sum of the
    > >numbers of entities in each of the strings, so indexing gets pretty
    > >tricky. I don't know of any programming language that has tried to do this, and
    > >I don't think it would be of particular value -- and it would be exceedingly slow.
    > >
    > >
    > This is I suppose what I was thinking of. I see the problem if partial
    > clusters are permitted, but they could be forbidden from this type. Is
    > there ever a case where a concatenation of n DGCs and m DGCs is not
    > equal to (n+m) DGCs? If so there is a small problem, but one which is
    > surmountable if it is made clear that concatenation does not always
    > imply addition of string length. I do think this would be a useful thing
    > to do, and Benjamin, who seems to agree, suggests that .NET does it at
    > least to some extent. I am sure that some tricks could be found to
    > simplify the indexing if necessary, e.g. using PUA or non-character code
    > points indexed into a special table to replace DGCs which cannot be
    > represented as a single character. (There are plenty of non-characters
    > available as you need to use UTF-32 here to avoid exactly the same
    > problems with surrogates.)
    >
    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
    >
    >
    >
    >


