Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Dec 12 2003 - 07:49:52 EST

    On 12/12/2003 04:13, jon@hackcraft.net wrote:

    >>Thank you. I was supposing that isolated combining marks were considered
    >>in some way defective,
    >
    ><blockquote cite="http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf">
    >D17a: Defective combining character sequence: A combining character sequence
    >that does not start with a base character.
    >
    >[Explanatory Note] Defective combining character sequences occur when a
    >sequence of combining characters appears at the start of a string or
    >follows a control or format character. Such sequences are defective from
    >the point of view of handling of combining marks, but are not ill-formed.
    ></blockquote>
    >
    >"in some way defective" is actually a good way to put it, methinks: they
    >aren't illegal, and in some cases you can do things with them that are both
    >reasonable and useful, but in other situations they may be problematic.

    Indeed. But I was thinking more in terms of grapheme clusters, as
    defined in UAX #29. Is a defective combining sequence a grapheme
    cluster? Probably not, according to the definition "what the user thinks
    of as a character or basic unit of the language". But the boundary rule
    "Break at the start and end of text" implies that the algorithm will
    count a defective combining sequence at the start of text (and possibly
    what follows) as a default grapheme cluster. So it is "in some way
    defective" both as a grapheme cluster and as a character sequence.
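
To make the point concrete, here is a minimal Python sketch. It is not the UAX #29 algorithm: it only approximates default grapheme clusters by attaching combining marks (general category M) to the preceding character, and it assumes that any Cc or Cf character ends a cluster, which is a simplification. The function names are hypothetical and chosen for illustration only.

```python
import unicodedata


def is_combining_mark(ch: str) -> bool:
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")


def is_control_or_format(ch: str) -> bool:
    # Assumption: treat every Cc/Cf character as a cluster boundary.
    return unicodedata.category(ch) in ("Cc", "Cf")


def is_defective_start(text: str) -> bool:
    """True if the text begins with a combining mark (no base character)."""
    return bool(text) and is_combining_mark(text[0])


def simple_clusters(text: str) -> list[str]:
    """Approximate clusters: a mark attaches to the preceding character,
    unless there is none or it is a control/format character."""
    clusters: list[str] = []
    for ch in text:
        attach = (
            bool(clusters)
            and is_combining_mark(ch)
            and not is_control_or_format(clusters[-1][-1])
        )
        if attach:
            clusters[-1] += ch
        else:
            # A leading or post-control mark has no base, so it ends up
            # as a cluster of its own: a "defective" one.
            clusters.append(ch)
    return clusters
```

Under these assumptions, `simple_clusters("\u0301ab")` yields three clusters, the first of which is the lone U+0301, while `simple_clusters("e\u0301x")` yields two. The lone mark is still counted as a unit even though no user would think of it as a character.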

    I note the following in UAX #29, which backs up my comments on functions
    to count characters:

    > In those rare circumstances where end-users need character counts, the
    > counts should correspond to the grapheme cluster boundaries.

    This implies that end users should not require counts of code units or
    code points.
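
The difference between the three kinds of count can be sketched in Python as follows. The grapheme cluster count here is only an approximation (every character that is not a combining mark starts a new cluster), not a full UAX #29 implementation, and `count_units` is a hypothetical name used for illustration.

```python
import unicodedata


def count_units(text: str) -> dict[str, int]:
    """Count one string in code units, code points and approximate clusters."""
    # Approximation: a new cluster starts at the first character and at
    # every character that is not a combining mark (category M*).
    clusters = sum(
        1
        for i, ch in enumerate(text)
        if i == 0 or not unicodedata.category(ch).startswith("M")
    )
    return {
        "utf8_code_units": len(text.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; "-le" avoids adding a BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # A Python 3 str is a sequence of code points.
        "code_points": len(text),
        "approx_grapheme_clusters": clusters,
    }
```

For the string `"e\u0301\U0001D11E"` (e, combining acute accent, musical G clef) this gives 7 UTF-8 code units, 4 UTF-16 code units, 3 code points and 2 approximate grapheme clusters. Only the last figure matches what an end user would call two characters.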

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

