Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Dec 12 2003 - 06:28:06 EST

Next message: Peter Kirk: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"

Previous message: Philippe Verdy: "RE: character map in Microsoft Word"
In reply to: Mark Davis: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Next in thread: jon@hackcraft.net: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Reply: jon@hackcraft.net: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"

On 11/12/2003 17:16, Mark Davis wrote:

>Sure. "a" alone is a valid default grapheme cluster. Combining dieresis alone is
>a perfectly valid default grapheme cluster. 2 if separate, but one if
>concatenated (in the right order). This is similar (though not completely) to
>the case of words: "large fisher" contains two words; so does "man is", but when
>concatenated they only form 3 words.
>
>

Thank you. I had supposed that isolated combining marks were considered
in some way defective, much as (at least according to some people) XML
treats them, and so either illegal or deprecated.
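
To make the counting concrete, here is a minimal sketch. It assumes that
the JDK's java.text.BreakIterator character instance is an acceptable
stand-in for default grapheme cluster boundaries (it approximates them and
may differ between JDK versions); the class and method names are
hypothetical, chosen only for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class ClusterConcat {
    // Count the segments between successive boundaries reported by the
    // JDK's character BreakIterator (an approximation of grapheme clusters).
    static int clusters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int n = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String a = "a";
        String dieresis = "\u0308"; // COMBINING DIAERESIS on its own

        // Counted separately, each string is one cluster.
        System.out.println(clusters(a) + clusters(dieresis));
        // Concatenated in this order, the mark attaches to the base.
        System.out.println(clusters(a + dieresis));
    }
}
```

The isolated combining mark is counted as a cluster in its own right, and
the total drops when the two strings are joined.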

>Note that combining dieresis is *very* much different than the case of surrogate
>code points. D800 has no sensible independent existence: combining dieresis
>certainly does.
>
>That's one reason why we in ICU / Java chose an iteration interface for entities
>like this. You can find a boundary, or you can find the next or previous one (or
>nth). No general guarantees that concatenation will preserve the number -- or
>relative placement -- of boundaries. That's because the determination of
>boundaries is context-dependent. The same goes for line boundaries; look at
>TR#11.
>
>You could conceivably restrict your dream programming language to only
>'complete' default grapheme clusters, defined as those where the addition of
>previous characters would never change that boundary, but in practice I don't
>think your dream language would be particularly useful at, well, actual
>programming.
>
>
Well, I can see that what you have done in ICU/Java is actually more
useful, and that is the kind of interface I had in mind.
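
For anyone who has not used that style of interface, a rough sketch of the
three operations Mark lists (test a boundary, find the next, find the
previous), again using java.text.BreakIterator from the JDK rather than
ICU itself. The wrapper class and its method names are my own invention
for this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BoundaryNav {
    private static BreakIterator over(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        return it;
    }

    // Is there a cluster boundary at this code unit offset?
    static boolean isBoundary(String s, int offset) {
        return over(s).isBoundary(offset);
    }

    // First boundary strictly after the offset, or BreakIterator.DONE.
    static int nextBoundary(String s, int offset) {
        return over(s).following(offset);
    }

    // Last boundary strictly before the offset, or BreakIterator.DONE.
    static int previousBoundary(String s, int offset) {
        return over(s).preceding(offset);
    }

    public static void main(String[] args) {
        String s = "xa\u0308y"; // x, then a + combining diaeresis, then y

        System.out.println(isBoundary(s, 1));       // between x and a
        System.out.println(isBoundary(s, 2));       // between a and its mark
        System.out.println(nextBoundary(s, 1));     // end of the a + mark cluster
        System.out.println(previousBoundary(s, 3)); // start of that cluster
    }
}
```

Nothing in this interface promises anything about what happens to the
boundaries when two strings are joined; each query is answered against the
text as it currently stands.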

To return to where we started: I would suggest that if people are
working with Unicode primarily at that level (that is, without needing
to know the implementation details), then they should not be asking
questions like "is this string normalised?", and when they ask "how many
characters are in this string?" they should expect an answer in terms of
default grapheme clusters rather than code units or code points. They
shouldn't hack into lower levels unless they need to. And this is
effectively the level at which I was seeing conformance clauses C7-C9
operating with regard to "interpretation".
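
The three possible answers to "how many characters?" can be put side by
side in a short sketch. As before, this assumes the JDK's
java.text.BreakIterator is close enough to default grapheme clusters for
the purpose; the class name and helper are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class HowManyCharacters {
    // Approximate count of default grapheme clusters.
    static int graphemeClusters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int n = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "a" + COMBINING DIAERESIS, followed by U+1D538, which needs a
        // surrogate pair in UTF-16.
        String s = "a\u0308" + new String(Character.toChars(0x1D538));

        System.out.println("code units:        " + s.length());
        System.out.println("code points:       " + s.codePointCount(0, s.length()));
        System.out.println("grapheme clusters: " + graphemeClusters(s));
    }
}
```

The three numbers differ for this string, and only the last one
corresponds to what a user working at the higher level would call
characters.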

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

This archive was generated by hypermail 2.1.5 : Fri Dec 12 2003 - 07:04:40 EST