Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 11 2003 - 11:08:53 EST

Next message: Peter Kirk: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"

Previous message: Tim Greenwood: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
In reply to: Philippe Verdy: "RE: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Next in thread: Mark Davis: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"

On 11/12/2003 05:43, Philippe Verdy wrote:

>>Thanks for the clarification. We are again talking at different levels.
>>I am still looking from the point of view of an application programmer
>>interested in a string as an abstract entity (an object or an abstract
>>data type) with a meaning or interpretation, but with no interest in the
>>exact encoding. You are looking at this at a lower level, either of a
>>systems programmer or of an application programmer who is forced to get
>>into this lower level stuff because of inadequate system support at the
>>more abstract level.
>>
>>
>
>Please stop this thread Peter, ...
>
Why should I stop this thread when you want to continue it? Will you
stop some of the very long, confusing and irrelevant threads you have
been involved in just because I ask you to?

>... Kenneth has been clear enough when pointing
>that the "in context" meaning of the problematic sentence you quoted from
>the standard was in fact clear enough to explain what is meant by
>"interpretation".
>
>
Reading the sentence in context did help, but there is still some lack
of clarity about what "interpretation" means.

>For me this relates to the interpretation of default grapheme clusters,
>which is where canonical equivalence applies.
>
>
I think we more or less agree here. But maybe not completely. Are there
any ligatures in Unicode which are canonically equivalent to more than
one DGC? I thought that ligatures like U+01C4..U+01CC and
U+FB00..U+FB06 had canonical decompositions, but I see that they have
only compatibility decompositions. If there are none, it would appear
that canonical equivalence applies only within the DGC, which does
simplify things. So, I think we can reasonably say that I am working on
the DGC level and see a string as a sequence of DGCs.
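
As a quick check of the ligature point above, here is a minimal sketch
using Python's standard unicodedata module. The helper name
decomposition_kind is my own invention for illustration, and the result
depends on the Unicode Character Database version bundled with the
Python interpreter, so treat it as a sanity check rather than a
statement about the standard.

```python
import unicodedata

def decomposition_kind(ch):
    """Classify one character's decomposition mapping (hypothetical helper)."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as <compat>; canonical ones do not.
    return "compatibility" if mapping.startswith("<") else "canonical"

# The two ranges mentioned in the message: U+01C4..U+01CC and U+FB00..U+FB06.
ligatures = [chr(cp) for cp in range(0x01C4, 0x01CD)]
ligatures += [chr(cp) for cp in range(0xFB00, 0xFB07)]
kinds = {decomposition_kind(ch) for ch in ligatures}
print(kinds)

fi = "\uFB01"  # LATIN SMALL LIGATURE FI
# Canonical decomposition leaves the ligature alone ...
print(unicodedata.normalize("NFD", fi) == fi)
# ... while compatibility decomposition splits it into two letters.
print(unicodedata.normalize("NFKD", fi))
```

If every character in those ranges reports "compatibility", none of
them is canonically equivalent to a sequence of more than one DGC.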

>...
>So if an application offers an interface that claims to operate on grapheme
>clusters, the conformance rule for canonical equivalence applies, and
>distinct but canonically equivalent encoding forms of any string must be
>treated the same.
>
>If you look at XML for example, there's no support for grapheme clusters as
>XML operates at the abstract character level (or code points), meaning that
>treating the same way all canonical equivalent strings is not required in a
>conforming XML processor.
>
>
That looks to me like a problem with XML, or at least an inconsistency.
The current situation has the advantage that XML can offer individual
styling etc. of characters within a DGC, which is where this thread
started. But it has the disadvantage that XML gets involved with a
lower level of Unicode than it needs to, which simply complicates and
confuses things.
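
To make the difference between the two levels concrete, here is a small
sketch in Python. It does not model an XML processor; it only contrasts
a plain code point comparison with a comparison after canonical
normalisation. The function name canonically_equal is an assumption of
mine, not an API from any XML or Unicode library.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Two orderings of a dot below (U+0323) and an acute (U+0301) on "a".
marks_a = "a\u0323\u0301"
marks_b = "a\u0301\u0323"

def canonically_equal(a, b):
    """Compare two strings after canonical decomposition and reordering."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Code point level: the strings differ.
print(precomposed == decomposed, marks_a == marks_b)
# DGC level, via normalisation: the strings are the same.
print(canonically_equal(precomposed, decomposed),
      canonically_equal(marks_a, marks_b))
```

A process working purely on code points sees four different strings
here; a process that respects canonical equivalence sees two.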

>But for a text renderer, or for a UCA collation algorithm, supporting the
>high-level grapheme clusters is required, and this is where canonical
>equivalences are the most meaningful and in fact required for Unicode
>conformance.
>
>
Agreed - except that technically this is not required for conformance in
a renderer but only strongly recommended.

>This may also be required for security-related texts (such as domain names
>in IDNA), where distinct but canonically equivalent strings must be given
>exactly the same meaning and resolve identically with the same
>"interpretation", as these items are intended to be exposed to users that
>will need to reproduce them the way they usually read or type them.
>
>
Understood. This is one special case of "interpretation". But the
security issue is deeper than this, because there are many possible ways
in which strings will be rendered essentially identically without being
canonically equivalent - sometimes but not always with compatibility
equivalence.
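
A short illustration of this point, as a sketch in Python. The sample
strings are my own and are not taken from IDNA or any registry; the
helper same_under is likewise hypothetical.

```python
import unicodedata

def same_under(form, a, b):
    """True if a and b are identical after the given normalisation form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

latin = "paypal"
mixed = "p\u0430yp\u0430l"   # U+0430 CYRILLIC SMALL LETTER A replaces Latin "a"

ligature = "\uFB01le"        # starts with U+FB01 LATIN SMALL LIGATURE FI
plain = "file"

# Look-alikes across scripts: neither canonically nor compatibility equivalent.
print(same_under("NFC", latin, mixed), same_under("NFKC", latin, mixed))
# Ligature versus letters: compatibility equivalent only.
print(same_under("NFC", ligature, plain), same_under("NFKC", ligature, plain))
```

So normalisation of either kind catches the second pair at best, and
does nothing for the first.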

>The meaning of "interpretation" is then dependent on the application using
>Unicode texts. But it is directly related to the level at which the
>application operates on its claimed public interface: grapheme clusters,
>abstract characters/code points, code units, stream bytes.
>
>
>
Agreed. And this is where there seems to be some confusion. Perhaps it
is better for me to say that I wish to operate on the level of grapheme
clusters; and that at that level it is meaningless to ask whether a
grapheme cluster is normalised because that is hidden at a lower level.
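
As a rough sketch of what I mean by operating at that level, here is a
hypothetical Python string type whose elements are grapheme clusters.
Everything in it is an assumption for illustration: the class name
ClusterString is invented, and the segmentation is a crude
approximation (a base character plus any following characters of
general category M), not the default grapheme cluster rules from the
standard. It ignores Hangul jamo, CR LF and similar cases.

```python
import unicodedata

class ClusterString:
    """Hypothetical string whose elements are approximate grapheme clusters."""

    def __init__(self, text):
        clusters = []
        # Decompose once so that the internal form is the same for all
        # canonically equivalent inputs; callers never see this form.
        for ch in unicodedata.normalize("NFD", text):
            if clusters and unicodedata.category(ch).startswith("M"):
                clusters[-1] += ch      # attach the mark to the current cluster
            else:
                clusters.append(ch)     # start a new cluster
        self._clusters = tuple(clusters)

    def __len__(self):
        return len(self._clusters)

    def __getitem__(self, index):
        # Integer indexes only; each cluster is handed back in NFC.
        return unicodedata.normalize("NFC", self._clusters[index])

    def __eq__(self, other):
        if not isinstance(other, ClusterString):
            return NotImplemented
        return self._clusters == other._clusters

    def __hash__(self):
        return hash(self._clusters)

a = ClusterString("caf\u00E9")    # precomposed e-acute
b = ClusterString("cafe\u0301")   # e followed by a combining acute
print(a == b, len(a), len(b))
```

The interface offers no way to ask whether a ClusterString is
normalised; that question exists only in the private representation.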

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

This archive was generated by hypermail 2.1.5 : Thu Dec 11 2003 - 11:56:29 EST