RE: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 11 2003 - 08:43:28 EST


    > Thanks for the clarification. We are again talking at different levels.
    > I am still looking from the point of view of an application programmer
    > interested in a string as an abstract entity (an object or an abstract
    > data type) with a meaning or interpretation, but with no interest in the
    > exact encoding. You are looking at this at a lower level, either of a
    > systems programmer or of an application programmer who is forced to get
    > into this lower level stuff because of inadequate system support at the
    > more abstract level.

    Please stop this thread, Peter. Kenneth was clear enough when he pointed
    out that the "in context" meaning of the problematic sentence you quoted
    from the standard is sufficient to explain what is meant by
    "interpretation".

    For me this relates to the interpretation of default grapheme clusters,
    which is where canonical equivalence applies.

    If you go down to the abstract character level, there's no such
    "equivalence" rule, as normalization operates on default grapheme clusters
    and not on the lower levels (abstract characters or code points, and even
    less so the code units of an encoding form, or the stream bytes of an
    encoding scheme or of a transport encoding syntax).

    So if an application offers an interface that claims to operate on grapheme
    clusters, the conformance rule for canonical equivalence applies, and
    distinct but canonically equivalent encoded sequences for any string must
    be treated the same.
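
A minimal sketch of what "treated the same" can mean in practice, assuming
Python and its standard unicodedata module. The helper name
canonically_equal is only illustrative; it compares two strings after
bringing both to the same normalization form (NFD here, NFC would serve
equally well for this purpose).

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Return True if a and b are canonically equivalent.

    Both strings are normalized to NFD before the comparison, so different
    encoded sequences for the same text compare equal.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# A plain code point comparison sees two different strings...
plain_equal = precomposed == decomposed
# ...while the normalizing comparison treats them the same.
canon_equal = canonically_equal(precomposed, decomposed)
```

Here plain_equal is False and canon_equal is True, which is the distinction
between an interface working on code points and one honouring canonical
equivalence.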

    If you look at XML, for example, there's no support for grapheme clusters,
    as XML operates at the abstract character level (or code points), meaning
    that treating all canonically equivalent strings the same way is not
    required of a conforming XML processor.
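
As an illustration only, assuming Python's standard xml.etree.ElementTree
parser as the XML processor: it hands back the character data exactly as
encoded, so two canonically equivalent documents yield different strings
unless the application normalizes them itself.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Two documents whose text content is canonically equivalent but encoded
# with different code point sequences.
doc_precomposed = "<name>caf\u00e9</name>"
doc_decomposed = "<name>cafe\u0301</name>"

text_a = ET.fromstring(doc_precomposed).text
text_b = ET.fromstring(doc_decomposed).text

# The parser works on code points and does not normalize.
same_as_parsed = text_a == text_b

# Normalization has to be applied by the layer above, if it wants it.
same_after_nfc = (
    unicodedata.normalize("NFC", text_a) == unicodedata.normalize("NFC", text_b)
)
```

With this parser same_as_parsed is False and same_after_nfc is True.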

    But for a text renderer, or for a UCA collation algorithm, supporting the
    higher-level grapheme clusters is required, and this is where canonical
    equivalences are the most meaningful and in fact required for Unicode
    conformance.

    This may also be required for security-related texts (such as domain names
    in IDNA), where distinct but canonically equivalent strings must be given
    exactly the same meaning and resolve identically, with the same
    "interpretation", as these items are intended to be exposed to users who
    will need to reproduce them the way they usually read or type them.
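
A small sketch of the domain name case, assuming Python's built-in "idna"
codec (which implements the 2003 IDNA rules, including the Nameprep
normalization step). The host name used here is a made-up example, not a
real registration.

```python
# Two spellings of the same hypothetical host name: one with a precomposed
# e-acute, one with "e" followed by a combining acute accent.
host_precomposed = "caf\u00e9.example"
host_decomposed = "cafe\u0301.example"

# The codec normalizes each label before converting it to its ASCII form,
# so both spellings should end up as the same sequence of bytes.
ascii_a = host_precomposed.encode("idna")
ascii_b = host_decomposed.encode("idna")

resolve_identically = ascii_a == ascii_b
```

The two Unicode strings differ as code point sequences, yet
resolve_identically is True, so a user typing either form reaches the same
name.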

    The meaning of "interpretation" is then dependent on the application using
    Unicode texts. But it is directly related to the level at which the
    application operates in its claimed public interface: grapheme clusters,
    abstract characters/code points, code units, or stream bytes.
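
To make the lower levels concrete, here is a rough sketch in Python that
measures the same text at three of them. The function name sizes_by_level
is an illustrative choice, and the grapheme cluster level is left out
because the Python standard library has no segmentation routine for it.

```python
def sizes_by_level(s: str) -> dict:
    """Measure a string at the code point, code unit and byte levels."""
    return {
        # Python strings are sequences of code points.
        "code_points": len(s),
        # UTF-16 encoding form: each code unit is two bytes.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # UTF-8 encoding scheme: count of serialized bytes.
        "utf8_bytes": len(s.encode("utf-8")),
    }

precomposed = sizes_by_level("\u00e9")    # e-acute as one code point
decomposed = sizes_by_level("e\u0301")    # e + combining acute accent
g_clef = sizes_by_level("\U0001D11E")     # a character outside the BMP
```

The two canonically equivalent spellings of e-acute give different counts at
every one of these levels (1/1/2 against 2/2/3), and the non-BMP character
shows code points and UTF-16 code units diverging (1 against 2), which is
why an interface has to state the level it operates on.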






    This archive was generated by hypermail 2.1.5 : Thu Dec 11 2003 - 09:33:34 EST