Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Dec 10 2003 - 16:36:16 EST

    Peter Kirk averred:

    > Agreed. C9 clearly specifies that a process cannot assume that another
    > process will give a correct answer to the question "is this string
    > normalised?", because that is to "assume that another process will make
    > a distinction between two different, but canonical-equivalent character
    > sequences."

    This is, however, nonsense.

    Once again, people are falling afoul of the subtle distinctions
    that the Unicode conformance clauses are attempting to make.

    It is perfectly conformant with the Unicode Standard to assert
    that <U+00E9> "é" and <U+0065, U+0301> "é" are different
    Unicode strings. They *are* different Unicode strings. They
    contain different encoded characters, and they have different
    lengths. And any Unicode-conformant process that treated the
    second string as if it contained only one code point and only
    one encoded character would be a) stupid, and b)
    non-conformant. A Unicode process can not only assume that
    another Unicode-conformant process can make this distinction --
    it should *expect* it to, or it will run into interoperability
    problems.
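
    A minimal sketch in Python makes the point (the variable names
    are mine, and nothing here depends on any particular library):

        # Two different Unicode strings for the same abstract character.
        precomposed = "\u00E9"        # <U+00E9>, one code point
        decomposed  = "\u0065\u0301"  # <U+0065, U+0301>, two code points

        print(len(precomposed))           # 1
        print(len(decomposed))            # 2
        print(precomposed == decomposed)  # False: different code point sequences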

    What canonical equivalence is about is making non-distinctions
    in the *interpretation* of equivalent sequences. No Unicode-
    conformant process should assume that another process will
    systematically maintain a meaningful difference in
    interpretation between <U+00E9> "é" and <U+0065, U+0301> "é" --
    they both represent the *same* abstract character, namely
    an e-acute. And because of the conformance requirements
    in the Unicode Standard, I am not allowed to call some
    other process wrong if it claims to be handing me an "e-acute"
    and delivers <U+0065, U+0301> when I was expecting to
    see just <U+00E9>. The whole point of normalization is
    to make life for implementations worth living in such a
    world of disappointed expectations. For no matter what
    those other processes hand me, I can then guarantee that
    I can turn it into a canonically equivalent form that I
    prefer to deal with and still be guaranteed that I am dealing
    with the same *interpretation* of the text.
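
    As a Python sketch -- unicodedata.normalize is the standard
    library's normalization entry point, used here purely for
    illustration -- that guarantee looks like this:

        import unicodedata

        a = "\u00E9"        # <U+00E9>
        b = "\u0065\u0301"  # <U+0065, U+0301>

        # Whatever form the other process hands me, I can normalize it
        # to the canonical form I prefer and still be dealing with the
        # same interpretation of the text.
        print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
        print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True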

    So now when Peter Kirk said:

    > a process cannot assume that another
    > process will give a correct answer to the question "is this string
    > normalised?"

    this is just wrong. If a Unicode-conformant process purports
    to answer that question, the check is perfectly feasible *and*
    conformantly implementable. And another process (or programmer) can assume
    that such a process will give the right answer. (Of course,
    there could always be bugs in the implementation, but I
    don't think we need to argue about the meaning of "assume"
    in that sense, as well.)
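
    (For illustration only: Python 3.8 and later happen to expose
    exactly such a query in the standard library. This is one
    conformant implementation among many, not the API I describe
    below:)

        import unicodedata

        # The question "is this string normalized?" has a well-defined,
        # correct answer for any Unicode string.
        print(unicodedata.is_normalized("NFC", "\u00E9"))        # True
        print(unicodedata.is_normalized("NFC", "\u0065\u0301"))  # False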

    *I* have written such an API, in a library that I maintain.
    And I assert here that that API is fully Unicode-conformant.

    The ICU engineers have written such an API in ICU, and they
    assert that their API is fully Unicode-conformant.

    Not only have we written such APIs, but some of us have
    also been responsible for writing the relevant conformance
    clauses in the standard itself, to which we are claiming
    conformance.

    There may be further room for clarification of intent here --
    clearly some people are still confused about what canonical
    equivalence is all about, and what constraints it does or
    does not place on conforming Unicode processes. But it
    doesn't help on this general list to keep turning that
    confusion into baldly stated (but incorrect) assertions about what
    conformant Unicode processes cannot do.

    --Ken


