Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Dec 10 2003 - 19:06:17 EST

Next message: Christopher John Fynn: "Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)"

Previous message: Philippe Verdy: "RE: Coloured diacritics"
In reply to: Kenneth Whistler: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Next in thread: Doug Ewell: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"

On 10/12/2003 13:36, Kenneth Whistler wrote:

>Peter Kirk averred:
>
>
>
>>Agreed. C9 clearly specifies that a process cannot assume that another
>>process will give a correct answer to the question "is this string
>>normalised?", because that is to "assume that another process will make
>>a distinction between two different, but canonical-equivalent character
>>sequences."
>>
>>
>
>This is, however, nonsense.
>
>Once again, people are falling afoul of the subtle distinctions
>that the Unicode conformance clauses are attempting to make.
>
>
In that case the distinctions are too subtle and need to be clarified.
C9 states that "no process can assume that another process will make a
distinction between two different, but canonical-equivalent character
sequences." If that in fact should be "no process can assume that
another process will *give different interpretations to* two different,
but canonical-equivalent character sequences", then that is what should
be written. Even then, the word "interpretation" needs to be clearly
defined; see below.

>It is perfectly conformant with the Unicode Standard to assert
>that <U+00E9> "é" and <U+0065, U+0301> "é" are different
>Unicode strings. They *are* different Unicode strings. They
>contain different encoded characters, and they have different
>lengths. ...
>
But they are "two different, but canonical-equivalent character
sequences", and as such "no process can assume that another process will
make a distinction between" them. C9 does not say that certain
distinctions may be assumed and others may not.
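
To make the two readings concrete, here is a minimal sketch in Python using the standard `unicodedata` module. The language and variable names are chosen only for illustration; nothing in this thread prescribes them. It shows that the two sequences differ as code point sequences and in length, yet become identical once both are brought to the same normalisation form.

```python
import unicodedata

composed = "\u00e9"          # <U+00E9>
decomposed = "\u0065\u0301"  # <U+0065, U+0301>

# As sequences of code points, the two strings are distinct.
strings_differ = composed != decomposed
lengths = (len(composed), len(decomposed))

# Normalising both to the same form (NFC or NFD) removes the difference;
# this is the operational meaning of canonical equivalence.
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
equal_nfd = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

print(strings_differ, lengths, equal_nfc, equal_nfd)
```

Whether a conformant process may rely on another process seeing the first kind of difference is exactly the point in dispute; the sketch only shows that both views of the data exist.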

>... And any Unicode-conformant process that treated the
>second string as if it had only one code unit and only
>one encoded character in it would be a) stupid, and b)
>non-conformant. A Unicode process can not only assume that
>another Unicode-conformant process can make this distinction --
>it should *expect* it to or will run into interoperability
>problems.
>
>
Well, this goes entirely against how I had read and understood the
conformance clauses. The problem is, what does "interpretation" mean?
See below.

>
>What canonical equivalence is about is making non-distinctions
>in the *interpretation* of equivalent sequences. No Unicode-
>conformant process should assume that another process will
>systematically distinguish a meaningful interpretation
>difference between <U+00E9> "é" and <U+0065, U+0301> "é" --
>they both represent the *same* abstract character, namely
>an e-acute. And because of the conformance requirements
>in the Unicode Standard, I am not allowed to call some
>other process wrong if it claims to be handing me an "e-acute"
>and delivers <U+0065, U+0301> when I was expecting to
>see just <U+00E9>. ...
>
Well, the question here hangs on the meaning of "interpretation". I
understood "interpretation" to include such matters as determining the
number of characters in a string (although I carefully distinguished
that from determining the number of memory units required to store it,
which depends also on the encoding form and is at a quite different
level). I would understand a different character count to be "a
meaningful interpretation difference". As for the question "is this
string normalised?", at the interpretation level I have in mind, that is
in fact a meaningless question, because normalisation is, or should be,
hidden at a lower level.
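
The different levels of counting mentioned above can be sketched as follows. This is an illustrative Python fragment; `describe` is a hypothetical helper name, not part of any library. It reports the number of code points, the number of code units in three encoding forms, and the number of code points after normalisation to NFC.

```python
import unicodedata

def describe(s):
    """Hypothetical helper: count a string at several levels."""
    return {
        # Number of encoded characters (code points) as stored.
        "code_points": len(s),
        # Code units needed in each encoding form.
        "utf8_units": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "utf32_units": len(s.encode("utf-32-le")) // 4,
        # Code points after normalising to NFC.
        "nfc_code_points": len(unicodedata.normalize("NFC", s)),
    }

composed = describe("\u00e9")          # <U+00E9>
decomposed = describe("\u0065\u0301")  # <U+0065, U+0301>

print(composed)
print(decomposed)
```

The stored counts differ between the two sequences at every level, while the count taken after normalisation is the same for both. Which of these counts belongs to "interpretation" is the question at issue.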

But it seems that you are viewing the whole thing from a different level
from me. I am looking on as a user or an application programmer. You are
looking at Unicode internally, as a systems programmer. At that lower
level, yes, of course normalisation forms have to be distinguished
because that is the level at which normalisation is carried out.

>... The whole point of normalization is
>to make life for implementations worth living in such a
>world of disappointed expectations. For no matter what
>those other processes hand me, I can then guarantee that
>I can turn it into a canonically equivalent form that I
>prefer to deal with and still be guaranteed that I am dealing
>with the same *interpretation* of the text.
>
>
... as long as the other process hasn't asked you to tell it whether
what it has just handed to you is normalised or not.
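
A small sketch of that situation, again in Python and under stated assumptions: `is_normalised` and `accept` are hypothetical names invented for this illustration, and the check is implemented simply by comparing a string with its own normalised form. It is not meant to represent the API of ICU or of any other library mentioned in this thread.

```python
import unicodedata

def is_normalised(s, form="NFC"):
    """Hypothetical helper: True if s is already in the given form."""
    return unicodedata.normalize(form, s) == s

def accept(s, form="NFC"):
    """Report whether the input was normalised, then return the
    canonically equivalent string in the receiver's preferred form."""
    return is_normalised(s, form), unicodedata.normalize(form, s)

# Two canonically equivalent inputs.
from_composed = accept("\u00e9")
from_decomposed = accept("\u0065\u0301")

print(from_composed, from_decomposed)
```

Both calls return the same normalised text, but the first element of each result differs: the answer to "is this string normalised?" necessarily distinguishes two canonically equivalent sequences.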

>So now when Peter Kirk said:
>
>
>
>>a process cannot assume that another
>>process will give a correct answer to the question "is this string
>>normalised?"
>>
>>
>
>this is just wrong. If a Unicode-conformant process purports
>to do so, it is perfectly feasible *and* conformantly
>implementable. And another process (or programmer) can assume
>that such a process will give the right answer. (Of course,
>there could always be bugs in the implementation, but I
>don't think we need to argue about the meaning of "assume"
>in that sense, as well.)
>
>*I* have written such an API, in a library that I maintain.
>And I assert here that that API is fully Unicode-conformant.
>
>The ICU engineers have written such an API in ICU, and they
>assert that their API is fully Unicode-conformant.
>
>Not only have we written such API's, but some of us have
>also been responsible for writing the relevant conformance
>clauses in the standard itself, to which we are claiming
>conformance.
>
>

Well, there is an interesting philosophical question here. With a normal
literary text, the interpretation of it intended by the author is
generally considered to be definitive. Humpty Dumpty was right when
talking about what he had written. But that is not true of laws, and I
suppose that it is similarly not true of standards. There is assumed to
be some objectivity to the language in which they are written. The
implication is that your assertion that what you have written is
conformant cannot be trusted a priori but must be tested against the
text of the standard as written and agreed. In principle any dispute
might have to be settled by a judge, and on the basis only of what is
written, not of what you claim was intended. While I certainly don't
intend to take this to court, I think I would have a reasonable case if
I did!

Of course it is possible for those conformance clauses to be rewritten
(they aren't fixed by the stability policy, are they?). That is probably
what is necessary. Such a rewrite would require a change to the sentence
"no process can assume that another process will make a distinction
between two different, but canonical-equivalent character sequences" and
a proper definition of "interpretation".

>There may be further room for clarification of intent here --
>clearly some people are still confused about what canonical
>equivalence is all about, and what constraints it does or
>does not place on conforming Unicode processes. But it
>doesn't help on this general list to keep turning that
>confusion into baldly stated (but incorrect) assertions about what
>conformant Unicode processes cannot do.
>
>--Ken
>
>
>
Well, I had stated such things more tentatively to start with, asking
for contrary views and interpretations, but received none until now
except for Mark's very generalised implication that I had said something
wrong (and, incorrectly, that I hadn't read the relevant part of the
standard). Please, those of you who do know what is correct, keep us on
the right path. Otherwise the confusion will spread.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

