Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Dec 10 2003 - 21:42:04 EST

    Peter Kirk continued:

    > >Once again, people are falling afoul of the subtle distinctions
    > >that the Unicode conformance clauses are attempting to make.
    > >
    > >
    > In that case the distinctions are too subtle and need to be clarified.
    > C9 states that "no process can assume that another process will make a
    > distinction between two different, but canonical-equivalent character
    > sequences."

    No, C9 states:

    <quote>
    C9 A process shall not assume that the interpretations of two
    canonical-equivalent character sequences are distinct.
    </quote>

    You are quoting out of an explanatory bullet to clause C9. And
    in that context, it should be perfectly clear that the "distinctions"
    we are talking about are distinctions of interpretation. The
    *subsection* of Section 3.2 that C9 occurs in is also labelled
    "Interpretation".

    Quoting statements from the standard out of context, and then
    asserting that "distinction" means something other than it
    clearly does when seen *in context* isn't helping to make
    your case any.

    > If that in fact should be "no process can assume that
    > another process will *give different interpretations to* two different,
    > but canonical-equivalent character sequences", then that is what should
    > be written.

    O.k., that kind of explicitness might help others understand
    the text.

    > And even then the word "interpretation" needs to be clearly
    > defined, see below.

    "Interpretation" has been *deliberately* left undefined. It falls
    back to its general English usage, because attempting a
    technical definition of "interpretation" in the context of
    the Unicode Standard runs too far afield from the intended
    area of standardization. The UTC would end up bogged down
    in linguistic and semiotic theory attempting to nail this
    one down.

    What *is* clear is that a "distinction in interpretation of
    a character or character sequence" cannot be confused, by
    any careful reader of the standard, with "difference in
    code point or code point sequence". The latter *is* defined
    and totally unambiguous in the standard.

    > >It is perfectly conformant with the Unicode Standard to assert
    > >that <U+00E9> "é" and <U+0065, U+0301> "é" are different
    > >Unicode strings. They *are* different Unicode strings. They
    > >contain different encoded characters, and they have different
    > >lengths. ...
    > >
    > But they are "two different, but canonical-equivalent character
    > sequences", and as such "no process can assume that another process will
    > make a distinction between" them.
             ^^^^^^^^^^^
             distinction in interpretation
             
    You are quoting out of context again.

    > C9 does not say that certain
    > distinctions may be assumed and others may not.

    If you read it right, it absolutely *does* indicate that.

    > >... And any Unicode-conformant process that treated the
    > >second string as if it had only one code unit and only
    > >one encoded character in it would be a) stupid, and b)
    > >non-conformant. A Unicode process can not only assume that
    > >another Unicode-conformant process can make this distinction --
    > >it should *expect* it to or will run into interoperability
    > >problems.
    > >
    > >
    > Well, this goes entirely against how I had read and understood the
    > conformance clauses. The problem is, what does "interpretation" mean?

    "Interpretation" means..., well, it means "what it means".

    If you want to bandy semiotics, be my guest, but the Unicode
    Standard is not a semiotic standard. It is a character encoding
    standard.

    > >What canonical equivalence is about is making non-distinctions
    > >in the *interpretation* of equivalent sequences. No Unicode-
    > >conformant process should assume that another process will
    > >systematically distinguish a meaningful interpretation
    > >difference between <U+00E9> "é" and <U+0065, U+0301> "é" --
    > >they both represent the *same* abstract character, namely
    > >an e-acute. And because of the conformance requirements
    > >in the Unicode Standard, I am not allowed to call some
    > >other process wrong if it claims to be handing me an "e-acute"
    > >and delivers <U+0065, U+0301> when I was expecting to
    > >see just <U+00E9>. ...
    > >
    > Well, the question here hangs on the meaning of "interpretation". I
    > understood "interpretation" to include such matters as determining the
    > number of characters in a string (although I carefully distinguished
    > that from determining the number of memory units required to store it,
    > which depends also on the encoding form and is at a quite different
    > level).

    Well, then please correct your interpretation of interpretation.

    <U+00E9> has one code point in it. It has one encoded character in it.

    <U+0065, U+0301> has two code points in it. It has two encoded
    characters in it.

    The two sequences are distinct and distinguished and
    distinguishable -- in terms of their code point or character
    sequences.
                
    The two sequences are canonically equivalent. They are not
    *interpreted* differently, since they both *mean* the same
    thing -- they are both interpreted as referring to the letter of
    various Latin alphabets known as "e-acute".

    *That* is what the Unicode Standard "means" by canonical equivalence.
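
    As a concrete sketch only (here in Python, using the standard
    unicodedata module purely for illustration; nothing about the
    choice of language or library is implied by the standard):

        import unicodedata

        precomposed = "\u00e9"     # <U+00E9>
        decomposed  = "e\u0301"    # <U+0065, U+0301>

        # Distinct as Unicode strings: different code points,
        # different lengths.
        assert precomposed != decomposed
        assert len(precomposed) == 1   # one encoded character
        assert len(decomposed) == 2    # two encoded characters

        # Canonically equivalent: both normalize to the same
        # sequence, i.e. both are interpreted as the same e-acute.
        assert (unicodedata.normalize("NFD", precomposed)
                == unicodedata.normalize("NFD", decomposed))

    Binary comparison of the code point sequences distinguishes them;
    comparison under canonical equivalence (after normalization)
    does not.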

    > I would understand a different character count to be "a
    > meaningful interpretation difference". As for the question "is this
    > string normalised?", at the interpretation level I have in mind that is
    > in fact a meaningless question because normalisation is, or should be,
    > hidden at a lower level.

    Then you are still mixing levels. You are operating here in terms
    of "user-perceived characters", but those are *not* a primitive
    of the Unicode Standard, and are not well-defined there, precisely
    because the character encoding per se cannot be and is not based
    entirely on psychological memes residing in the heads of the users
    of various written orthographies. It isn't arbitrarily disconnected
    from meaningful units that end users think of as "letters" or
    "syllables" or other useful graphological units, but those are
    not the determinative factors for the encoding itself nor for its
    statement of conformance requirements.

    If you are operating at a level where the question "is this string
    normalised" is meaningless, then you are talking about text
    content and not about the level where the conformance requirements
    of the Unicode Standard are relevant. No wonder you and others
    are confused.

    Of course, if I look on a printed page of text and see the word
    "café" rendered there as a token, it is meaningless to talk about
    whether the é is normalized or not. It just is a manifest token
    of the letter é, rendered on the page. The whole concept of
    Unicode normalization is irrelevant to a user at that level. But
    it does not follow that normalization distinctions cannot be made
    conformantly in the encoded character stores used for the digital
    representation of text -- which is precisely where Unicode
    conformance issues apply.
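
    Again purely as an illustrative sketch (Python's unicodedata
    module; is_normalized requires Python 3.8 or later):

        import unicodedata

        s = "cafe\u0301"   # "café" stored with a decomposed e-acute

        # At the level of encoded character storage, the question
        # "is this string normalized?" is meaningful and answerable.
        assert not unicodedata.is_normalized("NFC", s)  # not in NFC
        assert unicodedata.is_normalized("NFD", s)      # already in NFD

        # Equivalently, compare against the normalized form.
        assert unicodedata.normalize("NFC", s) != s

    Whether and when an application normalizes its stored text is a
    separate design question; the point is only that the distinction
    is well-defined at this level.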

    > But it seems that you are viewing the whole thing from a different level
    > from me. I am looking on as a user or an application programmer. You are
    > looking at Unicode internally, as a systems programmer. At that lower
    > level, yes, of course normalisation forms have to be distinguished
    > because that is the level at which normalisation is carried out.

    It isn't application programmer versus systems programmer. It is
    digital representation of text in terms of encoded characters
    versus end user interaction with rendered (printed, displayed)
    text.

    >
    > >... The whole point of normalization is
    > >to make life for implementations worth living in such a

    > Well, there is an interesting philosophical question here. With a normal
    > literary text, the interpretation of it intended by the author is
    > generally considered to be definitive. Humpty Dumpty was right when
    > talking about what he had written. But that is not true of laws, and I
    > suppose that it is similarly not true of standards.

    Standards are not laws. (Nor are they literary texts.) They
    are technical specifications which aim at enabling
    interoperable implementations of whatever they are
    a standard for. (At least the kinds of IT standards we
    are talking about here.)

    Standards are not adjudicated by case law. They are not
    interpreted by judges. If something is unclear in a standard,
    that is generally simply reported back as a defect to the
    standardization committee, which attempts to reach a
    consensus regarding what the actual intent was and then
    instructs the editor(s) to rewrite things so that the
    intent (which often turns out to be what everybody is
    implementing anyway) is made clearer. Or in some cases
    (see IEEE standards for examples), the standards development
    organization may issue a formal "clarification" spelling out
    the interpretation of a point that was unclear.

    > There is assumed to
    > be some objectivity to the language in which they are written. The
    > implication is that your assertion that what you have written is
    > conformant cannot be trusted a priori but must be tested against the
    > text of the standard as written and agreed. In principle any dispute
    > might have to be settled by a judge, and on the basis only of what is
    > written, not of what you claim was intended. While I certainly don't
    > intend to take this to court, I think I would have a reasonable case if
    > I did!

    I don't think it is reasonable. If anything, it is approaching
    harebrained here (sorry for the ad hominem), because it doesn't
    reflect the reality of IT standards development.

    What is often clearest to the standards development committee
    is what the intended behavior is to be. Writing that into the
    formal text of the standard, on the other hand, may stress the
    rhetorical capabilities of the authors, and you can end up with
    text that doesn't necessarily do the intent justice. Hence the
    need, for example, to keep rewriting the conformance clauses
    of the Unicode Standard until the character model finally
    started to gel and make sense to people implementing the standard.

    Trying to go legalistic, and to give objective primacy to the
    text of the standard, especially when you interpret that text
    differently from the people on the originating committee who
    *wrote* it, and in the face of counter-opinions from engaged
    members of the responsible committee, is not, in my opinion,
    doing anybody any favors here.

    >
    > Of course it is possible for those conformance clauses to be rewritten
    > (they aren't fixed by the stability policy, are they?).

    Nope.

    > That is probably
    > what is necessary.

    In general, yes. If people are misinterpreting some key part of
    the conformance requirements of the standard, then both the
    UTC and the editors are interested in ensuring that the
    wording of the text is not encouraging such (mis)interpretations.

    > Such a rewrite would require a change to the sentence
    > "no process can assume that another process will make a distinction
    > between two different, but canonical-equivalent character sequences"

    Could be, but note that, as above, this is already in an
    explanatory bullet and is not the normative part of C9. Certainly,
    if it is causing misinterpretation, the editors can address that,
    but I'm hearing from other people on the list who are not having
    trouble coming to the correct conclusions in this particular
    instance.

    > and
    > a proper definition of "interpretation".

    Won't happen. See above.

    > Well, I had stated such things more tentatively to start with, asking
    > for contrary views and interpretations, but received none until now
    > except for Mark's very generalised implication that I had said something
    > wrong (and, incorrectly, that I hadn't read the relevant part of the
    > standard). Please, those of you who do know what is correct, keep us on
    > the right path. Otherwise the confusion will spread.

    I'll try. :-)

    --Ken


