Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 11 2003 - 07:29:44 EST

    On 10/12/2003 18:42, Kenneth Whistler wrote:

    > ...
    >
    >>And even then the word "interpretation" needs to be clearly
    >>defined, see below.
    >>
    >>
    >
    >"Interpretation" has been *deliberately* left undefined. It falls
    >back to its general English usage, because attempting a
    >technical definition of "interpretation" in the context of
    >the Unicode Standard runs too far afield from the intended
    >area of standardization. The UTC would end up bogged down
    >in linguistic and semiotic theory attempting to nail this
    >one down.
    >
    >What *is* clear is that a "distinction in interpretation of
    >a character or character sequence" cannot be confused, by
    >any careful reader of the standard, with "difference in
    >code point or code point sequence". The latter *is* defined
    >and totally unambiguous in the standard.
    >
    >

    Thanks for the clarification. We are again talking at different levels.
    I am still looking from the point of view of an application programmer
    interested in a string as an abstract entity (an object or an abstract
    data type) with a meaning or interpretation, but with no interest in the
    exact encoding. You are looking at this at a lower level, either of a
    systems programmer or of an application programmer who is forced to get
    into this lower level stuff because of inadequate system support at the
    more abstract level.

    > ...
    >
    >Well, then please correct your interpretation of interpretation.
    >
    ><U+00E9> has one code point in it. It has one encoded character in it.
    >
    ><U+0065, U+0301> has two code points in it. It has two encoded
    > characters in it.
    >
    >The two sequences are distinct and distinguished and
    >distinguishable -- in terms of their code point or character
    >sequences.
    >
    >The two sequences are canonically equivalent. They are not
    >*interpreted* differently, since they both *mean* the same
    >thing -- they are both interpreted as referring to the letter of
    >various Latin alphabets known as "e-acute".
    >
    >*That* is what the Unicode Standard "means" by canonical equivalence.
    >
    >
    >
    Thanks again for the clarification. Again, I am not interested in code
    point sequences but in meaning. I have been forced to get involved in
    code point issues when I have found that they have not made the
    necessary meaning distinctions. But my interest is essentially higher
    level, which is why I am trying to push all of these non-meaningful
    distinctions into a low level hidden from my view.
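
    The distinction between the two levels can be sketched in a few lines.
    This is only an illustration, assuming Python's standard unicodedata
    module as a stand-in for whatever normalisation support a system
    provides; the helper names code_points and canonically_equivalent are
    hypothetical and not taken from the standard.

```python
import unicodedata

precomposed = "\u00e9"    # <U+00E9>
decomposed = "e\u0301"    # <U+0065, U+0301>


def code_points(s):
    """List the code points of s in U+XXXX notation (the low-level view)."""
    return [f"U+{ord(ch):04X}" for ch in s]


def canonically_equivalent(a, b):
    """True if a and b have identical NFD forms (the higher-level view)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(code_points(precomposed))   # ['U+00E9']
print(code_points(decomposed))    # ['U+0065', 'U+0301']
print(precomposed == decomposed)  # False: the code point sequences differ
print(canonically_equivalent(precomposed, decomposed))  # True: same e-acute
```

    The plain == comparison works on code point sequences, which is the
    level Ken describes; the normalising comparison is the one an
    application interested only in meaning would want to see.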

    >...
    >
    >If you are operating at a level where the question "is this string
    >normalised" is meaningless, then you are talking about text
    >content and not about the level where the conformance requirements
    >of the Unicode Standard are relevant. No wonder you and others
    >are confused.
    >
    >Of course, if I look on a printed page of text and see the word
    >"café" rendered there as a token, it is meaningless to talk about
    >whether the é is normalized or not. It just is a manifest token
    >of the letter é, rendered on the page. The whole concept of
    >Unicode normalization is irrelevant to a user at that level. But
    >you cannot infer from that that normalization distinctions cannot
    >be made conformantly in the encoded character stores for
    >digital representation of text -- which is the relevant field
    >where Unicode conformance issues apply.
    >
    >
    >
    Ken, now you seem to be trying to define out of existence a level at
    which C7-C9 and probably also C10 (at least the part about
    canonical-equivalent sequences) are relevant. I accept, because of your
    explanation above, that there is a lower level at which they are not
    relevant, because it is concerned with encoded character sequences and
    not with interpretation. But above that level there is surely a separate
    level at which interpretation is relevant, and that is not just the
    level of printed texts outside a computer system. If there isn't such a
    level, C7-C10 are redundant and meaningless.

    At the level I have in mind all kinds of important processes take place
    within a computer system. Some of these are defined by Unicode, e.g.
    collation, which is independent of the canonically equivalent form
    because it starts with normalisation. Others, e.g. automatic translation,
    are not defined by Unicode. For all processing at this level "Ideally,
    an implementation would always interpret two canonical-equivalent
    character sequences identically" (quote from C9). Rendering is also
    effectively at this level. And at this level the question "is this
    string normalised?" is meaningless, because we are looking at the text
    content and its interpretation, and not at the encoded form. There is of
    course an encoded form lying behind that text content, but that should
    be no more the concern of the end user than the UTF form or than the
    pattern of on and off transistors or magnetic particles in the
    computer's memory, and it should be hidden from the end user by an API.
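
    A minimal sketch of such an API, assuming Python and its standard
    unicodedata module. The class name Text and its methods are
    hypothetical, chosen only for illustration; the ordering shown is plain
    code point order on the normalised form, not the Unicode Collation
    Algorithm, and serves only to show a result that does not depend on
    which canonically equivalent form was supplied.

```python
import unicodedata


class Text:
    """Hypothetical string type that hides the encoded form from its user."""

    def __init__(self, s):
        # Store one canonical form privately; callers never see which
        # canonically equivalent sequence was originally supplied.
        self._nfd = unicodedata.normalize("NFD", s)

    def __eq__(self, other):
        if not isinstance(other, Text):
            return NotImplemented
        return self._nfd == other._nfd

    def __hash__(self):
        return hash(self._nfd)

    def __lt__(self, other):
        # Code point order of the NFD form (an assumption for this sketch,
        # not real collation).
        if not isinstance(other, Text):
            return NotImplemented
        return self._nfd < other._nfd

    def __str__(self):
        # Emit NFC on output; any canonically equivalent form would do.
        return unicodedata.normalize("NFC", self._nfd)


a = Text("caf\u00e9")     # precomposed e-acute
b = Text("cafe\u0301")    # e followed by combining acute

print(a == b)             # True
print(len({a, b}))        # 1
print(str(a) == str(b))   # True
```

    At this level there is no way to ask "is this string normalised?",
    because the type offers no view of the stored sequence at all.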

    > ...
    >
    >Standards are not adjudicated by case law. They are not
    >interpreted by judges. ...
    >
    Surely in principle they could be, if there were, for example, a dispute
    over fulfilment of a contract which specified that a product must
    conform to Unicode. But this is a red herring here, I realise.

    > ...
    >
    >>Well, I had stated such things more tentatively to start with, asking
    >>for contrary views and interpretations, but received none until now
    >>except for Mark's very generalised implication that I had said something
    >>wrong (and, incorrectly, that I hadn't read the relevant part of the
    >>standard). Please, those of you who do know what is correct, keep us on
    >>the right path. Otherwise the confusion will spread.
    >>
    >>
    >
    >I'll try. :-)
    >
    >
    Thank you, and thank you for giving your time to this issue.

    >--Ken

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

