Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Dec 11 2003 - 10:40:00 EST


    Peter, here is your original remark. Ken has gracefully filled the gap in
    explaining the higher-level issues, but let's return to that for a minute.

    >>No, surely not. If the wcslen() function is fully Unicode conformant, it
    >>should give the same output whatever the canonically equivalent form of
    >>its input. That more or less implies that it should normalise its input.

    Talking about looking at the problem "at levels" really obscures the issues.
    Programmers call functions. Those functions don't magically change when one
    achieves a new Level of Enlightenment.

    The function wcslen is defined as "Determines the number of characters in a
    wide-character string." In C, those are not even defined to be Unicode
    characters. IF Unicode is used, wide characters (wchar_t) may be code points or
    code units, depending on the implementation. The function is not defined -- and
    could never be redefined, without huge breakage -- to return the number of NFC
    code points.

    Part of the problem is that "character" can be interpreted in a wide variety of
    ways, which is why we were forced into developing more precise terms like code
    units. So in general:

    1. If you want a function that returns the number of code units in X, you need
    to call one that is defined to do so.
    2. If you want a function that returns the number of code points in X, you need
    to call one that is defined to do so.
    3. If you want a function that returns the number of code points in toNFC(x),
    you need to call one that is defined to do so.
    4. If you want a function that returns the number of grapheme clusters in X, you
    need to call one that is defined to do so.
    5. If you want a function that returns the number of glyphs in X using font F
    and parameters P, you need to call one that is defined to do so.
    - And so on.

    There is a pattern here.
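    The pattern above can be sketched as distinct, explicitly named functions.
    This is illustrative Python, not any particular API; the grapheme-cluster
    counter uses a deliberately simplified rule (a base character plus its
    trailing combining marks), not the full UAX #29 boundary rules:

```python
import unicodedata

def count_utf16_code_units(s: str) -> int:   # item 1 (UTF-16 chosen arbitrarily)
    return len(s.encode("utf-16-le")) // 2

def count_code_points(s: str) -> int:        # item 2
    return len(s)

def count_nfc_code_points(s: str) -> int:    # item 3
    return len(unicodedata.normalize("NFC", s))

def count_grapheme_clusters(s: str) -> int:  # item 4, simplified: each
    # non-combining character starts a new cluster; combining marks attach
    # to the preceding base. Real implementations follow UAX #29.
    return sum(1 for ch in s if not unicodedata.combining(ch))

s = "cafe\u0301"   # "café" with a decomposed é
print(count_code_points(s))        # 5
print(count_nfc_code_points(s))    # 4
print(count_grapheme_clusters(s))  # 4
```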

    Of course in reality, there might not be individual functions for these. The
    most commonly used of these functions will always be #1, no matter what one's
    Level of Enlightenment is. That's because people typically need to know how much
    actual storage a string takes. Glyph counts are used in the guts of rendering,
    as one does text layout, but are generally hidden from all but the most (or
    least -- I'm not sure of the Official Precedence List!) Enlightened, such as
    Paul or Eric. For grapheme clusters, it is typically more useful to have a
    function not for how many there are in a string, but for whether you are on a
    boundary, and what the previous/next boundaries are (the 'how many' can be
    derived from these, of course). [This is the same for a lot of higher-level
    constructs, like word-breaks or line-breaks.]
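    The boundary-oriented shape of such an API might look like the following
    sketch (illustrative names, and again a simplified boundary rule -- a
    boundary falls before any non-combining character -- rather than full
    UAX #29). Note how the count falls out of boundary iteration:

```python
import unicodedata

def is_boundary(s: str, i: int) -> bool:
    # Simplified rule: a boundary falls at both ends of the string and
    # before any non-combining character.
    if i <= 0 or i >= len(s):
        return True
    return unicodedata.combining(s[i]) == 0

def next_boundary(s: str, i: int) -> int:
    # Advance from offset i to the next cluster boundary.
    i += 1
    while not is_boundary(s, i):
        i += 1
    return i

def count_clusters(s: str) -> int:
    # The 'how many' is derived from the boundary functions.
    n, i = 0, 0
    while i < len(s):
        i = next_boundary(s, i)
        n += 1
    return n

print(count_clusters("e\u0301tude"))  # 5 clusters in decomposed "étude"
```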

    For canonical equivalence, there are (at least) two strategies (see UAX #15
    Annex #13). Ensure that your strings are in a particular format (e.g. NFC), then
    just count code points (#2). Or call a function designed to do #3; such a
    function can be faster than countCodePoints(toNFC(X)), because it can have all
    sorts of optimizations that people like Markus revel in.
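    As a sketch of the two strategies (function names illustrative; the "fast"
    version shows only the simplest such optimization, a quick check that skips
    normalization when the input is already NFC, using Python 3.8+'s
    unicodedata.is_normalized):

```python
import unicodedata

def count_nfc_code_points_simple(s: str) -> int:
    # Strategy A: normalize first, then count code points.
    return len(unicodedata.normalize("NFC", s))

def count_nfc_code_points_fast(s: str) -> int:
    # Strategy B sketch: the common already-normalized case skips the
    # normalization pass entirely; real optimized implementations go
    # much further than this.
    if unicodedata.is_normalized("NFC", s):
        return len(s)
    return len(unicodedata.normalize("NFC", s))

# Canonically equivalent inputs give the same answer either way:
for s in ("caf\u00e9", "cafe\u0301"):
    assert count_nfc_code_points_simple(s) == count_nfc_code_points_fast(s) == 4
```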

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄

    ----- Original Message -----
    From: "Peter Kirk" <peterkirk@qaya.org>
    To: "Kenneth Whistler" <kenw@sybase.com>
    Cc: <unicode@unicode.org>
    Sent: Thu, 2003 Dec 11 04:29
    Subject: Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

    > On 10/12/2003 18:42, Kenneth Whistler wrote:
    >
    > > ...
    > >
    > >>And even then the word "interpretation" needs to be clearly
    > >>defined, see below.
    > >>
    > >>
    > >
    > >"Interpretation" has been *deliberately* left undefined. It falls
    > >back to its general English usage, because attempting a
    > >technical definition of "interpretation" in the context of
    > >the Unicode Standard runs too far afield from the intended
    > >area of standardization. The UTC would end up bogged down
    > >in linguistic and semiotic theory attempting to nail this
    > >one down.
    > >
    > >What *is* clear is that a "distinction in interpretation of
    > >a character or character sequence" cannot be confused, by
    > >any careful reader of the standard, with "difference in
    > >code point or code point sequence". The latter *is* defined
    > >and totally unambiguous in the standard.
    > >
    > >
    >
    > Thanks for the clarification. We are again talking at different levels.
    > I am still looking from the point of view of an application programmer
    > interested in a string as an abstract entity (an object or an abstract
    > data type) with a meaning or interpretation, but with no interest in the
    > exact encoding. You are looking at this at a lower level, either of a
    > systems programmer or of an application programmer who is forced to get
    > into this lower level stuff because of inadequate system support at the
    > more abstract level.
    >
    > > ...
    > >
    > >Well, then please correct your interpretation of interpretation.
    > >
    > ><U+00E9> has one code point in it. It has one encoded character in it.
    > >
    > ><U+0065, U+0301> has two code points in it. It has two encoded
    > > characters in it.
    > >
    > >The two sequences are distinct and distinguished and
    > >distinguishable -- in terms of their code point or character
    > >sequences.
    > >
    > >The two sequences are canonically equivalent. They are not
    > >*interpreted* differently, since they both *mean* the same
    > >thing -- they are both interpreted as referring to the letter of
    > >various Latin alphabets known as "e-acute".
    > >
    > >*That* is what the Unicode Standard "means" by canonical equivalence.
    > >
    > >
    > >
    > Thanks again for the clarification. Again, I am not interested in code
    > point sequences but in meaning. I have been forced to get involved in
    > code point issues when I have found that they have not made the
    > necessary meaning distinctions. But my interest is essentially higher
    > level, which is why I am trying to push all of these non-meaningful
    > distinctions into a low level hidden from my view.
    >
    > >...
    > >
    > >If you are operating at a level where the question "is this string
    > >normalised" is meaningless, then you are talking about text
    > >content and not about the level where the conformance requirements
    > >of the Unicode Standard are relevant. No wonder you and others
    > >are confused.
    > >
    > >Of course, if I look on a printed page of text and see the word
    > >"café" rendered there as a token, it is meaningless to talk about
    > >whether the é is normalized or not. It just is a manifest token
    > >of the letter é, rendered on the page. The whole concept of
    > >Unicode normalization is irrelevant to a user at that level. But
    > >you cannot infer from that that normalization distinctions cannot
    > >be made conformantly in the encoded character stores for
    > >digital representation of text -- which is the relevant field
    > >where Unicode conformance issues apply.
    > >
    > >
    > >
    > Ken, now you seem to be trying to define out of existence a level at
    > which C7-C9 and probably also C10 (at least the part about
    > canonical-equivalent sequences) are relevant. I accept, because of your
    > explanation above, that there is a lower level at which they are not
    > relevant, because it is concerned with encoded character sequences and
    > not with interpretation. But above that level there is surely a separate
    > level at which interpretation is relevant, and that is not just the
    > level of printed texts outside a computer system. If there isn't such a
    > level, C7-C10 are redundant and meaningless.
    >
    > At the level I have in mind all kinds of important processes take place
    > within a computer system. Some of these are defined by Unicode, e.g.
    > collation, which is independent of the canonically equivalent form
    > because it starts with normalisation. Others e.g. automatic translation
    > are not defined by Unicode. For all processing at this level "Ideally,
    > an implementation would always interpret two canonical-equivalent
    > character sequences identically" (quote from C9). Rendering is also
    > effectively at this level. And at this level the question "is this
    > string normalised?" is meaningless, because we are looking at the text
    > content and its interpretation, and not at the encoded form. There is of
    > course an encoded form lying behind that text content, but that should
    > be no more the concern of the end user than the UTF form or than the
    > pattern of on and off transistors or magnetic particles in the
    > computer's memory, and it should be hidden from the end user by an API.
    >
    > > ...
    > >
    > >Standards are not adjudicated by case law. They are not
    > >interpreted by judges. ...
    > >
    > Surely in principle they could be, if there was for example a dispute
    > over fulfilment of a contract which specified that a product must
    > conform to Unicode. But this is a red herring here, I realise.
    >
    > > ...
    > >
    > >>Well, I had stated such things more tentatively to start with, asking
    > >>for contrary views and interpretations, but received none until now
    > >>except for Mark's very generalised implication that I had said something
    > >>wrong (and, incorrectly, that I hadn't read the relevant part of the
    > >>standard). Please, those of you who do know what is correct, keep us on
    > >>the right path. Otherwise the confusion will spread.
    > >>
    > >>
    > >
    > >I'll try. :-)
    > >
    > >
    > Thank you, and thank you for giving your time to this issue.
    >
    > >--Ken
    > >
    >
    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
    >



    This archive was generated by hypermail 2.1.5 : Thu Dec 11 2003 - 11:21:49 EST