Re: Freedom to Normalise (was: US ballot comments re ARIB ideographs)

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon May 05 2008 - 18:12:57 CDT


    Richard Wordingham wrote, regarding:

    > > In other words, any piece of code is free to normalize.
    >
    > Although my reply is a bit late for immediate application, I have been
    > assured that that is a fallacy for which I fell. An example of a piece of
    > code that is not free to normalise is a routine performing compliant default
    > upper-casing. Consider the recording of a no longer verifiable reading of a
    > lower case alpha with a subscript iota.

    [long complex example involving casing of alpha + iota subscript +
    diacritic omitted]

    > Presumably no Unicode-compliant process may assume that another process will
    > perform default upper-casing compliantly!

    Huh? Casing changes the interpretation of text, so it differs
    significantly from canonical equivalence.

    In general:

    interpretation(X) = interpretation(NFC(X)) = interpretation(NFD(X))

    But in general:

    interpretation(X) != interpretation(toLowercase(X))
                      != interpretation(toUppercase(X))
                      
    Of course there are many choices of X for which one or both
    of those inequalities fail to hold, but in general a casing
    transformation can (and often does) change the interpretation
    of text, in the narrow sense of "interpretation" defined in
    the Unicode conformance clauses.

    Given that, it should not be surprising that:

    interpretation(X) != interpretation(toUppercase(NFD(X)))
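
    The three relations above can be checked mechanically. The following
    is a minimal sketch in Python using the standard unicodedata module.
    The helper canonically_equivalent() is a hypothetical stand-in for
    "same interpretation": it only compares NFD forms, which is an
    assumption of this illustration, not a definition taken from the
    standard.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the NFC (canonical composed) form of s."""
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    """Return the NFD (canonical decomposed) form of s."""
    return unicodedata.normalize("NFD", s)


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical stand-in for 'same interpretation': equal after NFD."""
    return nfd(a) == nfd(b)


x = "r\u00e9sum\u00e9"  # "résumé" spelled with precomposed U+00E9

# Normalization preserves canonical equivalence.
same_under_nfc = canonically_equivalent(x, nfc(x))
same_under_nfd = canonically_equivalent(x, nfd(x))

# Casing does not, whether or not the input was normalized first.
same_under_upper = canonically_equivalent(x, x.upper())
same_under_upper_nfd = canonically_equivalent(x, nfd(x).upper())

# For this particular X, upper-casing before or after NFD yields
# canonically equivalent results.
upper_agrees = canonically_equivalent(x.upper(), nfd(x).upper())

print(same_under_nfc, same_under_nfd, same_under_upper,
      same_under_upper_nfd, upper_agrees)
```

    For this choice of X the first two checks succeed and the two casing
    checks fail, matching the equations above.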

    > Is there some subtlety here? Perhaps in what constitutes a process?

    There are myriad types of text processes. Many of them do not
    maintain text "interpretation" in the narrow sense -- they
    are *intended* to change things.

    This differs from (canonical) normalization, which by definition
    does not change the "interpretation" of text. For the purposes
    of conformance per se, if I hand you X and you hand me back
    NFC(X) or NFD(X), then you have handed me back text intended
    to have the same "interpretation". It may not be *identical*
    text, of course, because the sequence of code points could
    be different, and the length of the text may be different,
    but its interpretation should be the same.
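
    To make the "not identical, but same interpretation" point concrete,
    here is a small Python sketch. The sample string is an arbitrary
    choice for illustration; only the standard unicodedata module is
    used.

```python
import unicodedata

x = "\u00e5ngstr\u00f6m"  # "ångström" with precomposed U+00E5 and U+00F6

composed = unicodedata.normalize("NFC", x)
decomposed = unicodedata.normalize("NFD", x)

# The code point sequences differ, and so do the lengths ...
composed_points = [f"U+{ord(c):04X}" for c in composed]
decomposed_points = [f"U+{ord(c):04X}" for c in decomposed]

# ... but each form normalizes back to the other, so the two strings
# are canonically equivalent.
round_trip_ok = (
    unicodedata.normalize("NFC", decomposed) == composed
    and unicodedata.normalize("NFD", composed) == decomposed
)

print(len(composed), composed_points)
print(len(decomposed), decomposed_points)
print(round_trip_ok)
```

    A binary comparison of the two forms fails, so a receiving process
    that cares about equivalence needs to normalize both sides before
    comparing.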

    Once you start applying casing operations, you no longer have
    that claim to same interpretation. I may recognize that you
    have properly cased a string according to the default casing
    rules (in which instance you can validly claim conformance to
    those casing rules), or I may, with your agreement, recognize
    that you have applied *other* casing rules, including whatever
    conventions you want to put in effect about expanding diacritics
    across 1 <--> 2 casing transforms, but what I won't see is
    you handing me back text with the *same* (Unicode) interpretation under
    such transforms. And any neutral third party (other implementation)
    should agree with those conclusions, as well, if they have
    properly implemented Unicode normalization.
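
    A 1 <--> 2 casing transform of the kind mentioned above can be
    sketched in Python. This assumes that str.upper() applies the
    unconditional full case mappings, as CPython 3.3 and later do. The
    character chosen, lower case alpha with iota subscript (U+1FB3), is
    the plain case from the quoted discussion, without the additional
    diacritic of the omitted example.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the NFC (canonical composed) form of s."""
    return unicodedata.normalize("NFC", s)


x = "\u1fb3"  # GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI

# Upper-case the string as given, and again after decomposing it.
upper_direct = x.upper()
upper_via_nfd = unicodedata.normalize("NFD", x).upper()

# One code point in, more than one code point out.
expanded = len(upper_direct) > len(x)

# The two upper-cased results are canonically equivalent to each other ...
upper_results_equivalent = nfc(upper_direct) == nfc(upper_via_nfd)

# ... but neither is canonically equivalent to the original text.
upper_preserves_original = nfc(upper_direct) == nfc(x)

print([f"U+{ord(c):04X}" for c in upper_direct])
print(expanded, upper_results_equivalent, upper_preserves_original)
```

    The transform is a valid application of the default casing rules,
    yet the result is not canonically equivalent to the input, which is
    the distinction drawn above between casing and normalization.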

    --Ken


