Re: An attempt to focus the PUA discussion [long]

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Apr 29 2004 - 19:56:31 EDT

    Peter Kirk wrote, in response to Ernest Cline:

    > >... It simply is impossible
    > >to simulate non-zero canonical combining class characters in Unicode
    > >with anything other than a character with the appropriate canonical
    > >combining class. ...
    > >
    >
    > True. But fortunately Unicode doesn't really need to worry about
    > normalisation of PUA data, as this is surely out of its scope.

    Not quite. PUA code points are subject to the Unicode normalization
    algorithm, just like any other code points. Their behavior in NFC or
    NFD, for example, is rigidly defined, if trivial: a PUA code point
    normalizes to itself.
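
    As a quick illustration of that point, here is a minimal Python
    sketch using the standard `unicodedata` module. The variable and
    function names are just placeholders of mine; the only claim being
    checked is that U+E000 comes back unchanged from every
    normalization form, and that it carries the default properties of
    a private-use code point (general category Co, combining class 0).

```python
import unicodedata

PUA_CHAR = "\uE000"  # first code point of the BMP Private Use Area


def normalize_all_forms(text):
    """Return the text under each of the four Unicode normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


forms = normalize_all_forms(PUA_CHAR)

# True if every normalization form leaves the PUA code point untouched.
is_stable = all(result == PUA_CHAR for result in forms.values())

# Default properties that the standard assigns to private-use code points.
pua_category = unicodedata.category(PUA_CHAR)  # general category
pua_ccc = unicodedata.combining(PUA_CHAR)      # canonical combining class

print(is_stable, pua_category, pua_ccc)
```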

    That doesn't prevent a user of PUA code points, defined as
    whatever characters, from also defining equivalences among
    PUA characters (or sequences) or between PUA characters
    (or sequences) and standard Unicode characters (or sequences),
    and then "normalizing" data based on those equivalences.
    Any such normalization would *not*, however, be Unicode
    normalization, as defined in UAX #15. Such equivalences, which
    depend on interpretations of PUA code points that are the
    result of private agreement, are themselves subject to
    private agreement.

    So, to specify it in terms of actual code points, and using
    my previous example, if I define:

    U+E000 = SNOREFRED LOGO CHARACTER (glyph: an elmtree with two chipmunks)

    U+E001 = ELMTREE SYMBOL
    U+E002 = COMBINING CHIPMUNK

    Nothing would prevent me from asserting that
    <E000> is equivalent to <E001, E002, E002>, under my interpretation
    of what those characters mean.

    On the other hand, I could not expect any software doing
    Unicode normalization to pay any attention to *my* interpretation
    of those equivalences, and if I really wanted to process data
    using such equivalences, it would be up to me to write the
    software to do so.
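
    A rough sketch of what "writing the software" might look like, in
    Python. The table and the two function names
    (`PRIVATE_DECOMPOSITIONS`, `private_decompose`, `private_compose`)
    are hypothetical and exist only for this example; the single
    equivalence is the one from my private agreement above. Note that
    this is *not* Unicode normalization, and the standard library's
    normalizer leaves both spellings exactly as they are.

```python
import unicodedata

# Hypothetical private agreement: <E000> is equivalent to <E001, E002, E002>.
PRIVATE_DECOMPOSITIONS = {
    "\uE000": "\uE001\uE002\uE002",
}


def private_decompose(text):
    """Expand each privately defined composite into its private sequence."""
    return "".join(PRIVATE_DECOMPOSITIONS.get(ch, ch) for ch in text)


def private_compose(text):
    """Replace each private sequence with its privately defined composite."""
    for composite, sequence in PRIVATE_DECOMPOSITIONS.items():
        text = text.replace(sequence, composite)
    return text


logo = "\uE000"
spelled_out = "\uE001\uE002\uE002"

# Unicode normalization knows nothing about the private equivalence,
# so the two spellings remain distinct after NFD.
unicode_sees_equal = (
    unicodedata.normalize("NFD", logo) == unicodedata.normalize("NFD", spelled_out)
)

# The private "normalization" treats them as the same thing.
private_sees_equal = private_decompose(logo) == private_decompose(spelled_out)

print(unicode_sees_equal, private_sees_equal)
```

    Any two parties exchanging such data would both have to run the
    same private table for the comparison to mean anything.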

    > So it
    > would not be fatal to use class 0 combining characters in PUA scripts,
    > and leave to the user any possible burden of ensuring that multiple
    > combining marks are correctly ordered. It would be enough to indicate
    > that a particular character is, or is currently being used as, a
    > combining mark.

    Correct. It would also be up to you to ensure that any font you
    designed handled it correctly.
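
    To make the ordering burden concrete, here is a small Python
    sketch. The first half only uses standard behavior: canonical
    ordering in NFD sorts U+0323 (class 220) ahead of U+0301 (class
    230), while a run of PUA marks, all class 0, is left in whatever
    order it was entered. The second half is purely an assumption for
    illustration: `PRIVATE_CCC`, `private_reorder`, and the extra mark
    U+E003 are made up here to show how a user might impose a private
    ordering of their own.

```python
import unicodedata

# Standard marks: acute (ccc 230) typed before dot below (ccc 220).
standard = "a\u0301\u0323"
standard_nfd = unicodedata.normalize("NFD", standard)  # marks get reordered

# PUA "marks" are all class 0, so normalization never reorders them.
pua_typed = "\uE001\uE002\uE003"
pua_nfd = unicodedata.normalize("NFD", pua_typed)

# Hypothetical private combining classes, known only by private agreement.
PRIVATE_CCC = {
    "\uE002": 230,  # COMBINING CHIPMUNK, treated here as an "above" mark
    "\uE003": 220,  # invented second mark, treated here as a "below" mark
}


def private_reorder(text):
    """Stable-sort each run of private marks by its private combining class."""
    out, run = [], []
    for ch in text:
        if PRIVATE_CCC.get(ch, 0) > 0:
            run.append(ch)
        else:
            out.extend(sorted(run, key=PRIVATE_CCC.get))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=PRIVATE_CCC.get))
    return "".join(out)


pua_ordered = private_reorder(pua_typed)

print(standard_nfd == "a\u0323\u0301", pua_nfd == pua_typed)
```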

    --Ken


