Re: Unicode Normalisation Optimisation Experiments

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Sep 25 2003 - 08:53:06 EDT

    On 25/09/2003 12:27, jon@spin.ie wrote:

    >>Is this actually correct? For example, if I have in my data the string
    >><U+0104, U+05B0> (which I know is garbage, but that is irrelevant), that
    >>will decompose and reorder to <U+0041, U+05B0, U+0328>, as U+0328 has a
    >>higher combining class (202) than U+05B0 (10). What does this become in
    >>NFC? Is the reordering reversed and the combination reapplied?
    >>
    >>
    >
    >First an attempt is made to compose U+0041 and U+05B0. No precomposed character exists for that pair, so the attempt fails. Then an attempt is made to compose U+0041 and U+0328, which produces U+0104. U+0041 is replaced with U+0104 and U+0328 is removed, resulting in <U+0104, U+05B0>.
    >
    >It's not a reordering per se, as the first combining character is given the first "opportunity" to combine.
    >
    >
    Thanks for the clarification.
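
    For anyone who wants to check this behaviour, here is a minimal sketch
    using Python's standard unicodedata module. The helper name codepoints
    is my own, and the results depend on the Unicode data shipped with the
    interpreter, though these particular characters have long been stable.

```python
import unicodedata

def codepoints(text):
    """Format a string as a list of U+XXXX labels (illustrative helper)."""
    return ["U+%04X" % ord(ch) for ch in text]

# LATIN CAPITAL LETTER A WITH OGONEK followed by HEBREW POINT SHEVA
s = "\u0104\u05B0"

# Canonical combining classes that drive the reordering
ccc_ogonek = unicodedata.combining("\u0328")  # 202
ccc_sheva = unicodedata.combining("\u05B0")   # 10

nfd = unicodedata.normalize("NFD", s)
nfc = unicodedata.normalize("NFC", s)

print(codepoints(nfd))  # ['U+0041', 'U+05B0', 'U+0328']
print(codepoints(nfc))  # ['U+0104', 'U+05B0']
```

    So the sheva sorts before the ogonek in NFD because 10 < 202, and in
    NFC the ogonek still composes with the base across the sheva.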

    >
    >
    >>This is not only a theoretical issue as the same applies to some real
    >>combinations. There was discussion only last week on the bidi list of a
    >>form which might be encoded <U+064A, U+0652, U+0654> but which would be
    >>messed up if composed into <U+0626, U+0652>.
    >>
    >>
    >
    >Yes, NFC would perform that composition. Are you sure it would be an issue? Applying bidi rules doesn't seem to make this an issue.
    ><U+064A, U+0652, U+0654>
    >bidi: AL, NSM, NSM
    >applying rule W1 from UAX #9:
    >AL, NSM, NSM -> AL, AL, NSM -> AL, AL, AL.
    >
    ><U+0626, U+0652>
    >bidi: AL, NSM
    >applying rule W1:
    >AL, NSM -> AL, AL
    >
    >Or is the issue with something else, but it came up on the bidi list?
    >
    >
    >
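    The W1 resolution quoted above can be sketched roughly as follows. This
    is only an illustration: resolve_w1 is a hypothetical helper, it looks
    at a single run, assumes a start-of-sequence type of "R", and ignores
    the special handling of isolates and the rest of the bidi algorithm.

```python
import unicodedata

def resolve_w1(text, sos="R"):
    """Simplified sketch of rule W1: each NSM takes the type of the
    previous character (or of sos at the start of the run)."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    resolved = []
    prev = sos  # assumed start-of-sequence type
    for cls in classes:
        if cls == "NSM":
            cls = prev
        resolved.append(cls)
        prev = cls
    return classes, resolved

print(resolve_w1("\u064A\u0652\u0654"))
# (['AL', 'NSM', 'NSM'], ['AL', 'AL', 'AL'])
print(resolve_w1("\u0626\u0652"))
# (['AL', 'NSM'], ['AL', 'AL'])
```
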
    The problem isn't with the bidi rules but with Arabic shaping more
    generally. There are two issues. First, the position of the hamza: in
    this case it should be to the left of the sukun. Second, the medial
    form of U+064A has dots below, which are required in this combination,
    whereas the medial form of U+0626 has none. But I think we concluded
    that U+0654 alone is not suitable for encoding this particular hamza.
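
    To confirm that NFC really does perform this composition, a short check
    with Python's unicodedata module is enough. This is a sketch under the
    assumption that the interpreter's Unicode data matches the standard for
    these characters; the variable names are mine.

```python
import unicodedata

# YEH, SUKUN, HAMZA ABOVE
seq = "\u064A\u0652\u0654"

# The sukun (class 34) does not block the hamza above (class 230),
# because its combining class is lower and non-zero.
ccc_sukun = unicodedata.combining("\u0652")
ccc_hamza = unicodedata.combining("\u0654")

nfc = unicodedata.normalize("NFC", seq)
print(["U+%04X" % ord(ch) for ch in nfc])  # ['U+0626', 'U+0652']

# NFD takes the composed form back to the original sequence
nfd = unicodedata.normalize("NFD", nfc)
print(nfd == seq)  # True
```

    The round trip is lossless at the code point level, so the damage is to
    the intended rendering rather than to the text itself.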

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

