Re: Unicode Normalisation Optimisation Experiments

Date: Thu Sep 25 2003 - 07:32:32 EDT

    > Is this actually correct? For example, if I have in my data the string
    > <U+0104, U+05B0> (which I know is garbage, but that is irrelevant), that
    > will decompose and reorder to <U+0041, U+05B0, U+0328>, as U+0328 has a
    > higher combining class (202) than U+05B0 (10). What does this become in
    > NFC? Is the reordering reversed and the combination reapplied?

    First an attempt is made to compose U+0041 and U+05B0. No precomposed character exists for that pair, so the attempt fails. Then an attempt is made to compose U+0041 and U+0328. U+05B0 does not block this, because its combining class (10) is lower than that of U+0328 (202), and the pair composes to U+0104. U+0041 is replaced with U+0104 and U+0328 is removed, resulting in <U+0104, U+05B0>.

    It's not a reordering per se, as the first combining character is given the first "opportunity" to combine.
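
The behaviour described above can be checked with a small sketch. It assumes Python's standard `unicodedata` module as the normaliser (nothing in this thread depends on that particular implementation), and the helper `codepoints` is a name made up for the illustration.

```python
import unicodedata

def codepoints(s):
    """Format a string as a list of U+XXXX labels (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]

# The "garbage" string from the question: A WITH OGONEK + HEBREW POINT SHEVA.
original = "\u0104\u05B0"

nfd = unicodedata.normalize("NFD", original)
nfc = unicodedata.normalize("NFC", original)

# Canonical ordering puts the lower combining class (10) before the higher (202).
print(codepoints(nfd))
print([unicodedata.combining(ch) for ch in nfd])

# Composition skips U+05B0 (no composite with U+0041), then combines U+0328.
print(codepoints(nfc))
```

If the normaliser follows the standard, the decomposed form should print as U+0041, U+05B0, U+0328 and the NFC form as U+0104, U+05B0, that is, the original string.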

    > This is not only a theoretical issue as the same applies to some real
    > combinations. There was discussion only last week on the bidi list of a
    > form which might be encoded <U+064A, U+0652, U+0654> but which would be
    > messed up if composed into <U+0626, U+0652>.

    Yes, NFC would perform that composition. Are you sure that would be a problem? As far as the bidi rules go, both forms resolve to the same types:
    <U+064A, U+0652, U+0654>
    bidi: AL, NSM, NSM
    applying rule W1 from UAX #9:
    AL, NSM, NSM -> AL, AL, NSM -> AL, AL, AL

    <U+0626, U+0652>
    bidi: AL, NSM
    applying rule W1:
    AL, NSM -> AL, AL
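
The comparison above can be reproduced with a rough sketch, again assuming Python's `unicodedata` module. The function `resolve_w1` is a hypothetical helper that applies rule W1 only, in simplified form (it ignores the rest of the bidi algorithm, including any special cases for other character types), and the `sos` default is an arbitrary assumption that plays no part in this example.

```python
import unicodedata

def resolve_w1(types, sos="L"):
    """Apply only rule W1 (simplified): each NSM takes the type of the
    previous character, or of sos when it is first in the run."""
    resolved = []
    prev = sos
    for t in types:
        if t == "NSM":
            t = prev
        resolved.append(t)
        prev = t
    return resolved

decomposed = "\u064A\u0652\u0654"                     # YEH, SUKUN, HAMZA ABOVE
composed = unicodedata.normalize("NFC", decomposed)   # what NFC makes of it

for s in (decomposed, composed):
    classes = [unicodedata.bidirectional(ch) for ch in s]
    labels = [f"U+{ord(ch):04X}" for ch in s]
    print(labels, classes, "->", resolve_w1(classes))
```

Both sequences should end up with every position resolved to AL, so W1 gives no reason to prefer one encoding over the other.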

    Or is the issue with something else, but it came up on the bidi list?
