# Re: Unicode Normalisation Optimisation Experiments

From: jon@spin.ie
Date: Thu Sep 25 2003 - 07:32:32 EDT


> Is this actually correct? For example, if I have in my data the string
> <U+0104, U+05B0> (which I know is garbage, but that is irrelevant), that
> will decompose and reorder to <U+0041, U+05B0, U+0328>, as U+0328 has a
> higher combining class (202) than U+05B0 (10). What does this become in
> NFC? Is the reordering reversed and the combination reapplied?

First an attempt is made to compose U+0041 and U+05B0. There is no composite character for that pair, so the attempt fails. Then an attempt is made to compose U+0041 and U+0328. U+05B0 does not block this, because its combining class (10) is lower than that of U+0328 (202), so the pair composes to U+0104. U+0041 is replaced with U+0104 and U+0328 is removed, resulting in <U+0104, U+05B0>.

It's not a reordering per se, as the first combining character is given the first "opportunity" to combine.
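
As a quick check of the walk-through above, here is a minimal sketch using Python's standard unicodedata module. The helper name codepoints is only for illustration, and the results depend on the Unicode data shipped with the Python build in use.

```python
import unicodedata

def codepoints(text):
    """Format a string as a list of U+XXXX labels (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

s = "\u0104\u05B0"  # the example string from the question

# Canonical decomposition plus canonical reordering.
nfd = unicodedata.normalize("NFD", s)
# Canonical combining class of each character in the decomposed form.
classes = {f"U+{ord(ch):04X}": unicodedata.combining(ch) for ch in nfd}
# Decomposition followed by canonical composition.
nfc = unicodedata.normalize("NFC", s)

print(codepoints(nfd))  # ['U+0041', 'U+05B0', 'U+0328']
print(classes)          # {'U+0041': 0, 'U+05B0': 10, 'U+0328': 202}
print(codepoints(nfc))  # ['U+0104', 'U+05B0']
```

The NFC result keeps U+05B0 after the composed U+0104, which matches the description: U+0328 combines with the base even though U+05B0 sits between them in the decomposed form.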

> This is not only a theoretical issue as the same applies to some real
> combinations. There was discussion only last week on the bidi list of a
> form which might be encoded <U+064A, U+0652, U+0654> but which would be
> messed up if composed into <U+0626, U+0652>.

Yes, NFC would perform that composition. Are you sure it would be an issue? As far as the bidi rules go, both forms resolve to the same types:

<U+064A, U+0652, U+0654>
bidi: AL, NSM, NSM
applying rule W1 from UAX #9:
AL, NSM, NSM -> AL, AL, NSM -> AL, AL, AL

<U+0626, U+0652>
bidi: AL, NSM
applying rule W1:
AL, NSM -> AL, AL
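
The same comparison can be sketched in Python. This is a simplified rendering of rule W1 only (each NSM takes the type of the preceding character), not a full bidi implementation: it ignores the isolate handling found in later versions of UAX #9, and the function name resolve_w1 and its sos parameter are assumptions made for this illustration.

```python
import unicodedata

def resolve_w1(text, sos="R"):
    """Simplified W1: each NSM takes the type of the previous character.

    `sos` stands in for the start-of-run type, used only when an NSM
    appears at the very start of the text (an assumption of this sketch).
    """
    prev = sos
    resolved = []
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "NSM":
            bidi_class = prev
        resolved.append(bidi_class)
        prev = bidi_class
    return resolved

decomposed = "\u064A\u0652\u0654"
composed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(ch):04X}" for ch in composed])  # ['U+0626', 'U+0652']
print(resolve_w1(decomposed))                   # ['AL', 'AL', 'AL']
print(resolve_w1(composed))                     # ['AL', 'AL']
```

Both sequences end up as runs of AL after W1, so at this stage of the algorithm the composed and uncomposed forms behave alike.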

Or is the issue with something else, but it came up on the bidi list?
