Re: BOM as WJ?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 19 2003 - 19:26:03 EST

    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    > So, <NBSP,CC> must not be treated as if it was:
    > <WJ,SP,WJ,CC>
    > but really rather as:
    > <WJ,SP,CC,WJ>
    > Note here the inversion.

    The inversion here acts as if WJ were a combining character with combining
    class 256 (i.e. a class higher than that of every other "Mn" combining
    character) and canonical reordering had been applied to the sequence.

    Of course, this is not a standard normalization form, but using this pseudo
    combining class may help renderers display the last two coded sequences (in
    my quote above) identically.
    This works even when there are multiple diacritics (denoted CC1 and CC2
    below):
        <NBSP,CC1,CC2>
    is then treated as if it was:
        <WJ,SP,WJ,CC1,CC2>
    and the pseudo-normalization then gives:
        <WJ,SP,CC1,CC2,WJ>
    or:
        <WJ,SP,CC2,CC1,WJ>
    (depending on the canonical reordering of CC1 and CC2, i.e. on their
    relative combining classes).
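
    The pseudo-normalization described above could be sketched roughly as
    follows. This is only an illustration under stated assumptions, not an
    existing API: the names expand_nbsp, pseudo_class, pseudo_reorder and
    pseudo_normalize are hypothetical, and the value 256 for WJ is the pseudo
    combining class proposed in this message, not a real Unicode property
    (the real class of U+2060 is 0). Real classes are read with Python's
    unicodedata.combining().

```python
import unicodedata

NBSP = "\u00A0"  # NO-BREAK SPACE
SP = "\u0020"    # SPACE
WJ = "\u2060"    # WORD JOINER

# Assumption from this message: WJ is treated as if it had a combining
# class higher than any real one. This is NOT a real Unicode property value.
PSEUDO_CLASS_WJ = 256


def pseudo_class(ch):
    """Real canonical combining class, except that WJ gets the pseudo class."""
    return PSEUDO_CLASS_WJ if ch == WJ else unicodedata.combining(ch)


def expand_nbsp(text):
    """Treat each NBSP as if it were <WJ, SP, WJ>."""
    return text.replace(NBSP, WJ + SP + WJ)


def pseudo_reorder(text):
    """Canonical-style reordering: stable-sort each run of characters with a
    non-zero (pseudo) class by that class; class-0 characters stay in place."""
    out = []
    run = []
    for ch in text:
        if pseudo_class(ch) == 0:
            out.extend(sorted(run, key=pseudo_class))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=pseudo_class))
    return "".join(out)


def pseudo_normalize(text):
    return pseudo_reorder(expand_nbsp(text))


if __name__ == "__main__":
    acute = "\u0301"      # COMBINING ACUTE ACCENT, class 230
    dot_below = "\u0323"  # COMBINING DOT BELOW, class 220
    result = pseudo_normalize(NBSP + acute + dot_below)
    print([f"U+{ord(c):04X}" for c in result])
```

    With CC1 = U+0301 (class 230) and CC2 = U+0323 (class 220), the
    expansion <WJ,SP,WJ,CC1,CC2> is reordered to <WJ,SP,CC2,CC1,WJ>: the
    leading WJ stays put because it precedes the class-0 SP, while the
    second WJ sorts after both diacritics.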



    This archive was generated by hypermail 2.1.5 : Wed Nov 19 2003 - 20:10:10 EST