Back to the subject: Folding algorithm and canonical equivalence

From: Peter Kirk (
Date: Mon Jul 19 2004 - 15:21:51 CDT

  • Next message: Mark Davis: "Re: Back to the subject: Folding algorithm and canonical equivalence"

    There has been extensive discussion in this thread on the specifics of
    accent and diacritic folding. But no one has answered my point, repeated
    below, that there seems to be a conflict between the folding algorithm
    (rather than the details of specific foldings) and the principle of
    canonical equivalence. Specifically, it seems to breach the principle in
    Unicode Conformance Clause C9:

    > Ideally, an implementation would always interpret two
    > canonical-equivalent character
    > sequences identically. There are practical circumstances under which
    > implementations
    > may reasonably distinguish them.

    Are the authors of UTR #30 claiming that folding is one of those
    practical circumstances, or is this just an oversight?

    Peter Kirk

    On 17/07/2004 23:25, Peter Kirk wrote:

    > I was just reviewing the UTR #30 draft in response to Rick's notice
    > about it. And I believe I may have found a point in which the folding
    > algorithm as given may violate the principle of canonical equivalence.
    > But I would like some clarification from list members before providing
    > formal input on this point.
    > Consider a sequence made up of a base character B and two combining
    > marks M1 and M2, in which the combining class of M1 is less than that
    > of M2. <B, M1, M2> and <B, M2, M1> are canonically equivalent
    > representations of the same sequence, but only the former is in
    > canonical order. Suppose that a folding is defined including the
    > operation <B, M2> -> X, but no other relevant operations. When this
    > folding is applied, according to the folding algorithms defined in
    > sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a) the
    > sequence <B, M2, M1> will be folded to <X, M1> and will not be further
    > changed, but the sequence <B, M1, M2> will not be changed at all by
    > the folding because the sequence <B, M2> will never be found. (By
    > contrast, a folding operation <B, M1> -> Y will be applied to both
    > sequences, because the canonical decomposition step converts <B, M2,
    > M1> to <B, M1, M2> and the folding operation is re-applied and finds a
    > match the second time.) The implication is that folding of two
    > canonically equivalent strings gives different (and not canonically
    > equivalent) results.
    > This is not a purely theoretical point. The Diacritic Folding as
    > specified in
    > includes operations like 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET,
    > but no general rule to delete DAGESH (or any other combining marks; I
    > think there needs to be such a rule, and I have already posted a
    > formal response saying that). Sequences like <BET, DAGESH, PATAH> are
    > very common in Hebrew text, and commonly written in this order which
    > is logically correct and preferred by current rendering technologies,
    > but the canonical order is in fact <BET, PATAH, DAGESH>; thus both
    > sequences will be found in data depending on whether or not it has
    > been normalised. The effect of applying Diacritic Folding exactly as
    > specified is that <BET, DAGESH, PATAH> is folded to <BET, PATAH>, but
    > the canonically equivalent <BET, PATAH, DAGESH> is unchanged. (In fact
    > I consider that both should be folded to just BET, but that is not
    > what the current data file specifies.)
    > I hope I have not totally misunderstood the folding algorithm here.
    > But it seems to me that what is missing in the algorithm is an initial
    > step of normalising the data. The introductory text to section 4 seems
    > to suggest that this has been avoided because folding may need to
    > preserve the distinction between NFC and NFD data - although the
    > algorithm as presented does not in fact do this. Since in practice the
    > input data is not necessarily in either NFC or NFD and there is no
    > easy way to detect which is being used, the only meaningful approach
    > is for the user of the folding to specify whether the output of the
    > folding should be NFC or NFD.
    > Of course there might be a real requirement for a folding which, for
    > example, removes DAGESH when combined with BET (but not with other
    > base characters) irrespective of what other combining marks might
    > intervene. But such foldings would need a considerably more powerful
    > folding algorithm.

    Peter Kirk (personal) (work)

    This archive was generated by hypermail 2.1.5 : Mon Jul 19 2004 - 15:23:58 CDT