Re: Back to the subject: Folding algorithm and canonical equivalence

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Jul 19 2004 - 15:56:10 CDT

  • Next message: fantasai: "Resolving Weak and Neutral Types [BIDI]"

    You did point out an oversight; Asmus and I have been working on the issue.

    ‚ÄéMark

    ----- Original Message -----
    From: "Peter Kirk" <peterkirk@qaya.org>
    To: "Unicode List" <unicode@unicode.org>
    Sent: Monday, July 19, 2004 13:21
    Subject: Back to the subject: Folding algorithm and canonical equivalence

    > There has been extensive discussion in this thread on the specifics of
    > accent and diacritic folding. But no one has answered my point, repeated
    > below, that there seems to be a conflict between the folding algorithm
    > (rather than the details of specific foldings) and the principle of
    > canonical equivalence. Specifically, it seems to breach the principle in
    > Unicode Conformance Clause C9:
    >
    > > Ideally, an implementation would always interpret two
    > > canonical-equivalent character
    > > sequences identically. There are practical circumstances under which
    > > implementations
    > > may reasonably distinguish them.
    >
    > Are the authors of UTR #30 claiming that folding is one of those
    > practical circumstances, or is this just an oversight?
    >
    > Peter Kirk
    >
    > On 17/07/2004 23:25, Peter Kirk wrote:
    >
    > > I was just reviewing the UTR #30 draft in response to Rick's notice
    > > about it. And I believe I may have found a point in which the folding
    > > algorithm as given may violate the principle of canonical equivalence.
    > > But I would like some clarification from list members before providing
    > > formal input on this point.
    > >
    > > Consider a sequence made up of a base character B and two combining
    > > marks M1 and M2, in which the combining class of M1 is less than that
    > > of M2. <B, M1, M2> and <B, M2, M1> are canonically equivalent
    > > representations of the same sequence, but only the former is in
    > > canonical order. Suppose that a folding is defined including the
    > > operation <B, M2> -> X, but no other relevant operations. When this
    > > folding is applied, according to the folding algorithms defined in
    > > sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a) the
    > > sequence <B, M2, M1> will be folded to <X, M1> and will not be further
    > > changed, but the sequence <B, M1, M2> will not be changed at all by
    > > the folding because the sequence <B, M2> will never be found. (By
    > > contrast, a folding operation <B, M1> -> Y will be applied to both
    > > sequences, because the canonical decomposition step converts <B, M2,
    > > M1> to <B, M1, M2> and the folding operation is re-applied and finds a
    > > match the second time.) The implication is that folding of two
    > > canonically equivalent strings gives different (and not canonically
    > > equivalent) results.
    > >
    > > This is not a purely theoretical point. The Diacritic Folding as
    > > specified in
    > > http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt
    > > includes operations like 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET,
    > > but no general rule to delete DAGESH (or any other combining marks; I
    > > think there needs to be such a rule, and I have already posted a
    > > formal response saying that). Sequences like <BET, DAGESH, PATAH> are
    > > very common in Hebrew text, and commonly written in this order which
    > > is logically correct and preferred by current rendering technologies,
    > > but the canonical order is in fact <BET, PATAH, DAGESH>; thus both
    > > sequences will be found in data depending on whether or not it has
    > > been normalised. The effect of applying Diacritic Folding exactly as
    > > specified is that <BET, DAGESH, PATAH> is folded to <BET, PATAH>, but
    > > the canonically equivalent <BET, PATAH, DAGESH> is unchanged. (In fact
    > > I consider that both should be folded to just BET, but that is not
    > > what the current data file specifies.)
    > >
    > > I hope I have not totally misunderstood the folding algorithm here.
    > > But it seems to me that what is missing in the algorithm is an initial
    > > step of normalising the data. The introductory text to section 4 seems
    > > to suggest that this has been avoided because folding may need to
    > > preserve the distinction between NFC and NFD data - although the
    > > algorithm as presented does not in fact do this. Since in practice the
    > > input data is not necessarily in either NFC or NFD and there is no
    > > easy way to detect which is being used, the only meaningful approach
    > > is for the user of the folding to specify whether the output of the
    > > folding should be NFC or NFD.
    > >
    > > Of course there might be a real requirement for a folding which, for
    > > example, removes DAGESH when combined with BET (but not with other
    > > base characters) irrespective of what other combining marks might
    > > intervene. But such foldings would need a considerably more powerful
    > > folding algorithm.
    > >
    >
    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Mon Jul 19 2004 - 15:57:26 CDT