Back to the subject: Folding algorithm and canonical equivalence

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Jul 19 2004 - 15:21:51 CDT

Next message: Mark Davis: "Re: Back to the subject: Folding algorithm and canonical equivalence"

Previous message: Jony Rosenne: "RE: Folding algorithm and canonical equivalence"
In reply to: Peter Kirk: "Folding algorithm and canonical equivalence"
Next in thread: Mark Davis: "Re: Back to the subject: Folding algorithm and canonical equivalence"
Reply: Mark Davis: "Re: Back to the subject: Folding algorithm and canonical equivalence"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

There has been extensive discussion in this thread on the specifics of
accent and diacritic folding. But no one has answered my point, repeated
below, that there seems to be a conflict between the folding algorithm
(rather than the details of specific foldings) and the principle of
canonical equivalence. Specifically, it seems to breach the principle in
Unicode Conformance Clause C9:

> Ideally, an implementation would always interpret two
> canonical-equivalent character
> sequences identically. There are practical circumstances under which
> implementations
> may reasonably distinguish them.

Are the authors of UTR #30 claiming that folding is one of those
practical circumstances, or is this just an oversight?

Peter Kirk

On 17/07/2004 23:25, Peter Kirk wrote:

> I was just reviewing the UTR #30 draft in response to Rick's notice
> about it. And I believe I may have found a point in which the folding
> algorithm as given may violate the principle of canonical equivalence.
> But I would like some clarification from list members before providing
> formal input on this point.
>
> Consider a sequence made up of a base character B and two combining
> marks M1 and M2, in which the combining class of M1 is less than that
> of M2. <B, M1, M2> and <B, M2, M1> are canonically equivalent
> representations of the same sequence, but only the former is in
> canonical order. Suppose that a folding is defined including the
> operation <B, M2> -> X, but no other relevant operations. When this
> folding is applied, according to the folding algorithms defined in
> sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a) the
> sequence <B, M2, M1> will be folded to <X, M1> and will not be further
> changed, but the sequence <B, M1, M2> will not be changed at all by
> the folding because the sequence <B, M2> will never be found. (By
> contrast, a folding operation <B, M1> -> Y will be applied to both
> sequences, because the canonical decomposition step converts <B, M2,
> M1> to <B, M1, M2> and the folding operation is re-applied and finds a
> match the second time.) The implication is that folding of two
> canonically equivalent strings gives different (and not canonically
> equivalent) results.
>
> This is not a purely theoretical point. The Diacritic Folding as
> specified in
> http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt
> includes operations like 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET,
> but no general rule to delete DAGESH (or any other combining marks; I
> think there needs to be such a rule, and I have already posted a
> formal response saying that). Sequences like <BET, DAGESH, PATAH> are
> very common in Hebrew text, and commonly written in this order which
> is logically correct and preferred by current rendering technologies,
> but the canonical order is in fact <BET, PATAH, DAGESH>; thus both
> sequences will be found in data depending on whether or not it has
> been normalised. The effect of applying Diacritic Folding exactly as
> specified is that <BET, DAGESH, PATAH> is folded to <BET, PATAH>, but
> the canonically equivalent <BET, PATAH, DAGESH> is unchanged. (In fact
> I consider that both should be folded to just BET, but that is not
> what the current data file specifies.)
>
> I hope I have not totally misunderstood the folding algorithm here.
> But it seems to me that what is missing in the algorithm is an initial
> step of normalising the data. The introductory text to section 4 seems
> to suggest that this has been avoided because folding may need to
> preserve the distinction between NFC and NFD data - although the
> algorithm as presented does not in fact do this. Since in practice the
> input data is not necessarily in either NFC or NFD and there is no
> easy way to detect which is being used, the only meaningful approach
> is for the user of the folding to specify whether the output of the
> folding should be NFC or NFD.
>
> Of course there might be a real requirement for a folding which, for
> example, removes DAGESH when combined with BET (but not with other
> base characters) irrespective of what other combining marks might
> intervene. But such foldings would need a considerably more powerful
> folding algorithm.
>

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

Next message: Mark Davis: "Re: Back to the subject: Folding algorithm and canonical equivalence"
Previous message: Jony Rosenne: "RE: Folding algorithm and canonical equivalence"
In reply to: Peter Kirk: "Folding algorithm and canonical equivalence"
Next in thread: Mark Davis: "Re: Back to the subject: Folding algorithm and canonical equivalence"
Reply: Mark Davis: "Re: Back to the subject: Folding algorithm and canonical equivalence"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Mon Jul 19 2004 - 15:23:58 CDT