Re: Back to the subject: Folding algorithm and canonical equivalence

From: Peter Kirk
Date: Mon Jul 19 2004 - 17:43:42 CDT


    On 19/07/2004 23:23, Asmus Freytag wrote:

    > At 01:56 PM 7/19/2004, Mark Davis wrote:
    >> You did point out an oversight; Asmus and I have been working on the
    >> issue.
    >> Mark
    > As Mark wrote, your point is taken and we've taken that on board.
    > However, we won't try to *edit* text on the list; that's why we are
    > not engaging in a long discussion on the details (and we've discovered
    > many interesting ones, wait for the next version of the text).
    > In my replies I tend to focus on issues for which I need more
    > information.

    Fair enough. I just wondered if I needed to raise this one as a formal
    feedback issue. From what you say here, I assume not.

    > A./
    > PS: Just one final comment:
    >>> Ideally, an implementation would always interpret two
    >>> canonical-equivalent character
    >>> sequences identically. There are practical circumstances under which
    >>> implementations
    >>> may reasonably distinguish them.
    >> Are the authors of UTR #30 claiming that folding is one of those
    >> practical circumstances, or is this just an oversight?
    > As it turns out, and not surprisingly, realizing that ideal for any
    > arbitrary type of possible folding rule can get complicated (again, I
    > won't go into details right now). There may be situations where an
    > optimization would break canonical equivalence in the face of
    > permissible, but unusual, if not to say 'nonsensical' input. That's
    > what's meant by 'practical circumstances'.
    > If the ability to 'correctly' handle combining sequences that are a
    > random mixture of Khmer and Arabic combining marks were to result in
    > severe runtime penalties, would you rather have a 'correct' or a fast
    > implementation?

    Again, fair enough. But I would be surprised if this were a real issue
    with the folding algorithm. Since decomposition, presumably to NFD, is
    required anyway after the first folding pass, I would expect little or
    no performance hit from normalising the text to NFD before the first
    folding pass.
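
    To illustrate the point about normalising first, here is a minimal
    sketch in Python. It is not the UTR #30 algorithm: strip_marks and
    fold_diacritics are hypothetical helpers, and "remove every character
    of general category Mn" is an assumed stand-in for a diacritic removal
    folding. The Hebrew pair is only an illustrative example of two
    canonically equivalent mark orders.

```python
import unicodedata


def strip_marks(text):
    """Hypothetical folding: drop nonspacing marks (category Mn) as-is,
    without normalising the input first."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def fold_diacritics(text):
    """Same folding, but applied only after normalising the input to NFD."""
    return strip_marks(unicodedata.normalize("NFD", text))


# Two canonically equivalent spellings of e-acute.
composed = "\u00e9"       # precomposed LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # e + COMBINING ACUTE ACCENT
assert unicodedata.normalize("NFD", composed) == decomposed

# Without prior normalisation the folding tells the two spellings apart.
naive_agrees = strip_marks(composed) == strip_marks(decomposed)

# With NFD applied first, both spellings fold to the same result.
nfd_agrees = fold_diacritics(composed) == fold_diacritics(decomposed)

# Hebrew shin with shin dot and dagesh, typed in two different orders.
# The marks have different combining classes, so NFD puts them in one order.
shin_a = "\u05e9\u05c1\u05bc"
shin_b = "\u05e9\u05bc\u05c1"
hebrew_same_nfd = (
    unicodedata.normalize("NFD", shin_a) == unicodedata.normalize("NFD", shin_b)
)

print(naive_agrees, nfd_agrees, hebrew_same_nfd)
```

    A folding rule that matches specific mark sequences, rather than
    single characters, would see the two Hebrew spellings as different
    unless the input had been put into a canonical order beforehand.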

    > Nobody argues that sequences that are expected to occur in realistic
    > data, including specialized texts, definitely should be handled as
    > expected, even where practicalities require some optimizations.

    Yes, but I did make the point that the issue I brought up is not a
    purely theoretical one, but a very real one for Hebrew with the
    diacritic removal folding as defined.

    > So, we are all agreed.

    Peter Kirk

    This archive was generated by hypermail 2.1.5 : Mon Jul 19 2004 - 17:44:51 CDT