Re: Folding algorithm and canonical equivalence

From: Asmus Freytag
Date: Sun Jul 18 2004 - 16:15:49 CDT

    At 05:25 AM 7/18/2004, Peter Kirk wrote:
    >I accept that there might be some script-specific cases in which
    >particular accents should not be removed. The breve in Cyrillic i kratkoe
    >might be an example; but then this might be rather too language-specific
    >as well. But these should be clearly defined and justified exceptions,
    >rather than their possible existence being a reason to restrict the
    >general applicability of accent and diacritic folding.

    I was thinking rather more of Khmer, where some characters that are
    considered letters are given gc=Mn. In that case, folding would be very

    So the answer has to be to limit the removal of diacritical marks in
    AccentFolding to those that are truly *accents*. That's a subset of gc=Mn.
    There are two options for a starting set:

    Select all 'accents' (note, not base forms) that occur in some precomposed
    character, and then add additional ones on a case-by-case basis (e.g.
    stroke overlay).

    Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter will be
    part of 4.1), and make some principled additions / deletions.

    All script-specific non-spacing marks for Indic scripts etc. should not be
    part of 'AccentFolding', in my opinion.
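
    As a rough illustration of the second option, here is a minimal Python
    sketch. It assumes the starting set is simply every gc=Mn character in the
    0300 and 1DC0 blocks, with no principled additions or deletions applied
    yet. The names ACCENTS and accent_fold are hypothetical, and the result
    depends on the Unicode version shipped with the Python runtime.

```python
import unicodedata

# Assumed starting set: every gc=Mn character in the Combining Diacritical
# Marks block (U+0300..U+036F) and its Supplement (U+1DC0..U+1DFF).
# Unassigned code points have gc=Cn and are skipped by the category test.
ACCENTS = frozenset(
    chr(cp)
    for block in (range(0x0300, 0x0370), range(0x1DC0, 0x1E00))
    for cp in block
    if unicodedata.category(chr(cp)) == "Mn"
)

def accent_fold(text: str) -> str:
    """Decompose, drop marks from the accent set, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in ACCENTS)
    return unicodedata.normalize("NFC", kept)

# A Khmer vowel sign such as U+17B7 is gc=Mn but lies outside the set,
# so it survives the folding.
print(accent_fold("caf\u00e9"))                          # Latin e with acute
print(accent_fold("\u1780\u17b7") == "\u1780\u17b7")     # Khmer is untouched
```

    Note that this unrefined set also folds Cyrillic i kratkoe (U+0439) to
    U+0438, which is exactly the kind of case a principled deletion would
    have to address.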

    >.. when I look more closely at AccentFolding as defined I see a problem
    >with it. It is specified as affecting only "Latin/Greek/Cyrillic
    >characters with canonical decomposition". But this is inadequate because
    >there are many cases of Latin/Greek/Cyrillic characters (and most cases of
    >Hebrew ones) where an accent should be removed even though there is no
    >precomposed form encoded and so canonical decomposition

    Correct. Whatever the set of combining marks is, we then need to define a
    set of base characters. We could simply use sc=Latin + sc=Greek +
    sc=Cyrillic as a starting set, to treat all accented characters equally.
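
    A sketch of how a base-character restriction might look, in Python. The
    standard unicodedata module does not expose the Script property, so the
    hypothetical helper is_lgc_base below approximates sc=Latin/Greek/Cyrillic
    by character-name prefix; that is an assumption made for illustration, not
    a substitute for the real property data. The accent set is the unrefined
    0300 block.

```python
import unicodedata

# Assumed accent set: the whole Combining Diacritical Marks block.
ACCENTS = frozenset(chr(cp) for cp in range(0x0300, 0x0370))

def is_lgc_base(ch: str) -> bool:
    """Approximate sc=Latin/Greek/Cyrillic by name prefix (an assumption)."""
    name = unicodedata.name(ch, "")
    return name.startswith(("LATIN ", "GREEK ", "CYRILLIC "))

def accent_fold_lgc(text: str) -> str:
    """Drop accents only when they follow a Latin/Greek/Cyrillic base."""
    out = []
    base_ok = False
    for ch in unicodedata.normalize("NFD", text):
        if not unicodedata.category(ch).startswith("M"):
            # A non-mark character starts a new combining sequence.
            base_ok = is_lgc_base(ch)
            out.append(ch)
        elif base_ok and ch in ACCENTS:
            continue  # accent on an eligible base: fold it away
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))

# "x" + COMBINING DOT BELOW has no precomposed form, yet it still folds,
# because the test is on the base character, not on canonical decomposition.
print(accent_fold_lgc("x\u0323"))
```

    Working on the decomposed sequence treats precomposed and combining-
    sequence spellings alike, so the result does not depend on whether a
    precomposed form happens to be encoded.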

    What about other scripts?

    If you feel that Hebrew folding to unpointed is something that should
    happen every time other accents are folded, we can add Hebrew (or we can
    make a separate folding, HebrewMarksFolding,
    that people can invoke optionally). I tend to prefer the latter. Since for
    Hebrew (the languages), a folding to unpointed might be one of the foldings
    that someone might want to apply permanently, it should be separately named
    and defined, on the principle that the foldings should be building blocks.
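
    To make the building-block idea concrete, here is a small Python sketch in
    which each folding is a separate function and callers chain the ones they
    want. All names are hypothetical. The Hebrew mark set is assumed to be
    every gc=Mn character in U+0591..U+05C7, which covers cantillation marks as
    well as points; whether 'unpointed' should include both is left open here.

```python
import unicodedata

def mn_in_range(first: int, last: int) -> frozenset:
    """All gc=Mn characters between first and last, inclusive."""
    return frozenset(
        chr(cp) for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) == "Mn"
    )

def make_mark_folding(marks):
    """Build a folding that removes the given marks from decomposed text."""
    def fold(text: str) -> str:
        decomposed = unicodedata.normalize("NFD", text)
        kept = "".join(ch for ch in decomposed if ch not in marks)
        return unicodedata.normalize("NFC", kept)
    return fold

def compose(*foldings):
    """Chain foldings left to right into a single folding."""
    def apply(text: str) -> str:
        for fold in foldings:
            text = fold(text)
        return text
    return apply

# Two independent building blocks (both mark sets are assumptions).
accent_folding = make_mark_folding(mn_in_range(0x0300, 0x036F))
hebrew_marks_folding = make_mark_folding(mn_in_range(0x0591, 0x05C7))

# Callers opt in to the Hebrew folding only when they want it.
fold_both = compose(accent_folding, hebrew_marks_folding)
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"  # pointed "shalom"
print(fold_both("caf\u00e9 " + pointed))
```

    Keeping the two foldings separate means the Hebrew one can be applied on
    its own, for instance when unpointed text is wanted permanently, without
    touching Latin, Greek, or Cyrillic accents.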

