Re: Folding algorithm and canonical equivalence

From: Peter Kirk (
Date: Sun Jul 18 2004 - 17:44:43 CDT

    On 18/07/2004 22:15, Asmus Freytag wrote:

    > At 05:25 AM 7/18/2004, Peter Kirk wrote:
    >> I accept that there might be some script-specific cases in which
    >> particular accents should not be removed. The breve in Cyrillic i
    >> kratkoe might be an example; but then this might be rather too
    >> language-specific as well. But these should be clearly defined and
    >> justified exceptions, rather than their possible existence being a
    >> reason to restrict the general applicability of accent and diacritic
    >> folding.
    > I was thinking rather more of Khmer, where some characters that are
    > considered letters are given gc=Mn. In that case, folding would be
    > very inappropriate.
    > So the answer has to be to limit the removal of diacritical marks in
    > AccentFolding to those that are truly *accents*. That's a subset of
    > gc=Mn. There are two options for a starting set:
    > select all 'accents' (note, not baseforms) that occur in some
    > precomposed character. And then add additional ones on a case by case
    > basis (e.g. stroke overlay).
    > Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter
    > will be part of 4.1), and make some principled additions / deletions.

    This sounds good to me. Among the additions should be all Hebrew
    combining marks unless this is done separately.
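
As a rough illustration of the second option quoted above (start from gc=Mn in the 0300 and 1DC0 blocks), here is a minimal Python sketch. The names ACCENT_BLOCKS, is_accent and accent_fold are invented for this example and come from no specification; the ranges are only the proposed starting set, before any principled additions or deletions.

```python
import unicodedata

# Assumed starting set: gc=Mn characters in the Combining Diacritical Marks
# block (U+0300..U+036F) and its Supplement (U+1DC0..U+1DFF).
ACCENT_BLOCKS = ((0x0300, 0x036F), (0x1DC0, 0x1DFF))


def is_accent(ch):
    """True if ch is a nonspacing mark inside the assumed accent blocks."""
    cp = ord(ch)
    in_block = any(lo <= cp <= hi for lo, hi in ACCENT_BLOCKS)
    return in_block and unicodedata.category(ch) == "Mn"


def accent_fold(text):
    """Decompose, drop the marks counted as accents, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not is_accent(ch))
    return unicodedata.normalize("NFC", kept)
```

Because the filter works on the decomposed text, an accent is removed whether or not a precomposed form is encoded. Script-specific marks outside the two blocks, such as Khmer vowel signs or Hebrew points, pass through untouched. Note that this starting set also folds Cyrillic i kratkoe (U+0439) to U+0438, which is exactly the kind of case that would need a defined exception.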

    > All script-specific non-spacing marks for Indic scripts, etc., should
    > not be part of 'AccentFolding', in my opinion.
    >> .. when I look more closely at AccentFolding as defined I see a
    >> problem with it. It is specified as affecting only
    >> "Latin/Greek/Cyrillic characters with canonical decomposition". But
    >> this is inadequate because there are many cases of
    >> Latin/Greek/Cyrillic characters (and most cases of Hebrew ones) where
    >> an accent should be removed even though there is no precomposed form
    >> encoded and so canonical decomposition
    > Correct. Whatever the set of combining marks is, we then need to
    > define a set of base characters. We could simply use sc=Latin +
    > sc=Greek + sc=Cyrillic as a starting set, to treat all accented
    > characters equally.
    > What about other scripts:
    > If you feel that Hebrew folding to unpointed is something that should
    > happen every time other accents are folded, we can add Hebrew (or we
    > can make a separate folding, HebrewMarksFolding, that people can
    > invoke optionally). I tend to prefer the latter. Since for Hebrew
    > (the languages), a folding to unpointed might be one of the foldings
    > that someone might want to apply permanently, it should be separately
    > named and defined, on the principle that the foldings should be
    > building blocks.

    Agreed that it should be separate, but I would also see it as included
    as a subset within the regular accent folding.
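
The "building blocks" idea can be sketched in Python as below: each folding pairs a set of base scripts with a set of removable marks, and a combined folding is just the composition of the separately named ones. All names here (script_hint, make_mark_folding, compose and the rest) are hypothetical. The standard library does not expose the Script property, so the first word of the character name stands in for sc= as a loud approximation, and the mark ranges are assumptions, not agreed definitions.

```python
import unicodedata


def script_hint(ch):
    """Approximate the script by the first word of the character name."""
    return unicodedata.name(ch, "").split(" ")[0]


def make_mark_folding(base_scripts, is_mark):
    """Build a folding that drops is_mark() marks after bases in base_scripts."""
    def fold(text):
        out = []
        base_ok = False
        for ch in unicodedata.normalize("NFD", text):
            if unicodedata.category(ch).startswith("M"):
                if base_ok and is_mark(ch):
                    continue  # drop this mark
            else:
                # A new base character decides whether following marks fold.
                base_ok = script_hint(ch) in base_scripts
            out.append(ch)
        return unicodedata.normalize("NFC", "".join(out))
    return fold


def is_generic_accent(ch):
    # Assumed set: gc=Mn in the Combining Diacritical Marks block.
    return unicodedata.category(ch) == "Mn" and 0x0300 <= ord(ch) <= 0x036F


def is_hebrew_mark(ch):
    # Assumed set: gc=Mn in U+0591..U+05C7 (Hebrew accents and points).
    return unicodedata.category(ch) == "Mn" and 0x0591 <= ord(ch) <= 0x05C7


def compose(*foldings):
    """Apply the given foldings one after another."""
    def fold(text):
        for f in foldings:
            text = f(text)
        return text
    return fold


accent_folding = make_mark_folding({"LATIN", "GREEK", "CYRILLIC"}, is_generic_accent)
hebrew_marks_folding = make_mark_folding({"HEBREW"}, is_hebrew_mark)
# Hebrew folding as a subset of a broader accent folding.
accent_folding_with_hebrew = compose(accent_folding, hebrew_marks_folding)
```

With this shape, hebrew_marks_folding can be invoked on its own (for example to produce unpointed text permanently), while accent_folding_with_hebrew includes it as a subset of the broader folding. Which of the two the name AccentFolding should refer to is the open question in this thread.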

    Peter Kirk (personal) (work)

    This archive was generated by hypermail 2.1.5 : Sun Jul 18 2004 - 17:45:24 CDT