Re: Folding algorithm and canonical equivalence

From: Peter Kirk
Date: Sun Jul 18 2004 - 07:25:33 CDT

    On 18/07/2004 08:52, Asmus Freytag wrote:

    > At 11:15 PM 7/17/2004, John Cowan wrote:
    >> I agree that in the TR#30 context, the Right Thing is to remove the
    >> character pair mappings altogether, and all of the single-character
    >> mappings that have canonical decompositions
    > In other words, in your opinion, the reasonable thing to do would be
    > for someone to do the AccentFolding as defined in the TR, and then do
    > a DiacriticFolding, to fold the cases where even in NFD accents don't
    > exist as separate characters.

    This is not quite what I had in mind, but only because when I look more
    closely at AccentFolding as defined I see a problem with it. It is
    specified as affecting only "Latin/Greek/Cyrillic characters with
    canonical decomposition". But this is inadequate because there are many
    cases of Latin/Greek/Cyrillic characters (and most cases of Hebrew ones)
    where an accent should be removed even though there is no precomposed
    form encoded and so no canonical decomposition. This definition needs to be
    extended to deletion of all accents, i.e. probably all non-spacing
    combining marks, regardless of whether there is a canonical
    decomposition, at least when the base character is
    Latin/Greek/Cyrillic/Hebrew (and probably also at least Arabic and
    Syriac, in which combining marks function much as in Hebrew).
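A minimal Python sketch of the extended folding described above, offered as an illustration rather than as the TR#30 definition. It assumes that "accent" means any character with General_Category Mn, and that the script of a base character can be approximated from the first word of its Unicode name (the standard `unicodedata` module has no Script property). The names `FOLDED_SCRIPTS` and `extended_accent_fold` are hypothetical.

```python
import unicodedata

# Scripts named in the message. Matching on the character-name prefix is an
# assumption of this sketch, not a real Script property lookup.
FOLDED_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC", "HEBREW", "ARABIC", "SYRIAC")


def is_folded_base(ch):
    """True if the base character appears to belong to a folded script."""
    return unicodedata.name(ch, "").startswith(FOLDED_SCRIPTS)


def extended_accent_fold(text):
    """Drop every non-spacing mark (gc=Mn) that follows a base character in
    one of the folded scripts, whether or not a precomposed form exists."""
    out = []
    strip = False  # marks before any base character are left alone
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            if not strip:
                out.append(ch)
        else:
            strip = is_folded_base(ch)
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))
```

Because the text is decomposed first, a precomposed e-acute and a q followed by a combining acute are treated alike, and Hebrew points are removed even though pointed Hebrew letters are not canonically decomposable in normalized text. Marks on base characters from other scripts pass through unchanged.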

    Such an extended AccentFolding would then function as a good base for a
    broader DiacriticFolding.

    > That's certainly reasonable and not the only case where it's
    > interesting to have chained foldings.
    > Jony is arguing to extend AccentFolding to Hebrew (fold to unpointed).
    > His suggestion is to fold *all* combining marks used with Hebrew in
    > that case. I want to double check that he really means all combining
    > marks in the Hebrew block, or just some of them.
    > AccentFolding can't just fold all gc=Mn, since that would include
    > quite a few that are script specific as well as the marks for Symbols,
    > for which different folding rules might need to apply in some context.
    > So I think I'll use as the set of accents to remove all the ones that
    > show up as part of decompositions, ...

    This restriction will end up with some ridiculous results if applied to
    a language in which only some of the regular letters are supported as
    precomposed forms: the identical mark will be stripped from some base
    characters but not from others.
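The inconsistency can be shown with a short Python sketch. It models the restriction as this message reads it, namely that only characters with a canonical decomposition are folded, and compares it with stripping every gc=Mn mark. Both function names are hypothetical, and neither is claimed to be the algorithm in the TR#30 draft.

```python
import unicodedata


def fold_precomposed_only(text):
    """Strip marks only from characters that have a canonical decomposition."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:
            # Precomposed character: keep the non-mark parts of its decomposition.
            out.append("".join(c for c in decomposed
                               if unicodedata.category(c) != "Mn"))
        else:
            # No decomposition: the character, or a free-standing mark, is kept.
            out.append(ch)
    return "".join(out)


def fold_all_marks(text):
    """Strip every non-spacing mark after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Precomposed e-acute (U+00E9), then q followed by combining acute (U+0301).
sample = "\u00e9q\u0301"
restricted = fold_precomposed_only(sample)
unrestricted = fold_all_marks(sample)
```

With this input the restricted folding removes the acute from the e but leaves it on the q, for which no precomposed form is encoded, while the unrestricted folding removes it from both.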

    I would suggest that if "different folding rules might need to apply in
    some context", a different folding should be applied rather than trying
    to overload an existing folding whose function is supposed to be to
    remove accents or diacritics. If it removes some accents or diacritics
    from some base characters, but does not remove all from all, users will
    simply reject the folding as unreliable.

    I accept that there might be some script-specific cases in which
    particular accents should not be removed. The breve in Cyrillic i
    kratkoe might be an example; but then this might be rather too
    language-specific as well. But these should be clearly defined and
    justified exceptions, rather than their possible existence being a
    reason to restrict the general applicability of accent and diacritic
    folding.
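One way to express such clearly defined exceptions is an explicit table of (base, mark) pairs that the folding leaves intact. The Python sketch below is only an illustration: the table `KEEP` and the function `fold_with_exceptions` are hypothetical, and the single entry (Cyrillic i with breve) is the example from this message, not an agreed list.

```python
import unicodedata

# Hypothetical exception table: (base character, combining mark) pairs to keep.
# Cyrillic i + breve is "i kratkoe" (short i), in lower and upper case.
KEEP = {
    ("\u0438", "\u0306"),
    ("\u0418", "\u0306"),
}


def fold_with_exceptions(text, keep=KEEP):
    """Strip all gc=Mn marks except those listed for their base in `keep`."""
    out = []
    base = None  # most recent non-mark character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            if (base, ch) in keep:
                out.append(ch)
            continue
        base = ch
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))
```

Keeping the exceptions in a separate table leaves the general rule (remove every mark) untouched, and makes each exception visible and individually justifiable.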

    > ... plus as many Hebrew accents as Jony can confirm.
    > (another alternative would be to make the Hebrew folding a separate
    > definition, to allow people to apply one, but not the other.)
    > I'll make another Draft of DiacriticFolding.txt with the canonical
    > decomp derivables removed.
    > A./

    Peter Kirk