RE: Folding algorithm and canonical equivalence

From: Jony Rosenne (
Date: Sun Jul 18 2004 - 21:53:27 CDT

    By this logic, I cannot see why you lump Latin/Greek/Cyrillic together.

    I think there should be a single diacritics removal folding, which should be


    > -----Original Message-----
    > From:
    > [] On Behalf Of Asmus Freytag
    > Sent: Monday, July 19, 2004 12:16 AM
    > To: Peter Kirk
    > Cc: John Cowan; Unicode List; jony Rosenne
    > Subject: Re: Folding algorithm and canonical equivalence
    > At 05:25 AM 7/18/2004, Peter Kirk wrote:
    > >I accept that there might be some script-specific cases in which
    > >particular accents should not be removed. The breve in
    > Cyrillic i kratkoe
    > >might be an example; but then this might be rather too
    > language-specific
    > >as well. But these should be clearly defined and justified
    > exceptions,
    > >rather than their possible existence being a reason to restrict the
    > >general applicability of accent and diacritic folding.
    > I was thinking rather more of Khmer, where some characters that are
    > considered letters are given gc=Mn. In that case, folding
    > would be very
    > inappropriate.
    > So the answer has to be to limit the removal of diacritical marks in
    > AccentFolding, to those that are truly *accents*. That's a
    > subset of gc=Mn.
    > There are two options for a starting set:
    > select all 'accents' (note, not baseforms) that occur in some
    > precomposed
    > character. And then add additional ones on a case by case basis (e.g.
    > stroke overlay).
    > Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the
    > latter will be
    > part of 4.1), and make some principled additions / deletions.
    > All script-specific non-spacing marks for Indic scripts, etc., should
    > not be part of 'AccentFolding', in my opinion.
    > >.. when I look more closely at AccentFolding as defined I
    > see a problem
    > >with it. It is specified as affecting only "Latin/Greek/Cyrillic
    > >characters with canonical decomposition". But this is
    > inadequate because
    > >there are many cases of Latin/Greek/Cyrillic characters (and
    > most cases of
    > >Hebrew ones) where an accent should be removed even though
    > there is no
    > >precomposed form encoded and so canonical decomposition
    > Correct. Whatever the set of combining marks is, we then need
    > to define a
    > set of base characters. We could simply use sc=Latin + sc=Greek +
    > sc=Cyrillic as a starting set, to treat all accented characters
    > equally.
    > What about other scripts:
    > If you feel that Hebrew folding to unpointed is something that should
    > happen every time other accents are folded, we can add Hebrew (or we
    > can make a separate folding, HebrewMarksFolding, that people can
    > invoke optionally). I tend to prefer the latter. Since for Hebrew
    > (the languages), a folding to unpointed might be one of the foldings
    > that someone might want to apply permanently, it should be separately
    > named and defined, on the principle that the foldings should be
    > building blocks.
    > A./
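
For illustration only, the second option quoted above (gc=Mn marks from the 0300 and 1DC0 blocks, applied to Latin/Greek/Cyrillic bases) and a separately invokable HebrewMarksFolding might be sketched in Python roughly as follows. This is not the definition from the folding draft. The function names, the code point ranges, and the use of character names as a stand-in for the script property (Python's unicodedata module does not expose sc) are all assumptions, and no script-specific exceptions such as Cyrillic i kratkoe are carved out.

```python
import unicodedata

# Assumption: "accents" are the gc=Mn characters in the 0300 and 1DC0 blocks.
ACCENT_RANGES = ((0x0300, 0x036F), (0x1DC0, 0x1DFF))
# Assumption: script membership is approximated from the character name,
# because unicodedata has no script (sc) property.
BASE_NAME_PREFIXES = ("LATIN ", "GREEK ", "CYRILLIC ")
# Assumption: Hebrew points and cantillation marks are the gc=Mn characters
# in U+0591..U+05C7 (punctuation in that range is not gc=Mn and is kept).
HEBREW_MARK_RANGE = (0x0591, 0x05C7)


def is_accent(ch):
    """True for a non-spacing mark inside one of the assumed accent blocks."""
    cp = ord(ch)
    return unicodedata.category(ch) == "Mn" and any(
        lo <= cp <= hi for lo, hi in ACCENT_RANGES
    )


def is_foldable_base(ch):
    """True if the character name suggests a Latin, Greek or Cyrillic base."""
    return unicodedata.name(ch, "").startswith(BASE_NAME_PREFIXES)


def accent_folding(text):
    """Drop accents that follow a Latin/Greek/Cyrillic base character."""
    out = []
    fold = False
    # Decomposing first means precomposed and combining sequences are
    # treated alike, including accents that have no precomposed form.
    for ch in unicodedata.normalize("NFD", text):
        if not unicodedata.category(ch).startswith("M"):
            fold = is_foldable_base(ch)  # a new base character starts here
            out.append(ch)
        elif fold and is_accent(ch):
            continue  # accent on a foldable base: remove it
        else:
            out.append(ch)  # any other mark (Khmer, Indic, Hebrew...) stays
    return unicodedata.normalize("NFC", "".join(out))


def hebrew_marks_folding(text):
    """Separate building block: fold pointed Hebrew to unpointed."""
    lo, hi = HEBREW_MARK_RANGE
    kept = [
        ch
        for ch in unicodedata.normalize("NFD", text)
        if not (unicodedata.category(ch) == "Mn" and lo <= ord(ch) <= hi)
    ]
    return unicodedata.normalize("NFC", "".join(kept))
```

Keeping the two functions separate mirrors the building-block idea: a caller who wants both can apply hebrew_marks_folding(accent_folding(s)), while accent_folding on its own leaves Hebrew points and Khmer gc=Mn letters untouched.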
