RE: Folding algorithm and canonical equivalence

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Jul 19 2004 - 01:03:23 CDT

    At 07:53 PM 7/18/2004, Jony Rosenne wrote:
    >By this logic, I cannot see why you lump Latin/Greek/Cyrillic together.

    Latin/Greek/Cyrillic share the fact that for searches you may want to
    remove accents, but, except for very unusual circumstances, it's not a good
    idea to transform text permanently.

    If I understand the situation for Hebrew correctly, unpointed Hebrew is
    quite valid on its own, and the situations where someone might want to use
    that as a transform are more widespread. If that is true, breaking it out
    into separate files allows one to take mixed French / Hebrew text and
    transform the Hebrew while not affecting the French.

    The other reason is that, again as far as I can understand this, generic
    diacritics are not used with Hebrew (except perhaps for some highly
    technical texts). Therefore it would be easier to specify it as the removal
    of any nonspacing marks with the Hebrew script code:

    HebrewAccentFolding ; sc = Hebrew & gc=Mn; <null>
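
As a rough illustration of the rule above, here is a minimal Python sketch. It rests on an assumption: Python's standard `unicodedata` module does not expose the Script property, so the Hebrew block and the Hebrew presentation forms stand in for "sc = Hebrew". The function and constant names are hypothetical, not part of any proposed folding file.

```python
import unicodedata

# Assumption: code point ranges approximate "sc = Hebrew", because the
# standard library has no Script property lookup.
HEBREW_RANGES = ((0x0590, 0x05FF), (0xFB1D, 0xFB4F))


def is_hebrew_mark(ch):
    """True for a nonspacing mark (gc=Mn) inside the assumed Hebrew ranges."""
    cp = ord(ch)
    in_hebrew = any(lo <= cp <= hi for lo, hi in HEBREW_RANGES)
    return in_hebrew and unicodedata.category(ch) == "Mn"


def hebrew_accent_folding(text):
    """Map every Hebrew nonspacing mark to <null>, i.e. delete it."""
    return "".join(ch for ch in text if not is_hebrew_mark(ch))


# Mixed French / pointed Hebrew: only the Hebrew points are removed.
pointed = "caf\u00e9 \u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
print(hebrew_accent_folding(pointed))
```

Because the test is on gc=Mn, Hebrew punctuation such as maqaf (U+05BE) is left alone, and the accented French letter is untouched.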

    >I think there should be a single diacritics removal folding, which should be
    >tailorable.

    The generic diacritic folding would then be built up as follows:

    DiacriticRemoval = AccentFolding + OtherDiacriticFolding +
    HebrewAccentFolding + ArabicSyriacFolding....

    where 'HebrewAccentFolding' is as defined above, OtherDiacriticFolding
    would be the set remaining in the current DiacriticFolding.txt after
    canonical decompositions are removed, and ArabicSyriacFolding is defined
    along the same lines as HebrewAccentFolding.
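
The composition described above might be sketched as follows in Python. Several things here are assumptions rather than anything specified in this thread: "+" is read as applying the foldings one after another, `accent_folding` is approximated as canonical decomposition followed by removal of U+0300..U+036F, and the Hebrew block stands in for the Hebrew script code. OtherDiacriticFolding and ArabicSyriacFolding are left out because their contents are not given here. All names are hypothetical.

```python
import unicodedata


def accent_folding(text):
    # Assumption: decompose, drop combining marks U+0300..U+036F, recompose.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)
    return unicodedata.normalize("NFC", kept)


def hebrew_accent_folding(text):
    # Assumption: the Hebrew block approximates "sc = Hebrew".
    return "".join(
        ch for ch in text
        if not (0x0590 <= ord(ch) <= 0x05FF and unicodedata.category(ch) == "Mn")
    )


def compose(*foldings):
    """Build one folding that applies each given folding in order."""
    def folded(text):
        for folding in foldings:
            text = folding(text)
        return text
    return folded


# The generic label, built from separately usable pieces.
diacritic_removal = compose(accent_folding, hebrew_accent_folding)

sample = "caf\u00e9 \u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
print(diacritic_removal(sample))
```

Each piece stays callable on its own, so `hebrew_accent_folding` can be applied to mixed text without invoking `accent_folding`. Note that the normalization step in this sketch can change text beyond mark removal, which a real folding specification would need to address.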

    Voilà, you have your generic label to invoke DiacriticRemoval, but the
    pieces are still accessible in reasonable chunks.

    A./


