RE: Folding algorithm and canonical equivalence

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Jul 19 2004 - 01:03:23 CDT

Next message: Michael Everson: "Re: Folding algorithm and canonical equivalence"

Previous message: Jony Rosenne: "RE: Folding algorithm and canonical equivalence"
In reply to: Jony Rosenne: "RE: Folding algorithm and canonical equivalence"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Folding algorithm and canonical equivalence"

At 07:53 PM 7/18/2004, Jony Rosenne wrote:
>By this logic, I cannot see why you lump Latin/Greek/Cyrillic together.

Latin/Greek/Cyrillic share the fact that for searches you may want to
remove accents, but, except for very unusual circumstances, it's not a good
idea to transform text permanently.

If I understand the situation for Hebrew correctly, unpointed Hebrew is
quite valid on its own, and the situations where someone might want to use
that as a transform are more widespread. If that is true, breaking it out
into separate files allows one to take mixed French / Hebrew text and
transform the Hebrew while not affecting the French.

The other reason is that, again as far as I can understand this, generic
diacritics are not used with Hebrew (except perhaps for some highly
technical texts). Therefore it would be easier to specify the Hebrew folding
as the removal of all nonspacing marks that have the Hebrew script code:

HebrewAccentFolding ; sc=Hebrew & gc=Mn ; <null>
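
As a rough illustration of what such a rule could do, here is a minimal Python sketch. It makes some assumptions: Python's standard `unicodedata` module does not expose the Script property, so `sc=Hebrew` is approximated by the Hebrew block (U+0590..U+05FF) plus U+FB1E, and the function name `hebrew_accent_folding` is hypothetical, not part of any proposed data file.

```python
import unicodedata

def is_hebrew_mark(ch):
    """Approximate the test 'sc=Hebrew & gc=Mn' for a single character.

    Assumption: the Script property is not available in the standard
    library, so the Hebrew block plus U+FB1E stands in for sc=Hebrew.
    """
    cp = ord(ch)
    in_hebrew_range = 0x0590 <= cp <= 0x05FF or cp == 0xFB1E
    return in_hebrew_range and unicodedata.category(ch) == "Mn"

def hebrew_accent_folding(text):
    """Map every Hebrew nonspacing mark to <null>, i.e. delete it."""
    return "".join(ch for ch in text if not is_hebrew_mark(ch))

# Pointed "shalom": shin + shin dot + qamats, lamed, vav + holam, final mem
pointed = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
unpointed = hebrew_accent_folding(pointed)

# A decomposed French word is left alone, because U+0301 is a generic mark
french = "cafe\u0301"
french_after = hebrew_accent_folding(french)
```

Because the test is tied to the script rather than to a list of individual characters, the French accent in mixed French / Hebrew text survives this folding.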

>I think there should be a single diacritics removal folding, which should be
>tailorable.

The generic diacritic folding would then be built up as follows:

DiacriticRemoval = AccentFolding + OtherDiacriticFolding +
HebrewAccentFolding + ArabicSyriacFolding....

where 'HebrewAccentFolding' is as defined above, OtherDiacriticFolding
would be the set remaining in the current DiacriticFolding.txt after
canonical decompositions are removed, and ArabicSyriacFolding is defined
along the same lines as HebrewAccentFolding.

Voila, you have your generic label to invoke DiacriticRemoval, but the
pieces are still accessible in reasonable chunks.
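
The composition can be sketched in Python as a chain of functions applied one after another. This is only an illustration under stated assumptions: `accent_folding` is approximated as canonical decomposition followed by removal of marks in U+0300..U+036F, `sc=Hebrew` is approximated by a code point range, and OtherDiacriticFolding and ArabicSyriacFolding are left out because their contents are not spelled out here. All function names are hypothetical.

```python
import unicodedata

def accent_folding(text):
    # Assumption: decompose canonically, then drop generic combining
    # diacritics (U+0300..U+036F). The result stays in decomposed form.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)

def hebrew_accent_folding(text):
    # Assumption: the Hebrew block plus U+FB1E approximates sc=Hebrew.
    def is_hebrew_mark(ch):
        cp = ord(ch)
        in_range = 0x0590 <= cp <= 0x05FF or cp == 0xFB1E
        return in_range and unicodedata.category(ch) == "Mn"
    return "".join(ch for ch in text if not is_hebrew_mark(ch))

def compose(*foldings):
    """Build one folding that applies the given foldings in order."""
    def folded(text):
        for folding in foldings:
            text = folding(text)
        return text
    return folded

# The generic label is the sum of the pieces; each piece remains usable alone.
diacritic_removal = compose(accent_folding, hebrew_accent_folding)

mixed = "caf\u00E9 \u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
hebrew_only = hebrew_accent_folding(mixed)   # French accent kept
everything = diacritic_removal(mixed)        # both scripts folded
```

Calling `hebrew_accent_folding` alone handles the mixed French / Hebrew case, while `diacritic_removal` gives the single generic entry point.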

A./

This archive was generated by hypermail 2.1.5 : Mon Jul 19 2004 - 01:04:50 CDT