From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Jul 19 2004 - 01:03:23 CDT
At 07:53 PM 7/18/2004, Jony Rosenne wrote:
>By this logic, I cannot see why you lump Latin/Greek/Cyrillic together.
Latin, Greek and Cyrillic have this in common: for searches you may want to
remove accents, but, except in very unusual circumstances, it's not a good
idea to transform the text permanently.
If I understand the situation for Hebrew correctly, unpointed Hebrew is
quite valid on its own, and the situations where someone might want to use
that as a transform are more widespread. If that is true, breaking it out
into separate files allows one to take mixed French / Hebrew text and
transform the Hebrew while not affecting the French.
The other reason is that, again as far as I understand it, generic
diacritics are not used with Hebrew (except perhaps in some highly
technical texts). It would therefore be easier to specify the folding as the
removal of any marks with the Hebrew script code:
HebrewAccentFolding ; sc = Hebrew & gc=Mn; <null>
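To make the rule above concrete, here is a minimal Python sketch. It is only an illustration, not part of any proposed data file. Python's standard unicodedata module exposes the general category (gc) but not the script property (sc), so the sketch assumes that "sc = Hebrew" can be approximated by the Hebrew block U+0590..U+05FF plus U+FB1E. The function names are hypothetical.

```python
import unicodedata


def is_hebrew_mark(ch):
    """Approximate the condition `sc = Hebrew & gc = Mn`.

    Assumption: script membership is approximated by code point range,
    because the standard library has no script property lookup.
    """
    cp = ord(ch)
    in_hebrew_range = 0x0590 <= cp <= 0x05FF or cp == 0xFB1E
    return in_hebrew_range and unicodedata.category(ch) == "Mn"


def hebrew_accent_folding(text):
    """Map every Hebrew nonspacing mark to <null>, i.e. delete it."""
    return "".join(ch for ch in text if not is_hebrew_mark(ch))
```

Applied to mixed French / Hebrew text, this drops the Hebrew points and leaves the French accents alone, whether the accents are precomposed or written with a combining mark, since generic combining marks fall outside the Hebrew range.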
>I think there should be a single diacritics removal folding, which should be
>tailorable.
The generic diacritic folding would then be built up as follows:
DiacriticRemoval = AccentFolding + OtherDiacriticFolding +
HebrewAccentFolding + ArabicSyriacFolding....
where 'HebrewAccentFolding' is as defined above, OtherDiacriticFolding
would be the set remaining in the current DiacriticFolding.txt after
canonical decompositions are removed, and ArabicSyriacFolding is defined
along the same lines as HebrewAccentFolding.
Voilà: you have your generic label to invoke DiacriticRemoval, but the
pieces are still accessible in reasonable chunks.
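The way the pieces add up can be sketched in Python as plain function composition. This is a hedged illustration only: accent_folding below is a simple stand-in (decompose, drop marks in U+0300..U+036F, recompose) and not the contents of any actual folding file, the Hebrew test approximates the script code by code point range, and OtherDiacriticFolding and ArabicSyriacFolding are left out because their contents are not spelled out here. All function names are hypothetical.

```python
import unicodedata


def accent_folding(text):
    """Stand-in for AccentFolding: strip generic combining diacritics.

    Assumption: "accents" means the marks in U+0300..U+036F that appear
    after canonical decomposition.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)
    return unicodedata.normalize("NFC", kept)


def hebrew_accent_folding(text):
    """Stand-in for HebrewAccentFolding: drop gc=Mn marks in the Hebrew range.

    Assumption: sc=Hebrew is approximated by U+0590..U+05FF plus U+FB1E.
    """
    def is_hebrew_mark(ch):
        cp = ord(ch)
        in_range = 0x0590 <= cp <= 0x05FF or cp == 0xFB1E
        return in_range and unicodedata.category(ch) == "Mn"

    return "".join(ch for ch in text if not is_hebrew_mark(ch))


def compose(*foldings):
    """Build one folding that applies the given foldings in order."""
    def folded(text):
        for folding in foldings:
            text = folding(text)
        return text
    return folded


# The generic label; further pieces would be appended to the argument list.
diacritic_removal = compose(accent_folding, hebrew_accent_folding)
```

A caller who wants everything removed uses diacritic_removal, while a caller with mixed French / Hebrew text who only wants unpointed Hebrew calls hebrew_accent_folding on its own.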
A./