From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Jul 19 2004 - 01:03:23 CDT
At 07:53 PM 7/18/2004, Jony Rosenne wrote:
>By this logic, I cannot see why you lump Latin/Greek/Cyrillic together.
Latin, Greek, and Cyrillic have this in common: for searches you may want to 
remove accents, but, except in very unusual circumstances, it's not a good 
idea to transform the text permanently.
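A minimal Python sketch of search-time folding, where accents are stripped only in a throwaway comparison key and the stored text is never modified. The names `accent_key` and `find_matches` are hypothetical, and treating "accents" as the nonspacing marks in U+0300..U+036F is an assumption made for illustration, not a definition from any folding data file.

```python
import unicodedata


def accent_key(text: str) -> str:
    """Return a temporary search key with combining accents removed."""
    # Canonical decomposition splits e.g. U+00E9 into 'e' + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: an "accent" is a nonspacing mark (gc=Mn) in U+0300..U+036F.
    return "".join(
        ch for ch in decomposed
        if not (unicodedata.category(ch) == "Mn" and 0x0300 <= ord(ch) <= 0x036F)
    )


def find_matches(query: str, documents: list[str]) -> list[str]:
    """Compare folded keys, but return the documents exactly as stored."""
    key = accent_key(query)
    return [doc for doc in documents if key in accent_key(doc)]
```

The fold is applied on both sides of the comparison, so the original strings come back with their accents intact.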
If I understand the situation for Hebrew correctly, unpointed Hebrew is 
quite valid on its own, and the situations where someone might want to use 
that as a transform are more widespread. If that is true, breaking it out 
into separate files allows one to take mixed French / Hebrew text and 
transform the Hebrew while not affecting the French.
The other reason is that, again as far as I understand it, generic 
diacritics are not used with Hebrew (except perhaps in some highly 
technical texts). It would therefore be easier to specify the folding as 
the removal of any marks with the Hebrew script code:
HebrewAccentFolding ; sc = Hebrew & gc=Mn; <null>
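The rule above could be sketched in Python roughly as follows. The standard `unicodedata` module does not expose the Script property, so this sketch uses the Hebrew block (U+0590..U+05FF) as a stand-in for sc=Hebrew; that substitution, and the function name, are assumptions for illustration only.

```python
import unicodedata

# Assumption: the Hebrew block approximates "sc = Hebrew", because the
# standard library has no Script property lookup.
HEBREW_BLOCK = range(0x0590, 0x0600)


def hebrew_accent_folding(text: str) -> str:
    """Remove nonspacing marks (gc=Mn) that fall in the Hebrew block."""
    return "".join(
        ch for ch in text
        if not (ord(ch) in HEBREW_BLOCK and unicodedata.category(ch) == "Mn")
    )
```

Because the test is tied to Hebrew code points, a combining acute (U+0301) on French text passes through untouched, as does Hebrew punctuation such as maqaf (U+05BE), which is not gc=Mn.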
>I think there should be a single diacritics removal folding, which should be
>tailorable.
The generic diacritic folding would then be built up as follows:
DiacriticRemoval = AccentFolding + OtherDiacriticFolding + 
HebrewAccentFolding + ArabicSyriacFolding....
where 'HebrewAccentFolding' is as defined above, OtherDiacriticFolding 
would be the set remaining in the current DiacriticFolding.txt after 
canonical decompositions are removed, and ArabicSyriacFolding is defined 
along the same lines as HebrewAccentFolding.
Voilà, you have your generic label for invoking DiacriticRemoval, but the 
pieces are still accessible in reasonable chunks.
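The build-up described above can be sketched in Python as a composition of independent folding functions. Everything here is a hedged illustration: block ranges stand in for script codes, the `OTHER_DIACRITICS` table is a tiny hypothetical sample and not the contents of DiacriticFolding.txt, and the function names merely mirror the labels used in this message.

```python
import unicodedata
from functools import reduce


def _strip_marks(text: str, first: int, last: int) -> str:
    """Drop nonspacing marks (gc=Mn) whose code point lies in [first, last]."""
    return "".join(
        ch for ch in text
        if not (first <= ord(ch) <= last and unicodedata.category(ch) == "Mn")
    )


def accent_folding(text: str) -> str:
    # Assumption: decompose canonically, then drop marks in U+0300..U+036F.
    return _strip_marks(unicodedata.normalize("NFD", text), 0x0300, 0x036F)


# Hypothetical sample of letters that have no canonical decomposition.
OTHER_DIACRITICS = str.maketrans({"ø": "o", "Ø": "O", "ł": "l", "Ł": "L"})


def other_diacritic_folding(text: str) -> str:
    return text.translate(OTHER_DIACRITICS)


def hebrew_accent_folding(text: str) -> str:
    # Assumption: Hebrew block as a stand-in for sc=Hebrew.
    return _strip_marks(text, 0x0590, 0x05FF)


def arabic_syriac_folding(text: str) -> str:
    # Assumption: Arabic and Syriac blocks as stand-ins for their script codes.
    return _strip_marks(text, 0x0600, 0x074F)


def compose(*foldings):
    """Build one folding that applies each given folding in order."""
    def folded(text: str) -> str:
        return reduce(lambda acc, fold: fold(acc), foldings, text)
    return folded


# The generic label, built from separately usable pieces.
diacritic_removal = compose(
    accent_folding,
    other_diacritic_folding,
    hebrew_accent_folding,
    arabic_syriac_folding,
)
```

Calling `hebrew_accent_folding` on mixed French / Hebrew text leaves the French accents alone, while `diacritic_removal` folds both.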
A./ 