From: Peter Kirk (firstname.lastname@example.org)
Date: Sun Jul 18 2004 - 07:25:33 CDT
On 18/07/2004 08:52, Asmus Freytag wrote:
> At 11:15 PM 7/17/2004, John Cowan wrote:
>> I agree that in the TR#30 context, the Right Thing is to remove the
>> character pair mappings altogether, and all of the single-character
>> mappings that have canonical decompositions
> In other words, in your opinion, the reasonable thing to do would be
> for someone to do the AccentFolding as defined in the TR, and then do
> a DiacriticFolding, to fold the cases where even in NFD accents don't
> exist as separate characters.
This is not quite what I had in mind, but only because when I look more
closely at AccentFolding as defined I see a problem with it. It is
specified as affecting only "Latin/Greek/Cyrillic characters with
canonical decomposition". But this is inadequate because there are many
cases of Latin/Greek/Cyrillic characters (and most cases of Hebrew ones)
where an accent should be removed even though there is no precomposed
form encoded and so no canonical decomposition. This definition needs to be
extended to deletion of all accents, i.e. probably all non-spacing
combining marks, regardless of whether there is a canonical
decomposition, at least when the base character is
Latin/Greek/Cyrillic/Hebrew (and probably also at least Arabic and
Syriac, in which combining marks function much as in Hebrew).
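
A minimal sketch of the extended folding described above, using Python's standard unicodedata module. It decomposes to NFD and then drops every non-spacing mark (gc=Mn) that follows a base character in one of the listed scripts. The names FOLDED_SCRIPTS, base_script and extended_accent_fold are hypothetical, and the script test is only an approximation: it takes the first word of the character name, because the standard library exposes no Script property.

```python
import unicodedata

# Assumption: scripts are approximated by the first word of the character
# name (e.g. "HEBREW LETTER SHIN" -> "HEBREW"); this is not the real
# Unicode Script property.
FOLDED_SCRIPTS = {"LATIN", "GREEK", "CYRILLIC", "HEBREW", "ARABIC", "SYRIAC"}


def base_script(ch):
    """Return the first word of the character name, or "" if unnamed."""
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else ""


def extended_accent_fold(text):
    """Remove gc=Mn marks that follow a base character in FOLDED_SCRIPTS.

    The marks are removed whether or not a precomposed form exists for
    the base+mark combination. The result is left in NFD.
    """
    out = []
    strip = False  # True while the current base belongs to a folded script
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            if not strip:
                out.append(ch)  # mark on a base outside the listed scripts
        else:
            strip = base_script(ch) in FOLDED_SCRIPTS
            out.append(ch)
    return "".join(out)
```

Under these assumptions, pointed Hebrew folds to unpointed text, Latin m followed by U+0303 folds to plain m even though no precomposed form is encoded, and a mark on a base from an unlisted script (for example a Devanagari vowel sign) is left alone.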
Such an extended AccentFolding would then function as a good base for a
> That's certainly reasonable and not the only case where it's
> interesting to have chained foldings.
> Jony is arguing to extend AccentFolding to Hebrew (fold to unpointed).
> His suggestion is to fold *all* combining marks used with Hebrew in
> that case. I want to double check that he really means all combining
> marks in the Hebrew block, or just some of them.
> AccentFolding can't just fold all gc=Mn, since that would include
> quite a few that are script specific as well as the marks for Symbols,
> for which different folding rules might need to apply in some context.
> So I think I'll use as the set of accents to remove all the ones that
> show up as part of decompositions, ...
This restriction will end up with some ridiculous results if applied to
a language in which only some of the regular letters are supported as
precomposed forms: the identical mark will be stripped from some base
characters but not from others.
I would suggest that if "different folding rules might need to apply in
some context", a different folding should be applied rather than trying
to overload an existing folding whose function is supposed to be to
remove accents or diacritics. If it removes some accents or diacritics
from some base characters, but does not remove all from all, users will
simply reject the folding as unreliable.
I accept that there might be some script-specific cases in which
particular accents should not be removed. The breve in Cyrillic i
kratkoe might be an example; but then this might be rather too
language-specific as well. But these should be clearly defined and
justified exceptions, rather than their possible existence being a
reason to restrict the general applicability of accent and diacritic
> ... plus as many Hebrew accents as Jony can confirm.
> (another alternative would be to make the Hebrew folding a separate
> definition, to allow people to apply one, but not the other.)
> I'll make another Draft of DiacriticFolding.txt with the canonical
> decomp derivables removed.
-- Peter Kirk email@example.com (personal) firstname.lastname@example.org (work) http://www.qaya.org/