Re: Folding algorithm and canonical equivalence

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Jul 18 2004 - 16:15:49 CDT

Next message: Asmus Freytag: "Re: Folding algorithm and canonical equivalence"

Previous message: Asmus Freytag: "RE: Folding algorithm and canonical equivalence"
In reply to: Peter Kirk: "Re: Folding algorithm and canonical equivalence"
Next in thread: Peter Kirk: "Re: Folding algorithm and canonical equivalence"
Reply: Peter Kirk: "Re: Folding algorithm and canonical equivalence"
Reply: John Cowan: "Re: Folding algorithm and canonical equivalence"
Reply: Jony Rosenne: "RE: Folding algorithm and canonical equivalence"

At 05:25 AM 7/18/2004, Peter Kirk wrote:
>I accept that there might be some script-specific cases in which
>particular accents should not be removed. The breve in Cyrillic i kratkoe
>might be an example; but then this might be rather too language-specific
>as well. But these should be clearly defined and justified exceptions,
>rather than their possible existence being a reason to restrict the
>general applicability of accent and diacritic folding.

I was thinking rather more of Khmer, where some characters that are
considered letters are given gc=Mn. In that case, folding would be very
inappropriate.

So the answer has to be to limit the removal of diacritical marks in
AccentFolding to those that are truly *accents*. That's a subset of gc=Mn.
There are two options for a starting set:

Select all 'accents' (note, not baseforms) that occur in some precomposed
character, and then add additional ones on a case-by-case basis (e.g.
stroke overlay).

Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter will be
part of 4.1), and make some principled additions / deletions.

All script-specific non-spacing marks for Indic scripts etc. should not be
part of 'AccentFolding', in my opinion.
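
As a rough sketch of the two starting sets, here is how they could be
computed in Python with only the standard unicodedata module. The function
names are made up for illustration, and the exact contents of each set
depend on the Unicode version your Python ships with.

```python
import sys
import unicodedata

# Hypothetical block ranges for option 2: Combining Diacritical Marks
# (0300) and Combining Diacritical Marks Supplement (1DC0).
ACCENT_BLOCKS = [(0x0300, 0x036F), (0x1DC0, 0x1DFF)]


def marks_from_precomposed():
    """Option 1: gc=Mn marks that occur in some canonical decomposition."""
    marks = set()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility one
        for part in decomp.split():
            ch = chr(int(part, 16))
            # Base forms are not gc=Mn, so this keeps only the marks.
            if unicodedata.category(ch) == "Mn":
                marks.add(ch)
    return marks


def marks_from_blocks():
    """Option 2: every gc=Mn character in the two blocks above."""
    return {
        chr(cp)
        for lo, hi in ACCENT_BLOCKS
        for cp in range(lo, hi + 1)
        if unicodedata.category(chr(cp)) == "Mn"
    }
```

Neither set contains a Khmer vowel sign such as U+17B7, even though it is
gc=Mn. Option 1 does pick up script-specific marks like U+093C DEVANAGARI
SIGN NUKTA, which is the kind of entry that would have to be deleted again.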

>.. when I look more closely at AccentFolding as defined I see a problem
>with it. It is specified as affecting only "Latin/Greek/Cyrillic
>characters with canonical decomposition". But this is inadequate because
>there are many cases of Latin/Greek/Cyrillic characters (and most cases of
>Hebrew ones) where an accent should be removed even though there is no
>precomposed form encoded and so canonical decomposition

Correct. Whatever the set of combining marks is, we then need to define a
set of base characters. We could simply use sc=Latin + sc=Greek +
sc=Cyrillic as a starting set, to treat all accented characters equally.
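
To make the "set of marks plus set of base characters" idea concrete, here
is a minimal Python sketch. It is only an illustration under two
assumptions: the standard library has no Script property, so sc is
approximated by the character name prefix, and the accent set is taken to
be gc=Mn in the 0300 block. The function names are hypothetical.

```python
import unicodedata


def is_lgc_base(ch):
    # Assumption: approximate sc=Latin/Greek/Cyrillic by the character
    # name, because unicodedata does not expose the Script property.
    name = unicodedata.name(ch, "")
    return name.startswith(("LATIN ", "GREEK ", "CYRILLIC "))


def is_accent(ch):
    # Assumption: the accent set is gc=Mn in the 0300 block only.
    return unicodedata.category(ch) == "Mn" and 0x0300 <= ord(ch) <= 0x036F


def accent_fold(text):
    out = []
    base_ok = False  # does the current base character belong to the set?
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            if base_ok and is_accent(ch):
                continue  # drop the accent
        else:
            base_ok = is_lgc_base(ch)
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))
```

Because the text is decomposed first, a sequence such as q + U+0303 is
folded just like a precomposed letter, which addresses the point about
characters without canonical decompositions. Note that this naive version
also folds Cyrillic i kratkoe to plain i, so any agreed exceptions would
still have to be carved out.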

What about other scripts?

If you feel that folding Hebrew to unpointed text is something that should
happen every time other accents are folded, we can add Hebrew. Or we can
make a separate folding, HebrewMarksFolding, that people can invoke
optionally. I tend to prefer the latter. Since for Hebrew (the languages) a
folding to unpointed might be one of the foldings that someone might want
to apply permanently, it should be separately named and defined, on the
principle that the foldings should be building blocks.
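
A small Python sketch of the building-block idea, under stated assumptions:
HebrewMarksFolding is taken here to mean "drop every gc=Mn character in
U+0591..U+05C7", and accent_folding is a deliberately crude stand-in that
drops everything in the 0300 block. Neither definition is settled; the
names and ranges are only for illustration.

```python
import unicodedata


def hebrew_marks_folding(text):
    # Assumption: points and cantillation marks are the gc=Mn characters
    # in the Hebrew block range U+0591..U+05C7.
    nfd = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in nfd
        if not (0x0591 <= ord(ch) <= 0x05C7
                and unicodedata.category(ch) == "Mn")
    ]
    return unicodedata.normalize("NFC", "".join(kept))


def accent_folding(text):
    # Crude stand-in: drop every character in the 0300 block.
    nfd = unicodedata.normalize("NFD", text)
    kept = [ch for ch in nfd if not (0x0300 <= ord(ch) <= 0x036F)]
    return unicodedata.normalize("NFC", "".join(kept))


def compose(*foldings):
    # Build one folding that applies the given foldings left to right.
    def folded(text):
        for folding in foldings:
            text = folding(text)
        return text
    return folded
```

With this shape, accent_folding on its own leaves pointed Hebrew alone, and
someone who wants both simply uses compose(accent_folding,
hebrew_marks_folding).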

A./


This archive was generated by hypermail 2.1.5 : Sun Jul 18 2004 - 16:16:32 CDT