From: Jony Rosenne (rosennej@qsm.co.il)
Date: Sun Jul 18 2004 - 21:53:27 CDT
By this logic, I cannot see why you lump Latin/Greek/Cyrillic together.
I think there should be a single diacritics removal folding, which should be
tailorable.
Jony
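
A single, tailorable diacritic-removal folding of the kind suggested above could look roughly like the Python sketch below. It is an illustration only: the function name fold_diacritics and its keep parameter are hypothetical, and treating every gc=Mn character as removable is an assumption (the quoted message below points out that this is too broad for scripts such as Khmer).

```python
import unicodedata

def fold_diacritics(text, keep=frozenset()):
    """Remove nonspacing marks (gc=Mn) unless they are listed in `keep`.

    `keep` is the hypothetical tailoring hook: a set of combining
    characters that a particular language or application wants preserved.
    """
    # Canonical decomposition separates base letters from their marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed
            if unicodedata.category(ch) != "Mn" or ch in keep]
    # Recompose whatever marks were kept.
    return unicodedata.normalize("NFC", "".join(kept))
```

With no tailoring, fold_diacritics("caf\u00e9") gives "cafe". A tailoring that keeps U+0306 COMBINING BREVE leaves Cyrillic i kratkoe (U+0439) intact.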
> -----Original Message-----
> From: unicode-bounce@unicode.org
> [mailto:unicode-bounce@unicode.org] On Behalf Of Asmus Freytag
> Sent: Monday, July 19, 2004 12:16 AM
> To: Peter Kirk
> Cc: John Cowan; Unicode List; jony Rosenne
> Subject: Re: Folding algorithm and canonical equivalence
>
>
> At 05:25 AM 7/18/2004, Peter Kirk wrote:
> >I accept that there might be some script-specific cases in which
> >particular accents should not be removed. The breve in Cyrillic i
> >kratkoe might be an example; but then this might be rather too
> >language-specific as well. But these should be clearly defined and
> >justified exceptions, rather than their possible existence being a
> >reason to restrict the general applicability of accent and diacritic
> >folding.
>
> I was thinking rather more of Khmer, where some characters that are
> considered letters are given gc=Mn. In that case, folding would be
> very inappropriate.
>
> So the answer has to be to limit the removal of diacritical marks in
> AccentFolding to those that are truly *accents*. That's a subset of
> gc=Mn. There are two options for a starting set:
>
> Select all 'accents' (note, not baseforms) that occur in some
> precomposed character, and then add additional ones on a case by case
> basis (e.g. stroke overlay).
>
> Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter
> will be part of 4.1), and make some principled additions / deletions.
>
> All script-specific non-spacing marks for Indic scripts etc. should
> not be part of 'AccentFolding', in my opinion.
>
> >.. when I look more closely at AccentFolding as defined I see a
> >problem with it. It is specified as affecting only
> >"Latin/Greek/Cyrillic characters with canonical decomposition". But
> >this is inadequate because there are many cases of
> >Latin/Greek/Cyrillic characters (and most cases of Hebrew ones) where
> >an accent should be removed even though there is no precomposed form
> >encoded and so canonical decomposition
>
> Correct. Whatever the set of combining marks is, we then need to
> define a set of base characters. We could simply use sc=Latin +
> sc=Greek + sc=Cyrillic as a starting set, to treat all accented
> characters equally.
>
> What about other scripts:
>
> If you feel that Hebrew folding to unpointed is something that should
> happen every time other accents are folded, we can add Hebrew (or we
> can make a separate folding, HebrewMarksFolding, that people can
> invoke optionally). I tend to prefer the latter. Since for Hebrew (the
> languages), a folding to unpointed might be one of the foldings that
> someone might want to apply permanently, it should be separately named
> and defined, on the principle that the foldings should be building
> blocks.
>
> A./
>
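
The building-block approach in the quoted message (a restricted set of accents, a set of base characters chosen by script, and a separately named Hebrew folding) might be sketched in Python as below. Everything here is an assumption for illustration: the function names are hypothetical, the accent set is simply gc=Mn within U+0300..U+036F and U+1DC0..U+1DFF with no principled additions or deletions, the script test is a rough stand-in based on character names (the standard unicodedata module does not expose the Script property), and the Hebrew folding just drops gc=Mn within U+0591..U+05C7.

```python
import unicodedata

# Assumed starting set: the 0300 and 1DC0 combining-mark blocks.
ACCENT_RANGES = ((0x0300, 0x036F), (0x1DC0, 0x1DFF))
# Rough approximation of sc=Latin + sc=Greek + sc=Cyrillic via names.
SCRIPT_PREFIXES = ("LATIN", "GREEK", "CYRILLIC")

def is_accent(ch):
    cp = ord(ch)
    in_block = any(lo <= cp <= hi for lo, hi in ACCENT_RANGES)
    return in_block and unicodedata.category(ch) == "Mn"

def is_lgc_base(ch):
    return unicodedata.name(ch, "").startswith(SCRIPT_PREFIXES)

def accent_folding(text):
    """Drop accents that follow a Latin/Greek/Cyrillic base character."""
    out = []
    base_is_lgc = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            if base_is_lgc and is_accent(ch):
                continue  # fold the accent away
        else:
            base_is_lgc = is_lgc_base(ch)
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))

def hebrew_marks_folding(text):
    """Separate building block: fold pointed Hebrew to unpointed."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed
            if not (0x0591 <= ord(ch) <= 0x05C7
                    and unicodedata.category(ch) == "Mn")]
    return unicodedata.normalize("NFC", "".join(kept))
```

Under these assumptions, accent_folding removes the accent from "x" followed by U+0301 even though no precomposed form exists, and leaves Khmer and pointed Hebrew text unchanged. Hebrew points are removed only when hebrew_marks_folding is applied as well, for example hebrew_marks_folding(accent_folding(s)).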