Re: Folding algorithm and canonical equivalence

From: Peter Kirk (peterkirk@qaya.org)
Date: Sun Jul 18 2004 - 17:44:43 CDT

Next message: Doug Ewell: "Re: Writing Tatar using the Latin script; new characters to encode?"

Previous message: Peter Kirk: "Re: Writing Tatar using the Latin script; new characters to encode?"
In reply to: Asmus Freytag: "Re: Folding algorithm and canonical equivalence"
Next in thread: John Cowan: "Re: Folding algorithm and canonical equivalence"

On 18/07/2004 22:15, Asmus Freytag wrote:

> At 05:25 AM 7/18/2004, Peter Kirk wrote:
>
>> I accept that there might be some script-specific cases in which
>> particular accents should not be removed. The breve in Cyrillic i
>> kratkoe might be an example; but then this might be rather too
>> language-specific as well. But these should be clearly defined and
>> justified exceptions, rather than their possible existence being a
>> reason to restrict the general applicability of accent and diacritic
>> folding.
>
>
> I was thinking rather more of Khmer, where some characters that are
> considered letters are given gc=Mn. In that case, folding would be
> very inappropriate.
>
> So the answer has to be to limit the removal of diacritical marks in
> AccentFolding to those that are truly *accents*. That's a subset of
> gc=Mn. There are two options for a starting set:
>
> Select all 'accents' (note, not base forms) that occur in some
> precomposed character, and then add additional ones on a case-by-case
> basis (e.g. stroke overlay).
>
> Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter
> will be part of 4.1), and make some principled additions / deletions.

This sounds good to me. Among the additions should be all Hebrew
combining marks unless this is done separately.
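To make the second option concrete, here is a rough sketch in Python, not anything taken from the folding proposal itself. It assumes the starting set is exactly the gc=Mn characters in the ranges U+0300..U+036F and U+1DC0..U+1DFF, with no principled additions or deletions applied yet. The names is_accent and accent_fold are made up for illustration, and the result for the 1DC0 block depends on the Unicode version shipped with the Python build.

```python
import unicodedata

# Assumed starting set: gc=Mn characters in the two combining-mark blocks
# named in the thread. The block ranges are hard-coded here.
ACCENT_BLOCKS = ((0x0300, 0x036F), (0x1DC0, 0x1DFF))


def is_accent(ch):
    """Hypothetical predicate: True if ch is a gc=Mn mark in one of the blocks."""
    if unicodedata.category(ch) != "Mn":
        return False
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ACCENT_BLOCKS)


def accent_fold(text):
    """Decompose, drop the marks in the starting set, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not is_accent(ch))
    return unicodedata.normalize("NFC", kept)
```

With this starting set, a sequence such as x + U+0301, which has no precomposed form, is folded just like a precomposed letter, and Khmer gc=Mn characters such as U+17B7 are left alone. Cyrillic i kratkoe (U+0439) is folded to U+0438, which is the kind of case that would need a defined exception, and a letter with a stroke such as U+00F8 is untouched because it has no canonical decomposition.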

>
> All script-specific non-spacing marks for Indic scripts etc. should
> not be part of 'AccentFolding', in my opinion.
>
>> .. when I look more closely at AccentFolding as defined I see a
>> problem with it. It is specified as affecting only
>> "Latin/Greek/Cyrillic characters with canonical decomposition". But
>> this is inadequate because there are many cases of
>> Latin/Greek/Cyrillic characters (and most cases of Hebrew ones) where
>> an accent should be removed even though there is no precomposed form
>> encoded and so canonical decomposition
>
>
> Correct. Whatever the set of combining marks is, we then need to
> define a set of base characters. We could simply use sc=Latin +
> sc=Greek + sc=Cyrillic as a starting set, to treat all accented
> characters equally.
>
> What about other scripts:
>
> If you feel that Hebrew folding to unpointed is something that should
> happen every time other accents are folded, we can add Hebrew (or we
> can make a separate folding, HebrewMarksFolding, that people can
> invoke optionally). I tend to prefer the latter. Since for Hebrew
> (the languages), a folding to unpointed might be one of the foldings
> that someone might want to apply permanently, it should be separately
> named and defined, on the principle that the foldings should be
> building blocks.

Agreed that it should be separate, but I would also see it as included
as a subset within the regular accent folding.
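A minimal sketch of what I mean by "separate but included", again in Python and again only under my own assumptions: Hebrew marks are taken to be the gc=Mn characters in U+0590..U+05FF, general accents are taken to be the gc=Mn characters in U+0300..U+036F, and the function names (strip_marks, hebrew_marks_folding, accent_folding) are hypothetical. Neither set is a definition from the proposal.

```python
import unicodedata


def strip_marks(text, is_mark):
    """Shared building block: decompose, drop marks matching is_mark, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not is_mark(ch))
    return unicodedata.normalize("NFC", kept)


def is_hebrew_mark(ch):
    # Assumption: every gc=Mn character in the Hebrew block counts as a mark.
    return unicodedata.category(ch) == "Mn" and 0x0590 <= ord(ch) <= 0x05FF


def is_general_accent(ch):
    # Assumption: only the 0300 block is used for general accents here.
    return unicodedata.category(ch) == "Mn" and 0x0300 <= ord(ch) <= 0x036F


def hebrew_marks_folding(text):
    """Separately named folding: Hebrew text to unpointed, nothing else."""
    return strip_marks(text, is_hebrew_mark)


def accent_folding(text):
    """Regular accent folding, with the Hebrew folding included as a subset."""
    return strip_marks(text, lambda ch: is_general_accent(ch) or is_hebrew_mark(ch))
```

Someone who wants unpointed Hebrew permanently, without touching Latin accents, calls hebrew_marks_folding alone. Anyone applying accent_folding gets the Hebrew folding along with the rest.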

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Sun Jul 18 2004 - 17:45:24 CDT