Re: Folding algorithm and canonical equivalence

From: Peter Kirk (peterkirk@qaya.org)
Date: Sat Jul 17 2004 - 18:59:05 CDT

    On 18/07/2004 00:46, Asmus Freytag wrote:

    > Thank you for reviewing this.
    >
    > DiacriticFolding (unlike AccentFolding) is selective about which
    > combining marks it removes for which base character. I wonder whether
    > that's truly intended, or whether it could be replaced by a
    > combination of
    >
    > AccentFolding
    > OtherDiacriticFolding
    >
    > where AccentFolding removes *all* nonspacing marks following Latin,
    > Greek or Cyrillic letters and we would remove from DiacriticFolding
    > all cases that are already handled by accent folding.
    >
    > That still doesn't take care of Hebrew, so we would need to decide how
    > to handle that. Perhaps you would like to put forth a proposal as to
    > what accents or diacritics should be folded for Hebrew, and in what
    > context. Is it just Dagesh?

    No, Dagesh is actually the *least* likely combining mark to be stripped
    as it is the most closely bound to the base character (and for this
    reason ended up in legacy precomposed characters and thence into the
    draft table). But I think the best thing to do is to drop *all* Hebrew
    combining marks; the result of this is valid unpointed Hebrew. This
    corresponds to the implicit folding already defined by SII and described
    in the quotation from SI 4281 in
    http://www.qsm.co.il/Hebrew/Responses%20to%20Several%20Hebrew%20Items.pdf
    = L2/04-213. But Jony Rosenne needs to provide input on this.
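
    To make the "drop all Hebrew combining marks" idea concrete, here is
    a minimal sketch in Python. It is my own illustration, not anything
    taken from SI 4281 or the draft TR30 tables. It assumes that "Hebrew
    combining marks" means the nonspacing marks (General Category Mn) in
    the Hebrew block plus U+FB1E, and the function names are made up.
    Decomposing first takes care of the legacy precomposed characters.

```python
import unicodedata

def is_hebrew_mark(ch):
    # Assumption: a "Hebrew combining mark" is any nonspacing mark (Mn)
    # in U+0591..U+05C7, plus U+FB1E from the presentation forms.
    cp = ord(ch)
    in_hebrew_block = 0x0591 <= cp <= 0x05C7
    return unicodedata.category(ch) == "Mn" and (in_hebrew_block or cp == 0xFB1E)

def fold_hebrew_marks(text):
    # Decompose so that precomposed forms such as U+FB31 (bet with dagesh)
    # expose their marks, then drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not is_hebrew_mark(ch))
    # Recompose so that non-Hebrew text is returned in NFC.
    return unicodedata.normalize("NFC", stripped)

pointed = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"  # pointed "shalom"
print(fold_hebrew_marks(pointed))
```

    Punctuation such as maqaf (U+05BE) is not a nonspacing mark, so this
    sketch leaves it alone. Whether that is the desired behaviour is part
    of what needs input.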

    >
    > The other alternative would be to limit the nonspacing marks to those
    > that actually occur with Latin / Greek / Cyrillic letters as ordinary
    > diacritics (i.e. all the diacritics that show up in
    > DiacriticFolding.txt), but then remove them if they follow *any* base
    > character from that set, not just in certain fixed combinations.

    Are there actually cases where these marks follow any other base
    characters and they should *not* be removed? That is what confuses me.
    It would be much simpler just to delete them independent of context.
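
    The difference between the two policies can be sketched as follows.
    This is only an illustration: SAMPLE_MARKS is a hypothetical stand-in
    for the marks listed in DiacriticFolding.txt, and the script test is
    a heuristic based on character names, since the Python standard
    library has no script property.

```python
import unicodedata

# Hypothetical stand-in for the marks listed in DiacriticFolding.txt.
SAMPLE_MARKS = {"\u0300", "\u0301", "\u0308"}  # grave, acute, diaeresis

def is_lgc_letter(ch):
    # Heuristic (an assumption): treat a letter as Latin, Greek or Cyrillic
    # if its character name starts with that word.
    name = unicodedata.name(ch, "")
    return unicodedata.category(ch).startswith("L") and name.startswith(
        ("LATIN", "GREEK", "CYRILLIC"))

def fold_any_context(text):
    # Policy A: delete the listed marks wherever they occur.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch not in SAMPLE_MARKS)

def fold_after_lgc(text):
    # Policy B: delete the listed marks only when the most recent
    # non-mark character is a Latin, Greek or Cyrillic letter.
    out = []
    base = ""
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) != "Mn":
            base = ch
        elif ch in SAMPLE_MARKS and base and is_lgc_letter(base):
            continue
        out.append(ch)
    return "".join(out)

sample = "\u05D0\u0301"  # Hebrew alef followed by combining acute
print(fold_any_context(sample) == fold_after_lgc(sample))
```

    Running both functions over real text and comparing the results would
    be one way to find out whether any case exists where the two policies
    disagree and the contextual one is actually wanted.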

    >
    > Rather than list the mappings in a file, we would simply list the
    > conditions, similar to AccentFolding (see
    > http://www.unicode.org/reports/tr30/Foldings.txt) and reduce the data
    > file to those cases where there are no mappings (o with stroke -> o,
    > combining stroke overlay, etc.).

    I think you mean
    http://www.unicode.org/reports/tr30/datafiles/Foldings.txt. This seems
    sensible to me.
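
    As I understand the suggestion, the folding would become a rule plus
    a short residual table. A rough sketch, assuming for simplicity that
    the rule is "drop every nonspacing mark after decomposition" and
    using a made-up RESIDUAL table with a few letters that have no
    canonical decomposition (this is not the content of any TR30 data
    file):

```python
import unicodedata

# Illustrative residue only: letters with no canonical decomposition,
# so no rule about combining marks can reach them.
RESIDUAL = {"\u00f8": "o", "\u00d8": "O", "\u0142": "l", "\u0141": "L"}

def fold(text):
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            continue  # the rule: drop nonspacing marks
        out.append(RESIDUAL.get(ch, ch))  # the table: explicit mappings
    return "".join(out)

print(fold("s\u00f8ster"))
```

    The data file then only needs entries like the ones in RESIDUAL, and
    everything else follows from the stated condition.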

    >
    > John, you proposed the initial set. Do you have any suggestion here?
    >
    > A./

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

