Re: Character Foldings

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed May 26 2004 - 10:24:24 CDT

    The combinatorics get nasty if you want to have prepackaged foldings for
    everything. It is better to provide the building blocks, and *then*, if there
    are a few derived, prepackaged foldings that make sense, supply those as well.

    The reason for not doing a "full" decomposition in the canonical mappings goes
    back a long way, and the decision was taken then not to decompose characters
    that have significant 'overlap' or change in shape. So we don't decompose U+0127
    LATIN SMALL LETTER H WITH STROKE to, well, h and stroke. Nor U+0199 LATIN SMALL
    LETTER K WITH HOOK, and so on.
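
    A quick check in Python (using the standard unicodedata module) shows the
    difference between a character that has a canonical decomposition and one
    that doesn't:

        import unicodedata

        # U+00E9 LATIN SMALL LETTER E WITH ACUTE has a canonical
        # decomposition, so NFD pulls it apart:
        print(ascii(unicodedata.normalize("NFD", "\u00E9")))   # 'e\u0301'

        # U+0127 LATIN SMALL LETTER H WITH STROKE has none; NFD leaves it
        # alone, and its decomposition field is empty:
        print(ascii(unicodedata.normalize("NFD", "\u0127")))   # '\u0127'
        print(ascii(unicodedata.decomposition("\u0127")))      # ''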

    Whether or not that was a good decision, it is water under the bridge -- of
    course I am not suggesting any change to canonical mappings. However -- as you
    quickly discover when you do things like collation or matching -- for characters
    outside of a language's repertoire, people do think of these things as
    variations on the base letter. Having a more comprehensive mapping to know
    *which* variation it is (with what other characters) is valuable, and provides
    more information than just the base letter.

    Once you have this mapping, it is trivial for anyone to use it to get a mapping
    that just goes to base letters. But if you only supply the base letter mapping,
    you don't have enough information to get the original back.
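
    As a minimal sketch in Python (the table entries are illustrative choices,
    not drawn from any published data file):

        # Hypothetical "full decomposition" table for characters that have
        # no canonical decomposition. The mapped-to sequences are
        # illustrative, not official.
        FULL = {
            "\u00D8": "O\u0338",  # O WITH STROKE => O + combining long solidus overlay
            "\u0127": "h\u0335",  # h WITH STROKE => h + combining short stroke overlay
        }

        # Deriving the coarser fold-to-base-letter table is trivial:
        BASE = {src: full[0] for src, full in FULL.items()}   # O-stroke => O, h-stroke => h

        # The reverse derivation is impossible: once everything is folded
        # to 'O', the information about *which* variant it was is gone.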

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄ ("one should wish to be defeated by one's own disciple")

    ----- Original Message -----
    From: "Asmus Freytag" <asmusf@ix.netcom.com>
    To: "Mark Davis" <mark.davis@jtcsv.com>; <jcowan@reutershealth.com>;
    <unicode@unicode.org>
    Sent: Wed, 2004 May 26 00:31
    Subject: Re: Character Foldings

    > At 05:10 PM 5/25/2004, Mark Davis wrote:
    > >I don't think the "fold to base" is as useful as some other information. For
    > >those characters with a canonical decomposition, the decomposition carries
    > >more information, since you can combine it with a "remove combining marks"
    > >folding to get the folding to base.
    >
    > I think this would have to be 'remove combining *accents*'. You wouldn't
    > want to remove Indic combining marks by force, if what you are interested
    > in is L/G/C style diacritic removal.
    >
    > >For my part, what would be more interesting would be a "full" decomposition of
    > >the characters that don't have a canonical decomposition, e.g.
    > >
    > >LATIN CAPITAL LETTER O WITH STROKE => O + /
    >
    > I believe that when we first discussed this for TR30 it was mentioned that
    > there are characters with diacritic-like features for which there aren't
    > combining accents, because we deemed them not productive enough and too
    > intractable for rendering purposes.
    >
    > For those characters you wouldn't be able to make a true decomposition, but
    > the base character may still be well-defined.
    >
    >
    > I don't see where the decomposition would provide 'more' information -
    > nobody suggests getting rid of it. The problem is, as I mentioned on the
    > Unicore list, how to combine flexibility for technically savvy implementers
    > with specifications of foldings that are based on the (linguistic) facets
    > that define the equivalence class.
    >
    > This is in fact a good example: if I want to fold characters to their base
    > form, so that a search term can be typed either from a keyboard that doesn't
    > have accents or by a user who doesn't know which accent is correct, I can
    > proceed in two ways: I can create a one-stop-shopping folding that takes
    > any Unicode data stream and produces the desired result. Or I can string
    > together a number of building blocks, e.g. first normalize NFD, then
    > 'decompose' fully, then remove accents.
    >
    > In the first approach, tables will contain duplicate entries. I've pushed
    > the problem of how to factor this onto the implementer (but given that all
    > the information is there, implementers could use semi-automated tools to
    > create an ad-hoc factoring).
    >
    > In the second approach, I'm pushing the problem of how to assemble the
    > desired effect from building blocks onto implementers or, worse, the end
    > users. That process quickly becomes non-intuitive, as the building blocks
    > give no hint about how they must be assembled.
    >
    > Kana and Width folding and their interaction (and interaction with NFx) are
    > another good set of examples where this problem shows up.
    >
    > One problem with the 'building blocks' approach when it comes to foldings
    > is that foldings effectively have a domain of operation (characters outside
    > the domain are unaffected). However, certain oft-used primitives (e.g.
    > decomposition) have a different domain of operation than common foldings
    > (kana folding or width folding). By insisting on a chain of atomic
    > operations, the domain of data that's affected increases (it becomes the
    > union of the individual domains).
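
    A concrete instance of that growth, as a small Python sketch (the WIDTH
    table stands in for a real width folding, with one illustrative entry):

        import unicodedata

        WIDTH = {"\uFF21": "A"}   # FULLWIDTH LATIN CAPITAL LETTER A => A

        def width_fold(s):
            return "".join(WIDTH.get(c, c) for c in s)

        s = "\u00E9\uFF21"   # e WITH ACUTE + FULLWIDTH A

        # Width folding alone leaves the accented letter untouched:
        print(ascii(width_fold(s)))                           # '\u00e9A'

        # Chained after NFD, the accented letter gets decomposed too, even
        # though width folding by itself would never have affected it:
        print(ascii(width_fold(unicodedata.normalize("NFD", s))))   # 'e\u0301A'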
    >
    > A./