Re: Character Foldings

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed May 26 2004 - 02:31:35 CDT

    At 05:10 PM 5/25/2004, Mark Davis wrote:
    >I don't think the "fold to base" is as useful as some other information. For
    >those characters with a canonical decomposition, the decomposition carries
    >more information, since you can combine it with a "remove combining marks"
    >folding to get the folding to base.

    I think this would have to be 'remove combining *accents*'. You wouldn't
    want to remove Indic combining marks by force, if what you are interested
    in is Latin/Greek/Cyrillic (L/G/C) style diacritic removal.
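
    A minimal sketch of that distinction in Python (using only the standard
    unicodedata module; the accent ranges below are my assumption, a crude
    stand-in for a real L/G/C accent list, not any specified table):

    import unicodedata

    def remove_combining_marks(s):
        # Strips every mark (categories Mn/Mc/Me) -- too aggressive, since
        # it also removes Indic matras, which aren't diacritics in this sense.
        s = unicodedata.normalize("NFD", s)
        return "".join(c for c in s
                       if not unicodedata.category(c).startswith("M"))

    # Assumed accent ranges: Combining Diacritical Marks (U+0300..U+036F)
    # plus the Cyrillic combining marks (U+0483..U+0489).
    ACCENTS = set(range(0x0300, 0x0370)) | set(range(0x0483, 0x048A))

    def remove_combining_accents(s):
        s = unicodedata.normalize("NFD", s)
        return "".join(c for c in s if ord(c) not in ACCENTS)

    print(remove_combining_marks("caf\u00E9"))      # 'cafe'
    print(remove_combining_marks("\u0915\u093E"))   # 'क'  -- matra lost
    print(remove_combining_accents("caf\u00E9"))    # 'cafe'
    print(remove_combining_accents("\u0915\u093E")) # 'का' -- matra kept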

    >For my part, what would be more interesting would be a "full" decomposition of
    >the characters that don't have a canonical decomposition, e.g.
    >
    >LATIN CAPITAL LETTER O WITH STROKE => O + /

    I believe that when we first discussed this for TR30 it was mentioned that
    there are characters with diacritic-like features for which there are no
    combining accents, because we deemed them not productive enough, and too
    intractable for rendering purposes.

    For those characters you wouldn't be able to make a true decomposition, but
    the base character may still be well-defined.
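
    To illustrate (the entries below are invented for illustration, not taken
    from any draft table): a supplementary 'full decomposition' can map a
    character without a canonical decomposition to base plus combining
    character where one exists, while a plain fold-to-base entry covers the
    cases where no combining counterpart was encoded:

    # Hypothetical supplementary data, for illustration only.
    FULL_DECOMPOSITION = {
        # O WITH STROKE -> O + COMBINING LONG SOLIDUS OVERLAY
        "\u00D8": "O\u0338",
    }
    FOLD_TO_BASE = {
        # B WITH HOOK -> b; assuming no suitable combining character for
        # the hook, only the base character can be recorded.
        "\u0253": "b",
    }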

    I don't see where the decomposition would provide more information -
    nobody suggests getting rid of it. The problem is, as I mentioned on the
    Unicore list, how to combine flexibility for technically savvy implementers
    with specifications of foldings that are based on the (linguistic) facets
    that define the equivalence class.

    This is in fact a good example: if I want to fold characters to their base
    form, so that a search term can be typed either on a keyboard that doesn't
    have accents or by a user who doesn't know which accent is correct, I can
    proceed in two ways: I can create a one-stop-shopping folding that takes
    any Unicode data stream and produces the desired result. Or I can string
    together a number of building blocks, e.g. first normalize to NFD, then
    'decompose' fully, then remove accents.
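
    A sketch of the two approaches side by side (the tables here are
    toy-sized and hypothetical; a real one-stop table would run to thousands
    of entries):

    import unicodedata

    ACCENTS = set(range(0x0300, 0x0370))   # assumed accent range
    FULL_DECOMP = {"\u00D8": "O\u0338"}    # hypothetical supplement

    # Approach 1: one-stop-shopping table, applied in a single pass.
    # Its entries duplicate information that NFD already carries.
    ONE_STOP = {"\u00E9": "e", "\u00D8": "O", "\u0253": "b"}

    def fold_one_stop(s):
        return "".join(ONE_STOP.get(c, c) for c in s)

    # Approach 2: chained building blocks; knowing the right assembly
    # order (NFD, then full decomposition, then accent removal) is left
    # to whoever strings them together.
    def fold_chained(s):
        s = unicodedata.normalize("NFD", s)
        s = "".join(FULL_DECOMP.get(c, c) for c in s)
        return "".join(c for c in s if ord(c) not in ACCENTS)

    print(fold_one_stop("\u00D8\u00E9"))   # 'Oe'
    print(fold_chained("\u00D8\u00E9"))    # 'Oe' -- same result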

    In the first approach, tables will contain duplicate entries. I've pushed
    the problem of how to factor this onto the implementer (but given that all
    the information is there, implementers could use semi-automated tools to
    create an ad-hoc factoring).
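
    Continuing the fragment above (factor is an invented helper, not from any
    tool): such a semi-automated factoring could drop every table entry the
    building-block pipeline already derives, keeping only the residue:

    def factor(one_stop, pipeline):
        # Keep only the entries the pipeline cannot reproduce.
        return {src: dst for src, dst in one_stop.items()
                if pipeline(src) != dst}

    print(factor(ONE_STOP, fold_chained))   # {'\u0253': 'b'} -- the é and Ø
                                            # entries drop out as derivable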

    In the second approach, I'm pushing the problem of how to assemble the
    desired effect from building blocks onto implementers or, worse, the end
    users. That process quickly becomes non-intuitive, as the building blocks
    give no hint about how they must be assembled.

    Kana and Width folding and their interaction (and interaction with NFx) are
    another good set of examples where this problem shows up.
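
    A concrete illustration (the katakana-to-hiragana shift below is a
    deliberately crude stand-in for a real kana folding): kana folding, as
    usually specified, operates on the standard-width kana block, so applying
    it before width folding silently misses halfwidth katakana:

    import unicodedata

    def width_fold(s):
        # NFKC folds halfwidth katakana to standard width, composing
        # base + halfwidth voicing mark along the way.
        return unicodedata.normalize("NFKC", s)

    def kana_fold(s):
        # Crude katakana -> hiragana: shift U+30A1..U+30F6 down by 0x60.
        return "".join(chr(ord(c) - 0x60)
                       if 0x30A1 <= ord(c) <= 0x30F6 else c
                       for c in s)

    s = "\uFF76\uFF9E"                  # halfwidth KA + voicing mark
    print(kana_fold(width_fold(s)))     # 'が' -- width first: fully folded
    print(width_fold(kana_fold(s)))     # 'ガ' -- kana fold saw nothing to do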

    One problem with the 'building blocks' approach when it comes to foldings
    is that foldings effectively have a domain of operation (characters outside
    the domain are unaffected). However, certain oft-used primitives (e.g.
    decomposition) have a different domain of operation than common foldings
    (kana folding or width folding). By insisting on a chain of atomic
    operations, the domain of data that's affected increases (it becomes the
    superset of the domains of the individual operations).
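
    For instance (a sketch; the width_fold below is a narrower stand-in that
    touches only the Halfwidth and Fullwidth Forms block): width folding alone
    leaves 'café' untouched, but putting NFD in front of it as an atomic first
    step decomposes the é even though no width variant is in sight:

    import unicodedata

    def width_fold(s):
        # Stand-in: fold only the Halfwidth and Fullwidth Forms block.
        return "".join(unicodedata.normalize("NFKC", c)
                       if 0xFF00 <= ord(c) <= 0xFFEF else c
                       for c in s)

    def chained(s):
        # NFD first widens the affected domain to everything decomposable.
        return width_fold(unicodedata.normalize("NFD", s))

    s = "caf\u00E9"              # contains no width variants at all
    print(width_fold(s) == s)    # True  -- outside the folding's domain
    print(chained(s) == s)       # False -- NFD decomposed the é anyway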

    A./


