Re: Character Foldings

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed May 26 2004 - 02:31:35 CDT

    At 05:10 PM 5/25/2004, Mark Davis wrote:
    >I don't think the "fold to base" is as useful as some other information. For
    >those characters with a canonical decomposition, the decomposition carries
    >more information, since you can combine it with a "remove combining marks"
    >folding to get the folding to base.

    I think this would have to be 'remove combining *accents*'. You wouldn't
    want to remove Indic combining marks by force, if what you are interested
    in is Latin/Greek/Cyrillic (L/G/C) style diacritic removal.
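
    A minimal sketch of that distinction in Python (using only the standard
    unicodedata module; the accent ranges below are my assumption, a crude
    stand-in for a real L/G/C accent list, not any specified table):

    import unicodedata

    def remove_combining_marks(s):
        # Strips every mark (categories Mn/Mc/Me) -- too aggressive, since
        # it also removes Indic matras, which aren't diacritics in this sense.
        s = unicodedata.normalize("NFD", s)
        return "".join(c for c in s
                       if not unicodedata.category(c).startswith("M"))

    # Assumed accent ranges: Combining Diacritical Marks (U+0300..U+036F)
    # plus the Cyrillic combining marks (U+0483..U+0489).
    ACCENTS = set(range(0x0300, 0x0370)) | set(range(0x0483, 0x048A))

    def remove_combining_accents(s):
        s = unicodedata.normalize("NFD", s)
        return "".join(c for c in s if ord(c) not in ACCENTS)

    print(remove_combining_marks("caf\u00E9"))      # 'cafe'
    print(remove_combining_marks("\u0915\u093E"))   # 'क'  -- matra lost
    print(remove_combining_accents("caf\u00E9"))    # 'cafe'
    print(remove_combining_accents("\u0915\u093E")) # 'का' -- matra kept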

    >For my part, what would be more interesting would be a "full" decomposition of
    >the characters that don't have a canonical decomposition, e.g.
    >
    >LATIN CAPITAL LETTER O WITH STROKE => O + /

    I believe that when we first discussed this for TR30 it was mentioned that
    there are characters with diacritic-like features for which there are no
    combining accents, because we deemed them not productive enough, and too
    intractable for rendering purposes.

    For those characters you wouldn't be able to make a true decomposition, but
    the base character may still be well-defined.
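
    To illustrate (the entries below are invented for illustration, not taken
    from any draft table): a supplementary 'full decomposition' can map a
    character without a canonical decomposition to base plus combining
    character where one exists, while a plain fold-to-base entry covers the
    cases where no combining counterpart was encoded:

    # Hypothetical supplementary data, for illustration only.
    FULL_DECOMPOSITION = {
        # O WITH STROKE -> O + COMBINING LONG SOLIDUS OVERLAY
        "\u00D8": "O\u0338",
    }
    FOLD_TO_BASE = {
        # B WITH HOOK -> b; assuming no suitable combining character for
        # the hook, only the base character can be recorded.
        "\u0253": "b",
    }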

    I don't see where the decomposition would provide more information -
    nobody suggests getting rid of it. The problem is, as I mentioned on the
    Unicore list, how to combine flexibility for technically savvy implementers
    with specifications of foldings that are based on the (linguistic) facets
    that define the equivalence class.

    This is in fact a good example: if I want to fold characters to their base
    form, so that a search term can be typed either on a keyboard that doesn't
    have accents or by a user who doesn't know which accent is correct, I can
    proceed in two ways: I can create a one-stop-shopping folding that takes
    any Unicode data stream and produces the desired result. Or I can string
    together a number of building blocks, e.g. first normalize to NFD, then
    'decompose' fully, then remove accents.
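
    A sketch of the two approaches side by side (the tables here are
    toy-sized and hypothetical; a real one-stop table would run to thousands
    of entries):

    import unicodedata

    ACCENTS = set(range(0x0300, 0x0370))   # assumed accent range
    FULL_DECOMP = {"\u00D8": "O\u0338"}    # hypothetical supplement

    # Approach 1: one-stop-shopping table, applied in a single pass.
    # Its entries duplicate information that NFD already carries.
    ONE_STOP = {"\u00E9": "e", "\u00D8": "O", "\u0253": "b"}

    def fold_one_stop(s):
        return "".join(ONE_STOP.get(c, c) for c in s)

    # Approach 2: chained building blocks; knowing the right assembly
    # order (NFD, then full decomposition, then accent removal) is left
    # to whoever strings them together.
    def fold_chained(s):
        s = unicodedata.normalize("NFD", s)
        s = "".join(FULL_DECOMP.get(c, c) for c in s)
        return "".join(c for c in s if ord(c) not in ACCENTS)

    print(fold_one_stop("\u00D8\u00E9"))   # 'Oe'
    print(fold_chained("\u00D8\u00E9"))    # 'Oe' -- same result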

    In the first approach, tables will contain duplicate entries. I've pushed
    the problem of how to factor this onto the implementer (but given that all
    the information is there, implementers could use semi-automated tools to
    create an ad-hoc factoring).
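
    Continuing the fragment above (factor is an invented helper, not from any
    tool): such a semi-automated factoring could drop every table entry the
    building-block pipeline already derives, keeping only the residue:

    def factor(one_stop, pipeline):
        # Keep only the entries the pipeline cannot reproduce.
        return {src: dst for src, dst in one_stop.items()
                if pipeline(src) != dst}

    print(factor(ONE_STOP, fold_chained))   # {'\u0253': 'b'} -- the é and Ø
                                            # entries drop out as derivable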

    In the second approach, I'm pushing the problem of how to assemble the
    desired effect from building blocks onto implementers or, worse, the end
    users. That process quickly becomes non-intuitive, as the building blocks
    give no hint about how they must be assembled.

    Kana and Width folding and their interaction (and interaction with NFx) are
    another good set of examples where this problem shows up.
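
    A concrete illustration (the katakana-to-hiragana shift below is a
    deliberately crude stand-in for a real kana folding): kana folding, as
    usually specified, operates on the standard-width kana block, so applying
    it before width folding silently misses halfwidth katakana:

    import unicodedata

    def width_fold(s):
        # NFKC folds halfwidth katakana to standard width, composing
        # base + halfwidth voicing mark along the way.
        return unicodedata.normalize("NFKC", s)

    def kana_fold(s):
        # Crude katakana -> hiragana: shift U+30A1..U+30F6 down by 0x60.
        return "".join(chr(ord(c) - 0x60)
                       if 0x30A1 <= ord(c) <= 0x30F6 else c
                       for c in s)

    s = "\uFF76\uFF9E"                  # halfwidth KA + voicing mark
    print(kana_fold(width_fold(s)))     # 'が' -- width first: fully folded
    print(width_fold(kana_fold(s)))     # 'ガ' -- kana fold saw nothing to do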

    One problem with the 'building blocks' approach when it comes to foldings
    is that foldings effectively have a domain of operation (characters outside
    the domain are unaffected). However, certain oft-used primitives (e.g.
    decomposition) have a different domain of operation than common foldings
    (kana folding or width folding). By insisting on a chain of atomic
    operations, the domain of data that's affected increases (it becomes the
    superset of the domains of the individual operations).
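
    For instance (a sketch; the width_fold below is a narrower stand-in that
    touches only the Halfwidth and Fullwidth Forms block): width folding alone
    leaves 'café' untouched, but putting NFD in front of it as an atomic first
    step decomposes the é even though no width variant is in sight:

    import unicodedata

    def width_fold(s):
        # Stand-in: fold only the Halfwidth and Fullwidth Forms block.
        return "".join(unicodedata.normalize("NFKC", c)
                       if 0xFF00 <= ord(c) <= 0xFFEF else c
                       for c in s)

    def chained(s):
        # NFD first widens the affected domain to everything decomposable.
        return width_fold(unicodedata.normalize("NFD", s))

    s = "caf\u00E9"              # contains no width variants at all
    print(width_fold(s) == s)    # True  -- outside the folding's domain
    print(chained(s) == s)       # False -- NFD decomposed the é anyway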

    A./


