Re: How to make "oo" with combining breve/macron over pair?

From: Mark Davis (mark@macchiato.com)
Date: Tue Mar 05 2002 - 22:36:15 EST


> a. Modify the grapheme cluster boundary rules to account for
> X CGJ NSM as a grapheme cluster.
>
> b. Change CGJ from Mn to Me.

It doesn't even need (a) to make this work. Because the committee
changed NSMs to be ignorable, X NSM* CGJ NSM* Y doesn't break. So only
(b) would need to be changed.
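
A toy sketch of that point, in Python. It assumes a deliberately simplified pair of rules, which is only a reading of this thread and not the actual Unicode 3.2 boundary algorithm: never break before a nonspacing or enclosing mark, and never break after CGJ, with any marks following the CGJ ignored. The function name toy_clusters is hypothetical.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def is_mark(ch):
    # Mn and Me stand in for "NSM" in this toy model.
    return unicodedata.category(ch) in ("Mn", "Me")


def toy_clusters(text):
    """Split text with the simplified rule: X NSM* CGJ NSM* Y does not break."""
    clusters = []
    link_pending = False  # a CGJ has been seen and no base character since
    for ch in text:
        if clusters and (is_mark(ch) or link_pending):
            clusters[-1] += ch  # no boundary before this character
        else:
            clusters.append(ch)  # boundary: start a new cluster
        if ch == CGJ:
            link_pending = True
        elif not is_mark(ch):
            link_pending = False  # a base consumes the link; marks do not
    return clusters


seq = "o" + CGJ + "o" + CGJ + "\u0306"  # <o, CGJ, o, CGJ, combining breve>
print(len(toy_clusters("n" + seq)))     # "n" alone, then the whole sequence
```

Under these assumed rules the whole <o, CGJ, o, CGJ, combining breve> sequence stays in one cluster with no rule specific to X CGJ NSM.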

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: <david.hopwood@zetnet.co.uk>
Cc: <unicode@unicode.org>; <kenw@sybase.com>
Sent: Tuesday, March 05, 2002 18:00
Subject: Re: How to make "oo" with combining breve/macron over pair?

> David Hopwood said:
>
> > Kenneth Whistler wrote:
> > > Kent Karlsson's suggestion:
> > >
> > > > I vaguely suggested adding
> > > > an enclosing (in some sense) invisible combining character to
> > > > solve this: <o, CGJ, o, invisible-enclosing, combining breve>.
> > > > No character has been designated for such use, though. And I
> > > > haven't made a formal proposal yet.
> > >
> > > (i.e. create a generic way to make a non-enclosing combining mark
> > > apply to a grapheme cluster, by encoding an invisible enclosing
> > > combining mark)
> >
> > For this approach to work, <invisible-enclosing> must have combining
> > class 0, and be in Grapheme_Extend and general category Mn.
>
> Actually, it must be general category Me, since that is what indicates
> a combining *enclosing* mark.
>
> > Because it
> > involves a new character, it can't be included in the standard until
> > Unicode 3.3,
>   ^^^^^^^^^^^
>
> Aghhh! Don't even introduce that nasty concept. The UTC and the
> editorial committee are already working on the Unicode 4.0 book
> draft, and many people would be sorely tempted to quit in disgust
> if we had to produce yet another UAX for Unicode 3.3 before
> 4.0 was finished!
>
> >
> > An alternative is to use CGJ itself for <invisible-enclosing>, i.e.
> > <o, CGJ, o, CGJ, combining breve>. This works because:
> >
> > - CGJ has combining class 0, so it prevents the breve from composing
> >   with the second o.
> > - CGJ has general category Mn and is invisible, as required.
>
> It currently has general category Mn, but would have to be changed
> to Me to make this work.
>
> > - it is straightforward to modify the grapheme breaking rules to
> >   treat this as a single cluster, by adding the rule "Link Extend".
> >   (This assumes the corrections to the other rules that I described
> >   in my comments.)
>
> Actually, I am finding myself attracted to the parsimony of this
> approach. In answer to Rick's suggestion to just encode the two we
> know about and be done with it, and his concern that we are headed here
> for terminal Markupville, note the following:
>
> 1. Rendering applications already have to deal with combining
> enclosing marks (well, at least if they choose to support them).
> That means identifying what they enclose, and then adjusting any
> following combining mark to apply to the enclosure. (cf. TUS 3.0,
> p. 50). If the CGJ is just an invisible combining enclosing mark,
> then effectively it encloses the (invisible) bounding box of
> the preceding characters in its scope, and any following
> combining marks are adjusted to apply to that bounding box, which
> is the enclosure. A simple generalization without any new
> architectural implications.
>
> 2. Applications concerned with grapheme cluster boundaries already
> (as of Unicode 3.2, at least) have to deal with the function
> of CGJ in creating grapheme clusters. That is, they will have
> to cope with the modified rules in Unicode 3.2 for grapheme
> cluster boundaries, and the new Grapheme_XXX properties that
> take the CGJ into account.
>
> So no new characters and no new architectural implications. Simply
> two minor tweaks:
>
> a. Modify the grapheme cluster boundary rules to account for
> X CGJ NSM as a grapheme cluster.
>
> b. Change CGJ from Mn to Me.
>
> That appears to be it, and in principle it should solve the
> missing double (or treble) diacritic representation problem
> permanently.
> On the downside, it might be a while before rendering engines
> and font definitions really catch up to it. That is, the whole
> notion of "adjusting" a diacritic to apply to an enclosure is
> fairly sophisticated, since it may involve context-dependent
> rules and arbitrary shape modifications -- not merely moving
> a glyph origin point based on a preceding glyph's metrics.
>
> On the other hand, hacked up fonts for limited dictionary
> usage could be rather quick and easy. For the old Webster's
> pronunciation guides, the entities are really the oomacr
> and oobreve shown in the examples that started this thread.
> Simply preform those entities as glyphs in a font, and map them
> to <o, CGJ, o, CGJ, combining_macron> and
> to <o, CGJ, o, CGJ, combining_breve> respectively. Presto,
> you have a Unicode representation for the text, and a
> reliable font rendering for them, without any fancy-dancing
> about dynamic positional adjustments. The fallback rendering,
> in applications and fonts not wise to the CGJ rules would
> be {o o-macron} and {o o-breve}, which while not exact,
> is at least comprehensible and close enough for gummint work.
>
> I think this might be the way to go, but it is too late to
> sneak into Unicode 3.2, as any such changes clearly would
> require UTC debate and agreement. But it is simple enough that
> it might be accomplished fairly quickly after Unicode 3.2.
>
> --Ken
>
>

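The character properties discussed in the quoted message can be checked against the data shipped with Python's standard unicodedata module. This sketch shows only what that module reports: CGJ (U+034F) has general category Mn and combining class 0, and because of that class 0 the breve in <o, CGJ, o, CGJ, combining breve> does not compose with the second o under NFC. It does not model the proposed change to Me.

```python
import unicodedata

CGJ = "\u034f"    # COMBINING GRAPHEME JOINER
BREVE = "\u0306"  # COMBINING BREVE

# Properties as recorded in the Unicode data bundled with Python.
cgj_category = unicodedata.category(CGJ)
cgj_combining_class = unicodedata.combining(CGJ)
print(cgj_category, cgj_combining_class)

# Without CGJ, the breve composes with the o it follows.
plain = "o" + BREVE
composed = unicodedata.normalize("NFC", plain)

# With CGJ between the second o and the breve, nothing composes.
joined = "o" + CGJ + "o" + CGJ + BREVE
after_nfc = unicodedata.normalize("NFC", joined)
print(len(composed), after_nfc == joined)
```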

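The "hacked up font" idea and its fallback can be sketched the same way. The glyph names oomacr and oobreve come from the quoted message. The render function, the PREFORMED table and the font_knows_cgj flag are hypothetical, and the fallback here simply drops the invisible CGJ and lets the mark attach to the preceding letter, which is one assumed way to arrive at {o o-macron} and {o o-breve}.

```python
import unicodedata

CGJ = "\u034f"     # COMBINING GRAPHEME JOINER
MACRON = "\u0304"  # COMBINING MACRON
BREVE = "\u0306"   # COMBINING BREVE

# Hypothetical font table: whole sequence -> one preformed glyph.
PREFORMED = {
    "o" + CGJ + "o" + CGJ + MACRON: "oomacr",
    "o" + CGJ + "o" + CGJ + BREVE: "oobreve",
}


def render(seq, font_knows_cgj):
    """Return a list of glyph names or characters for seq."""
    if font_knows_cgj and seq in PREFORMED:
        return [PREFORMED[seq]]
    # Fallback: ignore the invisible CGJ, so the mark lands on the second o.
    return list(unicodedata.normalize("NFC", seq.replace(CGJ, "")))


oo_breve = "o" + CGJ + "o" + CGJ + BREVE
print(render(oo_breve, font_knows_cgj=True))
print(render(oo_breve, font_knows_cgj=False))
```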

This archive was generated by hypermail 2.1.2 : Tue Mar 05 2002 - 22:45:19 EST