re: Using Combining Double Breve and expressing characters perhaps as if struck out.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jul 24 2010 - 03:07:59 CDT


    > Message of 24/07/10 09:02
    > From: "William_J_G Overington" <wjgo_10009@btinternet.com>
    > To: unicode@unicode.org
    > Cc: wjgo_10009@btinternet.com
    > Subject: Using Combining Double Breve and expressing characters perhaps as if struck out.
    >
    >
    > I have been looking at the following thread, which is entitled "Making Fonts with Diacritical Marks for Phonetics".
    >
    > http://forum.high-logic.com/viewtopic.php?f=3&t=3169
    >
    > I am writing here to ask two questions please in relation to the Unicode aspects of the problem.
    >
    > I have looked at http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf in section 2.11 Combining Characters (page 36 of the pdf) and at http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf in section 3.6 Combination (page 24 of the pdf).
    >
    > In http://www.unicode.org/charts/PDF/U0300.pdf there is U+035D COMBINING DOUBLE BREVE and there is U+035E COMBINING DOUBLE MACRON.
    >
    > In http://www.unicode.org/charts/PDF/U0000.pdf there is U+006F LATIN SMALL LETTER O.
    >
    > How does one express two letters LATIN SMALL LETTER O with a combining double breve in a Unicode plain text document please?

    First encode each base (unjoined) extended grapheme cluster
    separately (possibly with its own diacritics, extenders or
    prepended characters, including ZWJ and ZWNJ, as defined in the
    UAX on text segmentation).

    Then encode the double diacritic between them.

    So for your examples you get <006F, 035D, 006F> (double breve) or
    <006F, 035E, 006F> (double macron).
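
    As a minimal sketch of the two sequences above, the following Python
    builds them and lists the code points with their character names.
    The helper name describe() is only an illustration, not part of any
    library.

```python
import unicodedata

DOUBLE_BREVE = "\u035D"   # COMBINING DOUBLE BREVE
DOUBLE_MACRON = "\u035E"  # COMBINING DOUBLE MACRON


def describe(text):
    """Return (code point, character name) pairs for each character in text.

    describe() is a hypothetical helper written for this example.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The double diacritic is encoded between the two base letters.
oo_breve = "o" + DOUBLE_BREVE + "o"
oo_macron = "o" + DOUBLE_MACRON + "o"

for label, text in (("double breve", oo_breve), ("double macron", oo_macron)):
    print(label)
    for code_point, name in describe(text):
        print("   ", code_point, name)
```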

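    Double diacritics have a non-zero canonical combining class (234
    for U+035D and U+035E), so within the first cluster they sort after
    ordinary above marks (class 230) under canonical reordering. The base
    letter that follows has combining class 0, so it blocks reordering
    across clusters, and the relative order and independence of the base
    grapheme clusters are preserved during normalization.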
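
    The stability under normalization can be checked with a small
    sketch, assuming Python's unicodedata module as the reference
    implementation. It also prints the canonical combining class that
    the local Unicode database reports, so that the value can be checked
    rather than assumed. The helper normal_forms() is a name chosen for
    this example.

```python
import unicodedata

oo_breve = "o\u035Do"  # <006F, 035D, 006F>


def normal_forms(text):
    """Map each normalization form name to the normalized string.

    normal_forms() is a hypothetical helper written for this example.
    """
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


forms = normal_forms(oo_breve)
for form, result in forms.items():
    # Show the code points and whether the sequence survived unchanged.
    print(form, [f"{ord(ch):04X}" for ch in result], result == oo_breve)

# Canonical combining classes as reported by the local Unicode database.
print("ccc of U+035D:", unicodedata.combining("\u035D"))
print("ccc of U+006F:", unicodedata.combining("o"))
```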

    As a consequence, if another diacritic is to be added on top of the
    double diacritic, it can only be added at the end of this sequence.
    The bad thing is that it then appears just after the encoding of the
    second base grapheme cluster, so it is subject to reordering there
    and will be interpreted as part of the second grapheme cluster.

    Currently you cannot add another diacritic on top of a double
    diacritic: we lack something that blocks its interpretation as part
    of the second cluster.

    To do that, we would need another base character with combining
    class 0 (blocking canonical reordering) that would have the same
    grouping semantics as the double diacritics. This character would be
    abstract (and invisible by itself) and could be something like:

      U+xyzt DOUBLE DIACRITIC HOLDER

    For example to add an acute accent above the double breve joining the
    two letters 'o', we would encode:

      <006F, 035D, 006F, xyzt, 0301>

    instead of just <006F, 035D, 006F, 0301> which is canonically
    equivalent to <006F, 035D, 00F3> and which encodes the letter 'o' and
    the letter 'o' with an acute accent (centered on this second o) joined
    with the double breve *above* the acute accent of the second 'o'.
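
    The canonical equivalence mentioned here can be verified with a
    short sketch, again assuming Python's unicodedata module. The helper
    canonically_equivalent() is a name made up for this example; it
    simply compares the NFD forms of both strings.

```python
import unicodedata

without_holder = "o\u035Do\u0301"  # <006F, 035D, 006F, 0301>
precomposed = "o\u035D\u00F3"      # <006F, 035D, 00F3>


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match.

    canonically_equivalent() is a hypothetical helper for this example.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(canonically_equivalent(without_holder, precomposed))

# NFC attaches the acute accent to the second 'o', giving U+00F3.
nfc = unicodedata.normalize("NFC", without_holder)
print([f"{ord(ch):04X}" for ch in nfc])
```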

    My opinion is that such a double diacritic holder already exists:
    it's ZWJ, which could safely be used as the needed invisible base for
    additional diacritics occurring on top of (and centered on) a double
    diacritic. But others may prefer a different character.

    I don't know whether ZWJ has been specified so that it could occur
    directly before a "defective" combining sequence containing only
    combining diacritics. In this case, it would mean that the combining
    diacritics encoded after it apply to the whole of the extended
    grapheme cluster encoded before it.
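
    As a sketch of what this suggestion changes at the encoding level
    only (it says nothing about how a renderer would display it), the
    following Python checks that the acute accent is no longer composed
    with the second 'o' once ZWJ is placed before it. Using ZWJ for this
    purpose is the assumption discussed here, not specified behavior.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER, used here as the assumed holder

# <006F, 035D, 006F, 200D, 0301>
with_zwj = "o\u035Do" + ZWJ + "\u0301"

nfc = unicodedata.normalize("NFC", with_zwj)
print([f"{ord(ch):04X}" for ch in nfc])

# The acute accent now follows ZWJ, so it is not merged into U+00F3.
print("acute still separate:", "\u00F3" not in nfc)

# ZWJ has combining class 0, so it separates the accent from the 'o'.
print("ccc of ZWJ:", unicodedata.combining(ZWJ))
```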

    This use of ZWJ effectively allows the encoded sequence to be
    interpreted as if it were written in TeX syntax:

      \acute{ \breve{oo} }

    Without the ZWJ, it would be interpreted as:

      \breve{ o\acute{o} }

    The double diacritics are just intended to be used between the base
    grapheme clusters to be joined. They could possibly also be used to
    group more than two base graphemes, for example three 'o' as:

      <006F, 035D, 006F, 035D, 006F>

    interpreted in TeX syntax as: \breve{ooo}
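
    The grouping of several bases can be sketched as a simple join,
    where the double diacritic is placed between each pair of adjacent
    base clusters. The function join_with_double() is a hypothetical
    helper for this example, and whether renderers draw one breve over
    all three letters is not something this code can show.

```python
def join_with_double(bases, mark="\u035D"):
    """Place the double diacritic between each pair of adjacent bases.

    join_with_double() is a hypothetical helper; 'bases' is a list of
    already encoded base grapheme clusters, 'mark' the double diacritic.
    """
    return mark.join(bases)


ooo = join_with_double(["o", "o", "o"])
print([f"{ord(ch):04X}" for ch in ooo])

# The same helper with the double macron and only two bases.
oo_macron = join_with_double(["o", "o"], mark="\u035E")
print([f"{ord(ch):04X}" for ch in oo_macron])
```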

    But even in this case, you won't be able to encode in plain text,
    even with the ZWJ trick, groupings like this one expressed in TeX:

      \breve{ \breve{oo} x \breve{ o\acute{o} } }

    This is because double diacritics encoded in Unicode can't be safely
    stacked together; for such applications you'll need a rich-text layer
    on top of Unicode, such as TeX here.

    Philippe.


