combining diacritics for multi-char graphemes (WAS: triple diacritic (sch with ligature tie in a German dialect writin

From: Andr Szabolcs Szelp (
Date: Wed Jun 14 2006 - 05:12:46 CDT



    [This mail contains some UTF-8 encoded IPA characters]

    A discussion has arisen about a "combining triple (width) breve below" for German 'sch' in a very particular context (phonetic transcription).

    The question was raised whether this "combining triple breve below" is productive or not, in order to decide whether to encode it as a combining diacritic or as a special character with lower-, title- and upper-case forms.
    Some other examples, in French and Greek, have already been found.

    Now, the question should be generalised, IMHO.

    There _exist_ writing systems which, in general, require diacritical marks to be placed on graphemes, not merely on single characters. One such example was the tie below the 'sch', noted above.

    A quite elaborate system in its use of combining diacritics on graphemes is the system of Hungarian dialectological transcription (magyar egyezményes átírás/hangjelölés).

    In Standard Hungarian we have many single characters (including some vowels with diacritics), quite a lot of digraphs (IPA equivalents in brackets) (ny[ɲ], ty[c], gy[ɟ], sz[s], zs[ʒ], cs[ʧ], dz[ʣ]) and one trigraph (dzs[ʤ]), all of them standing for consonants. In the standard orthography, however, diacritics are used on vowels only.
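
    To make the notion of a multi-character grapheme concrete, here is a minimal sketch that splits a lower-case word into graphemes by greedy longest match. The table contains only the digraphs and the trigraph listed above, with the IPA values given there; the names MULTIGRAPHS and split_graphemes are made up for this illustration. Greedy matching is an assumption and can mis-segment words in which adjacent letters belong to different graphemes (e.g. at compound boundaries).

```python
# Multi-character graphemes listed in the message, with the IPA values given there.
MULTIGRAPHS = {
    "dzs": "ʤ",
    "ny": "ɲ", "ty": "c", "gy": "ɟ", "sz": "s",
    "zs": "ʒ", "cs": "ʧ", "dz": "ʣ",
}

def split_graphemes(word):
    """Split a lower-case word into graphemes, preferring the longest match."""
    graphemes = []
    i = 0
    while i < len(word):
        for size in (3, 2):              # try the trigraph first, then digraphs
            chunk = word[i:i + size]
            if chunk in MULTIGRAPHS:
                graphemes.append(chunk)
                i += len(chunk)
                break
        else:                            # no multigraph starts here
            graphemes.append(word[i])
            i += 1
    return graphemes

print(split_graphemes("dzsem"))   # ['dzs', 'e', 'm']
print(split_graphemes("kutya"))   # ['k', 'u', 'ty', 'a']
```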

    In dialectological use, a particular transcription system has been employed for decades and has a long tradition; it can be found in a multitude of publications (unfortunately, I have no scanner available). Besides some extensions (such as the oe-ligature and small capital D, G, K, ...), this system mainly uses diacritics to indicate a realisation that differs from the standard.
    Diacritic stacking is used extensively as well, because the "base" characters of the phonetic alphabet (i.e. the graphemes of Standard Hungarian) may themselves already contain diacritics graphically.

    Thus diacritics can be applied to consonants as well, including the multi-character graphemes.

    One example is the Acute Accent, which indicates palatalisation (similar in function to the palatalisation hook of IPA); another is the Grave Accent, which indicates half-length (similar in function to the IPA half-length mark, a single raised small triangle). Bridge Below and Inverted Bridge Below are used as well.

    The Grave Accent, for example, can stand on any grapheme, vowels and consonants alike, including the di- and trigraphs. The Acute Accent often stands on 'sz', 'zs', 'cs', 'dz' and 'dzs' (and of course on 'z', 's', 'c', 'n', 't', 'd', but these are of no concern here, being one-character graphemes).

    These combining diacritics must not be placed on either part of the digraph, but right in the centre. (Naturally, in the case of the only trigraph, this will _graphically_ coincide with the sequence D, Z WITH ACUTE ACCENT, S, but logically it does not!)

    The presentation is important and carries information: (sz) with acute would be IPA [s U+0321], whereas s + z with acute would be IPA [ʃ z U+0321], and s with acute + z would render the phonetic sequence [ʃ U+0321 z].

    In the case of the Grave Accent, placing the Grave on the 'c' of 'cs' would mean [ʦˑʃ], instead of the [ʧˑ] meant by a Grave placed atop the whole grapheme.


    I may have written too long an introduction, but the point I am trying to make is that there are more productive examples of combining diacritics on di- and trigraphs (grapheme clusters) than just the ones mentioned so far.

    I believe that a mechanism for encoding diacritics on grapheme clusters should be made available in Unicode.

    This _could_ be realised (although it is not the use intended by the current standard) with U+034F COMBINING GRAPHEME JOINER, as follows:
    the sequence U+0073 U+007A U+0301 is rendered with the acute above the 'z' (graphically equivalent to U+0073 U+017A; normal behaviour), whereas U+0073 U+034F=CGJ U+007A U+0301 is rendered with the acute centred on top of the (sz) as a grapheme.
    Similarly, the German 'sch' example could be encoded as U+0073 U+034F U+0063 U+034F U+0068 U+032E [sic!]. (The simple 'combining breve below' could be used in this scenario, discouraging the use of the double breve, which itself has always been admitted not to fit the logic of Unicode combining marks.)
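
    The reading suggested here can be sketched in a few lines of Python. This is only an illustration of the proposed interpretation, not behaviour defined by the Unicode Standard: letters joined by CGJ are treated as one base, and the combining marks that follow are attached to the whole base. The function name mark_scopes is made up for this sketch; unicodedata.combining() is the standard-library call that returns a character's canonical combining class (non-zero for the marks used here).

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

def mark_scopes(text):
    """Group text into (base, marks) pairs under the proposed reading:
    letters joined by CGJ form one base; following marks apply to all of it.
    A mark with no preceding base is silently dropped in this sketch."""
    units = []
    base, marks = "", ""
    join_next = False
    for ch in text:
        if ch == CGJ:
            join_next = True                  # next letter extends the base
        elif unicodedata.combining(ch):       # non-zero class: combining mark
            marks += ch
        elif join_next:
            base += ch
            join_next = False
        else:                                 # a new base starts here
            if base:
                units.append((base, marks))
            base, marks = ch, ""
    if base:
        units.append((base, marks))
    return units

# acute on 'z' only (normal behaviour)
print(mark_scopes("\u0073\u007a\u0301"))
# acute on the grapheme (sz) as a whole
print(mark_scopes("\u0073\u034f\u007a\u0301"))
# breve below on the whole 'sch'
print(mark_scopes("\u0073\u034f\u0063\u034f\u0068\u032e"))
```

    Under this reading the first sequence yields two units, 's' and 'z' with acute, while the second and third each yield a single unit carrying the mark.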

    Now, before I get reproaches: I do realise that the use of CGJ might theoretically have some undesired effects on existing data (although I believe that such use, especially in the example cited next, is highly restricted, if extant at all). I understand that if the Croatian grapheme U+01C6 is represented as U+0064 U+007A U+030C and a collation-sensitive encoding includes U+034F between the 'd' and the 'z', this would have undesirable effects. But shouldn't it then include a CGJ between the 'z' and the Combining Caron as well (see the example under the term 'grapheme cluster' at:), making the encoding unambiguous (and making the current proposal unproblematic)?
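
    The decomposition referred to above can be checked with Python's standard unicodedata module. The sketch below shows only what normalization does to the code points; it says nothing about collation or rendering, and the helper name code_points is made up for the illustration.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DZ_CARON = "\u01c6"   # LATIN SMALL LETTER DZ WITH CARON

def code_points(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Compatibility decomposition of the Croatian digraph character.
decomposed = unicodedata.normalize("NFKD", DZ_CARON)
print(code_points(decomposed))                                   # U+0064 U+007A U+030C

# Without CGJ, recomposition joins 'z' and the caron.
print(code_points(unicodedata.normalize("NFKC", decomposed)))    # U+0064 U+017E

# With CGJ between 'd' and 'z', the CGJ survives normalization,
# so the two spellings remain distinct strings.
with_cgj = "d" + CGJ + "z\u030c"
print(code_points(unicodedata.normalize("NFKC", with_cgj)))      # U+0064 U+034F U+017E
```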

    Now, I am not actually proposing a concrete solution to the problem by extending the defined field of use of CGJ; it merely seemed to be a character apt for this (a newly introduced format character would do as well). Rather, I am proposing that we think about a way of encoding combining diacritics on di-, tri- and polygraphs. There seems to be demand: as previous and current examples have shown, they are productive.

    I would, of course, be grateful for proposals of other viable encoding solutions to the above problem.

    All the best,
        Szabolcs Szelp

    This archive was generated by hypermail 2.1.5 : Wed Jun 14 2006 - 05:29:30 CDT