Complex Combining

From: Peter Kirk
Date: Thu Nov 27 2003 - 11:11:55 EST

    On 27/11/2003 05:00, Philippe Verdy wrote:

    > ...
    > encircle(<DIGIT 9, DIGIT 2, DIGIT 3, DOT, DIGIT 0>)
    > DIGIT 0>
    > Here you don't have any ZWJ character, that's the double diacritic which
    > creates explicitly the ligature between the previous and next base
    > All these solutions are not specified in the standard. This is a pure
    > convention of use of Unicode, and until there's some enhancement published
    > in the Unicode character model, to clearly create ranges of characters on
    > which diacritics can be applied, without the too simple ZWJ control, this
    > interpretation of such encoded text will remain application-dependent.

    This is all rather interesting speculation. There are surely a lot of
    potential cases in scripts where some kind of combining mark can be
    considered as applying to a sequence of an arbitrary number of
    characters. For example:

    - Enclosing circles, squares and ellipses.
    - Continuous underlines and overlines.
    - Continuous tildes, slurs, contour tone marks etc., which may apply to
      several characters or whole words.
    - The cartouche in Egyptian hieroglyphs, which surrounds a group of
      several characters.
    - Some mathematical constructs, e.g. fraction dividers and extensions
      to root signs.
    - Combining marks which are supposed to be centred over or under two or
      more characters or even a whole word, like the Hebrew masora circle.

    Now I am sure it could be argued that some of these are not plain text
    and so should be dealt with by higher-level markup. But maybe some of
    these need to be considered as part of plain text; for example, it is at
    least conceivable, and arguably true of the Egyptian cartouche, that
    these marks are required for proper understanding of the plain text,
    just as much as regular letters and combining marks.

    So how should they be represented? Philippe's suggestion of <c1, mark,
    c2, mark, c3, mark... mark, cn> would seem to work, but could be very
    inefficient, since n characters need n-1 copies of the mark. Jill's
    alternative <bracket1, c1, c2, c3... cn, bracket2, mark> is more
    efficient for long sequences, with a fixed cost of three extra
    characters. But perhaps better would be to have paired opening and
    closing marks: <mark1, c1, c2, c3... cn, mark2>, although this requires
    a new pair of characters for each such case.
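
    As a rough illustration of the efficiency point, the three layouts can
    be compared by counting the extra code points each one adds. This is
    only a sketch: the Private Use Area code points below are arbitrary
    placeholders (no such spanning marks are encoded in Unicode), and the
    function names are invented for this example.

```python
# Hypothetical placeholders taken from the Private Use Area. These are NOT
# real Unicode assignments; they only stand in for the proposed characters.
MARK = "\uE000"      # mark repeated between bases / trailing mark
BRACKET1 = "\uE001"  # opening bracket in the bracketed layout
BRACKET2 = "\uE002"  # closing bracket in the bracketed layout
MARK1 = "\uE003"     # opening mark in the paired layout
MARK2 = "\uE004"     # closing mark in the paired layout


def interleaved(chars):
    """<c1, mark, c2, mark, ... mark, cn>: one mark between each pair."""
    return MARK.join(chars)


def bracketed(chars):
    """<bracket1, c1, ... cn, bracket2, mark>: brackets plus one mark."""
    return BRACKET1 + "".join(chars) + BRACKET2 + MARK


def paired(chars):
    """<mark1, c1, ... cn, mark2>: an opening and a closing mark."""
    return MARK1 + "".join(chars) + MARK2


def overhead(encode, chars):
    """Number of code points added on top of the base characters."""
    return len(encode(chars)) - len(chars)


if __name__ == "__main__":
    for n in (2, 5, 20):
        chars = "x" * n
        print(n, overhead(interleaved, chars),
              overhead(bracketed, chars), overhead(paired, chars))
```

    For a five-character sequence such as the "923.0" of Philippe's
    example, the interleaved form adds four marks, the bracketed form three
    characters and the paired form two; only the interleaved cost grows
    with the length of the sequence.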

    Peter Kirk
