Re: Why people still want to encode precomposed letters

From: Karl Pentzlin
Date: Sun Nov 23 2008 - 05:00:23 CST

    On Sunday, 23 November 2008 at 05:45, Doug Ewell wrote:

    >>> (Karl Pentzlin):
    >>> Thus, sequences like U+04E9 U+0304 are NOT appropriate to fulfil the
    >>> user's needs, as long as leading operating systems behave like this
    >>> more than 10 years after Unicode has decided no longer to accept
    >>> precomposed characters.
    >>> Microsoft et al., PLEASE do your homework! Please do it RIGHT NOW!
    DE> I think Karl may have expected that fonts could be developed in such a
    DE> way that combining diacritical marks would be spaced properly above the
    DE> base character, ...

    That is exactly true, if "properly" simply means "in a way that respects the
    formal combining classes, providing a result which can be recognized by the

    DE> more or less by magic.

    Yes, if "magic" is colloquial for "done by a complex and well-designed
    algorithm which may not be obvious to everybody at first glance" -
    something which computer scientists (like me) sometimes produce.

    DE> I used to think that would be
    DE> possible when I knew nothing about font design, ...

    Maybe, but for myself I claim to know at least some of the basics of
    font design. I appreciate it as a fine art where not everybody is gifted
    enough to create a Gentium or Andron, but the technical basics are comprehensible.

    DE> I still think it would be reasonable to expect combining marks like
    DE> macrons and circumflexes to be always centered over the base character,
    DE> not off to the right, even if the vertical spacing is wrong.

    At least, this. This can be accomplished by an algorithm; a very crude
    but working starting point is this: Enclose the base character's glyph in a
    rectangle. Determine its center (geometrically; possible refinement:
    barycentrically). Get the diacritic glyph from the font itself, or (if not
    available there) from a system default font, and enclose it in a rectangle.
    Determine its center (geometrically). Translate the combining class of the
    diacritic into a pair of positioning angle and distance, using a fixed table
    made once. Place the diacritic rectangle outside the base character
    rectangle, along the positioning angle relative to the center points,
    and shift it inwards until the distance is reached. If another diacritic is to
    be added, enclose the combination generated so far in a rectangle
    retaining the center point of the original base character, call this the
    base character rectangle, and repeat. After finishing, take the final
    enclosing rectangle into consideration for line positioning.
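
    The crude starting point described above might be sketched in Python
    roughly as follows. This is only an illustration under stated
    assumptions, not a real rendering engine: the names Rect, PLACEMENT,
    place_mark and compose are hypothetical, the angle and distance values
    in the table are invented for the example, and "shifting inwards" is
    replaced by a closed-form computation of the same end position. Only
    unicodedata.combining() is an existing library call.

```python
import math
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding rectangle in font units, y axis pointing up."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    def union(self, other):
        return Rect(min(self.x0, other.x0), min(self.y0, other.y0),
                    max(self.x1, other.x1), max(self.y1, other.y1))


# Fixed table "made once": combining class -> (angle in degrees, distance).
# The angles and distances are assumed values for illustration only.
PLACEMENT = {
    230: (90.0, 50.0),   # above
    220: (270.0, 50.0),  # below
    232: (45.0, 30.0),   # above right
    228: (135.0, 30.0),  # above left
}


def place_mark(base, anchor, mark, ccc):
    """Return the rectangle of `mark` placed relative to `base`.

    The mark's center lies on the ray from `anchor` at the table angle, as
    close to the base as possible while keeping the table distance between
    the two rectangles. Only the mark's width and height are used.
    """
    angle, gap = PLACEMENT[ccc]
    dx = round(math.cos(math.radians(angle)), 12)  # rounding removes float noise
    dy = round(math.sin(math.radians(angle)), 12)
    ax, ay = anchor

    def reach(d, lo, hi, half, a):
        # Ray parameter at which the rectangles are `gap` apart on this axis.
        if d > 0:
            return (hi + gap + half - a) / d
        if d < 0:
            return (lo - gap - half - a) / d
        return math.inf

    t = min(reach(dx, base.x0, base.x1, mark.width / 2, ax),
            reach(dy, base.y0, base.y1, mark.height / 2, ay))
    cx, cy = ax + t * dx, ay + t * dy
    return Rect(cx - mark.width / 2, cy - mark.height / 2,
                cx + mark.width / 2, cy + mark.height / 2)


def compose(base, marks):
    """Stack marks, given as (bounding rectangle, combining class) pairs.

    The anchor stays at the center of the original base glyph, while the
    rectangle used for placement grows with every mark added.
    """
    anchor = base.center
    box = base
    placed = []
    for mark_box, ccc in marks:
        rect = place_mark(box, anchor, mark_box, ccc)
        placed.append(rect)
        box = box.union(rect)
    return placed, box  # `box` is the final rectangle for line positioning


if __name__ == "__main__":
    o_box = Rect(0, 0, 500, 450)       # assumed bounding box of a base glyph
    macron_box = Rect(0, 0, 300, 40)   # assumed bounding box of a macron glyph
    ccc = unicodedata.combining("\u0304")  # 230 for COMBINING MACRON
    print(compose(o_box, [(macron_box, ccc)]))
```

    With the assumed boxes above, the macron comes out horizontally centered
    over the base, 50 units above its top edge, which is the minimum asked
    for: centered, not off to the right. Everything that makes real
    positioning hard (glyph outlines instead of rectangles, italic slant,
    attached marks) is deliberately left out.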

    A "real working" algorithm like this may need some 100 pages to write down,
    but that is what the skilled developers at Microsoft et al. are paid for.

    On Sunday, 23 November 2008 at 04:29, Peter Constable wrote:
    PC> How would you suggest anybody do the homework needed to discover
    PC> that arbitrary & not-well-documented language X uses combining
    PC> character sequence <Y, Z>?
    The latter is *explicitly* no precondition for your homework. Your task
    is: "For European Alphabetic Scripts, implement a solution for any
    combinations of base characters and combining characters, especially for
    arbitrary combinations which are *not* explicitly considered in the
    available rendering system".
    It should be noted that, when it was decided in 1996 no longer to encode
    precomposed characters for European Alphabetic Scripts,
    this did not affect all diacritics.
    In fact, it has affected those diacritics which can successfully be
    handled by an algorithm as outlined above.
    For all diacritics which need special font-specific treatment,
    precomposed characters still are encoded after 1996, and have to be
    encoded if new ones are encountered.
    Such diacritics are e.g.:
    - slash overlays (horizontal and diagonal),
    - other overlays (e.g. middle tilde, double bar),
    - palatal hooks and retroflex hooks,
    - descenders.
    While no official information seems to be available, it appears
    that this decision was made with care, explicitly distinguishing
    diacritics which can be positioned automatically within reasonable
    constraints from those which cannot.
    This seems to be an (as yet implicit) part of the encoding model for
    the European Alphabetic Scripts.
    (If this assumption is correct, I propose stating it explicitly
    in the next printed version of the Unicode Standard.)
    It differs from the Arabic model (where characters which some consider
    to be precomposed are encoded as single units), and it differs
    from the models used for South Asian scripts (where combining marks are
    encoded separately even if they considerably affect the shape of the
    base character's glyph).
    PC> Usage of combining marks with Cyrillic is nowhere near as
    PC> widespread as it is with Latin. I think Vista does pretty well
    PC> supporting arbitrary combining sequences for Latin in several
    PC> fonts, as well as certain known-to-be-used sequences for Cyrillic.
    At least, there is significant progress visible in Vista regarding
    Latin combinations. Since doing the same for Cyrillic does not require
    any really new mechanism, may I expect the same level of support for
    Cyrillic in the next SP for Windows Vista?
    - Karl Pentzlin
