RE: Decomposed vs Composed accented characters

From: Kent Karlsson (kent.karlsson14@comhem.se)
Date: Fri Apr 07 2006 - 14:07:35 CST

    Otto Stolz wrote:
    > The title of the ISO 8859 series contains the term "single-byte coded
    > graphic character sets". The use of control functions for the coded
    > representation of composite characters is prohibited by ISO 8859,

    This is mentioned because in ISO/IEC 646, the use of *control* codes,
    more specifically backspace, was THE way of composing accented
    characters. E.g. <A, backspace, (7-bit) circumflex> was supposed
    to encode A WITH CIRCUMFLEX. The specification was incomplete,
    and that aspect of ISO 646 never caught on; the national
    variants did instead. Using backspace to encode composites is not
    allowed in ISO/IEC 8859-x, and it is likewise not allowed in
    ISO/IEC 10646.
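
    To make the backspace convention concrete, here is a minimal Python
    sketch of how a reader of such a stream might have turned
    <letter, backspace, accent> into one accented letter. The function
    name and the three-entry accent table are my own assumptions for
    illustration; ISO 646 itself never pinned down the details.

```python
import unicodedata

BACKSPACE = "\x08"

# Assumed mapping from 7-bit spacing accents to Unicode combining marks.
# Illustrative only; not taken from any standard.
OVERSTRIKE = {
    "^": "\u0302",  # COMBINING CIRCUMFLEX ACCENT
    "`": "\u0300",  # COMBINING GRAVE ACCENT
    "~": "\u0303",  # COMBINING TILDE
}

def decode_overstrike(text):
    """Turn <base, BACKSPACE, accent> triples into precomposed characters."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        is_triple = (
            i + 2 < len(text)
            and text[i + 1] == BACKSPACE
            and text[i + 2] in OVERSTRIKE
        )
        if is_triple:
            # Build base + combining mark, then compose with NFC.
            composed = unicodedata.normalize("NFC", ch + OVERSTRIKE[text[i + 2]])
            out.append(composed)
            i += 3
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(decode_overstrike("A\x08^"))  # A WITH CIRCUMFLEX
```

    Note that the middle element is a control code, not a character,
    which is exactly what 8859 and 10646 rule out.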

    > and there are no combining, or non-spacing (cf. infra), characters
    > defined.

    The Latin/Arabic (as you note) and Latin/Thai parts contain combining
    characters. There is no prohibition against combining characters.

    > normally are composing (such as Fatha, Damma, Kasra). However,

    They are just the "ordinary" combining characters used for Arabic.

    > ISO 6937 has been an approach to large character sets by heavy
    > use of composition. Quote from ISO 6937/2-1983:
    > > Each accented letter or umlaut is represented by a sequence
    > > of bit combinations consisting of the coded representation
    > > of the relevant non-spacing diacritical mark [...], followed
    > > by the coded representation of the relevant basic Latin letter
    > > [...]

    That text is at best misleading; I'd say it's completely wrong.

    In actual fact, ISO/IEC 6937 does not encode any combining
    characters, absolutely NONE whatsoever. Nor does it rely at all
    on any kind of composition.

    HOWEVER, it is a multibyte encoding, where lead bytes (with the
    8th bit set) sort of indicate the accent of the character (but that
    does not always hold true) and the trail byte (if a double-byte code)
    indicates the base character (except when the trail byte is the one
    for space).

    But the lead byte NEVER encodes any combining character.
    None, zilch, nada, niente, no way mate.
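
    A rough Python sketch of the distinction, under stated assumptions:
    the decoder below treats each (lead, trail) pair as one code for one
    character, looked up in a fixed table. The byte values and the tiny
    table are an illustrative subset that I have not checked against the
    standard, and the function name is hypothetical.

```python
# Illustrative subset of double-byte codes; byte values are assumptions.
# Each pair as a whole maps to ONE character.
PAIRS = {
    (0xC3, 0x41): "\u00C2",  # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    (0xC2, 0x65): "\u00E9",  # LATIN SMALL LETTER E WITH ACUTE
    (0xC8, 0x75): "\u00FC",  # LATIN SMALL LETTER U WITH DIAERESIS
    (0xC3, 0x20): "\u005E",  # trail byte is space: spacing circumflex (assumed)
}
LEAD_BYTES = {lead for lead, _ in PAIRS}

def decode_6937_sketch(data):
    """Decode single-byte and double-byte codes; never emit a combining mark."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in LEAD_BYTES:
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte code")
            pair = (b, data[i + 1])
            if pair not in PAIRS:
                # No free composition: a pair outside the table is not a code.
                raise ValueError("undefined double-byte code")
            out.append(PAIRS[pair])
            i += 2
        elif b < 0x80:
            out.append(chr(b))  # single-byte code, treated as ASCII here
            i += 1
        else:
            raise ValueError("byte not covered by this sketch")
    return "".join(out)

print(decode_6937_sketch(bytes([0x41, 0xC2, 0x65])))  # "A" then e-acute
```

    The lead byte never reaches the output on its own, and a pair that
    is not in the table is an error rather than an invitation to compose.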

                    /kent k