Re: Decomposed vs Composed accented characters

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Mon Apr 10 2006 - 11:22:59 CST

    Hello,

    Tay, William had asked:
    > Can accented characters be decomposed in other encodings, e.g. ISO
    > 8859-1, as well?

    Among other codes, I had mentioned ISO 6937:
    > ISO 6937 has been an approach to large character sets by heavy
    > use of composition. Quote from ISO 6937/2-1983:
    > > Each accented letter or umlaut is represented by a sequence
    > > of bit combinations consisting of the coded representation
    > > of the relevant non-spacing diacritical mark [...], followed
    > > by the coded representation of the relevant basic Latin letter
    > > [...]
    More specifically, this was from section 4.4 "Coded representations",
    subsection a "Accented letters and umlauts".

    Now, Kent Karlsson has written:
    > That text is at best misleading; I'd say it's completely wrong.
    > In actual fact, ISO/IEC 6937 does not encode any combining
    > characters, absolutely NONE whatsoever. Nor does it rely at all
    > on any kind of composition.

    I have quoted from the 1983 version of that standard. I have no
    easy access to its 1994 and 2001 versions, so the parts that
    I have quoted may, or may not, have been superseded. If Kent
    Karlsson can quote the essential clauses from the current (2001)
    version that invalidate my old version, I will be glad to learn
    that the gist of that standard has been completely changed within
    two revisions.

    Definition from ISO 6937/1-1983:
    > 3.19 composite graphic symbol: A graphic symbol consisting of a
    > combination of two or more other graphic symbols in a single
    > character position, such as a diacritical mark and a basic letter,
    > for example ä.

    So, that version clearly conveys the notion of combining diacritical
    marks with base characters. This is exactly what William Tay had asked
    about, so I think it was important to mention that standard. Kent,
    thank you for reminding us of ISO 646 as well, which I had forgotten
    to mention.

    Kent Karlsson has also written:
    > But [in ISO/IEC 6937] the lead byte NEVER encodes any combining
    > character.

    I cannot understand the distinction Kent draws between a "non-spacing
    diacritical mark" (cf. the quote from ISO 6937/2 above) and a
    "combining character". Whether the base character is encoded first
    (as in Unicode) or last (as in ISO 6937) is just a technical detail.
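
    To illustrate the ordering point, here is a minimal Python sketch.
    It is not a real ISO 6937 codec: decode_mark_first is a hypothetical
    helper, and the byte values in NON_SPACING are assumptions taken from
    commonly published ISO 6937 code tables, not checked against the text
    of the standard. It only shows that a mark-then-letter sequence maps
    directly onto Unicode's letter-then-mark sequence.

```python
import unicodedata

# ASSUMPTION: byte values for a few ISO 6937 non-spacing diacritical marks,
# paired with the corresponding Unicode combining characters.
NON_SPACING = {
    0xC1: "\u0300",  # grave accent
    0xC2: "\u0301",  # acute accent
    0xC3: "\u0302",  # circumflex
    0xC4: "\u0303",  # tilde
    0xC8: "\u0308",  # diaeresis / umlaut
}


def decode_mark_first(data: bytes) -> str:
    """Hypothetical helper: turn mark-then-letter bytes into letter-then-mark text."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in NON_SPACING:
            # The diacritical mark comes first; the basic letter must follow.
            if i + 1 >= len(data) or data[i + 1] >= 0x80:
                raise ValueError("diacritical mark without a following basic letter")
            out.append(chr(data[i + 1]) + NON_SPACING[b])
            i += 2
        elif b < 0x80:
            # Plain 7-bit character, passed through unchanged.
            out.append(chr(b))
            i += 1
        else:
            raise ValueError(f"byte 0x{b:02X} is not covered by this sketch")
    return "".join(out)


# Unicode's canonical decomposition puts the base letter first.
decomposed = unicodedata.normalize("NFD", "\u00E4")
print([f"U+{ord(c):04X}" for c in decomposed])

# The mark-first byte sequence carries the same two pieces of information.
text = decode_mark_first(b"M\xC8adchen")
print(unicodedata.normalize("NFC", text))
```

    Only the order of the two pieces differs between the two encodings;
    after reordering, ordinary Unicode normalization applies.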

    > [ISO/IEC 6937] is a multibyte encoding, where lead bytes (with the
    > 8th bit set) sort of indicate the accent of the character (but that
    > does not always hold true) and the trail byte (if a double-byte code)
    > indicates the base character (except when the trail byte is the one
    > for space).

    The essential difference between ISO 6937 and Unicode is that
    ISO 6937 defines a closed inventory of combined characters, while
    Unicode allows arbitrary combinations. (This reflects the display
    technology available at the respective times of origin.)
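
    The difference between a closed inventory and arbitrary combinations
    can be sketched in Python on the Unicode side. The helper name
    has_precomposed_form is made up for this illustration. It uses the
    set of precomposed characters that NFC knows about as a rough
    stand-in for a closed inventory, which is only an analogy to the
    ISO 6937 repertoire and not the same list.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS


def has_precomposed_form(base: str, marks: str) -> bool:
    """True if NFC folds base + combining marks into one single code point."""
    return len(unicodedata.normalize("NFC", base + marks)) == 1


# "a" + diaeresis is in the precomposed inventory (U+00E4).
print(has_precomposed_form("a", DIAERESIS))

# "q" + diaeresis has no precomposed character, yet the sequence is
# still valid Unicode text: the combination is simply left decomposed.
print(has_precomposed_form("q", DIAERESIS))

# Several marks on one base letter are allowed as well.
stacked = "q" + DIAERESIS + "\u0301"
print(len(unicodedata.normalize("NFC", stacked)))
```

    A scheme with a closed inventory would have to reject the last two
    cases, whereas Unicode accepts them as ordinary combining sequences.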

    Now it just so happens that every composition in ISO 6937/2 comprises
    exactly one diacritic (plus one base character, of course), which makes
    ISO 6937/2 look like a multibyte coded character set. However, the
    intent apparently was to compose one or several diacritics with a
    base character (cf. definition 3.19, quoted above); it is only that
    the original plans to encode characters for more languages (which may
    carry more than one diacritical mark) were never realized.

    Best wishes,
        Otto Stolz


