RE: Decomposed vs Composed accented characters

From: Kent Karlsson (kent.karlsson14@comhem.se)
Date: Mon Apr 10 2006 - 13:33:48 CST

    Otto Stolz wrote:
    > Among other codes, I had mentioned ISO 6937:
    ...
    > More specifically, this was from section 4.4 "Coded representations",
    > subsection a "Accented letters and umlauts".
    >
    > Now, Kent Karlsson has written:
    > > That text is at best misleading; I'd say it's completely wrong.
    > > In actual fact, ISO/IEC 6937 does not encode any combining
    > > characters, absolutely NONE whatsoever. Nor does it rely at all
    > > on any kind of composition.
    >
    > I have quoted from the 1983 version of that standard. I have no
    > easy access to its 1994, and 2001, versions. So, the parts that
    > I have quoted may, or may not, have been superseded. If Kent
    > Karlsson can quote the essential clauses from the current (2001)
    > version that invalidate my old version, I will be glad to learn
    > that the gist of that standard has completely been changed within
    > two revisions.

    No, no change. That misleading explanation of the design approach
    is still there. See http://std.dkuug.dk/jtc1/sc2/open/02n3239.pdf
    for a 1998 'Committee Draft' text.

    > Definition from ISO 6937/1-1983:
    > > 3.19 composite graphic symbol: A graphic symbol consisting of a
    > > combination of two or more other graphic symbols in a single
    > > character position, such as a diacritical mark and a basic letter,
    > > for example .
    >
    > So, that version clearly conveys the notion of combining diacritic
    > marks and base characters. This is exactly what William Tay had asked
    > about; so I think it was important to mention that standard. Kent,
    > thank you for reminding us of ISO 646, as well, which I had forgotten
    > to mention.

    Still misleading. The actual technical construction is that of lead
    bytes (in the range C0-CF) that *indicate* the accents in the
    *precomposed* characters encoded in 6937.

    Look at the table of encoded characters in table 4. There is not a
    single COMBINING character encoded, whether to be before or after
    a base character. This multibyte encoding is constructed to look
    like there is an "accent + base", but in actual fact that is not the
    case.
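
    To make the point concrete, here is a minimal sketch of a decoder
    for such a lead-byte + trail-byte scheme. The table is a tiny,
    hypothetical subset: the byte values are assumptions taken from
    common descriptions of 6937 and should be checked against table 4.
    The function name, and treating bytes below 0x80 as ASCII, are
    likewise simplifications made for illustration only.

```python
# Illustrative subset only. Byte values are ASSUMPTIONS based on common
# descriptions of ISO/IEC 6937; verify them against table 4 of the standard.
# Each (lead byte, trail byte) pair maps to ONE precomposed character.
PAIRS = {
    (0xC1, 0x61): "\u00E0",  # "grave" lead byte + 'a'   -> a with grave
    (0xC2, 0x65): "\u00E9",  # "acute" lead byte + 'e'   -> e with acute
    (0xC8, 0x75): "\u00FC",  # "diaeresis" lead + 'u'    -> u with diaeresis
    (0xCB, 0x63): "\u00E7",  # "cedilla" lead + 'c'      -> c with cedilla
    (0xC2, 0x20): "\u00B4",  # "acute" lead + space      -> spacing acute accent
}


def decode_6937_sketch(data: bytes) -> str:
    """Decode bytes using the closed pair table above (hypothetical helper)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xC0 <= b <= 0xCF:
            # Lead byte: it must be followed by a trail byte, and the PAIR
            # must be in the table. Nothing is composed at run time.
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte code")
            pair = (b, data[i + 1])
            if pair not in PAIRS:
                raise ValueError("pair %02X %02X is not an encoded character" % pair)
            out.append(PAIRS[pair])
            i += 2
        elif b < 0x80:
            # Simplification: single bytes below 0x80 are treated as ASCII.
            out.append(chr(b))
            i += 1
        else:
            raise ValueError("byte %02X is outside this sketch" % b)
    return "".join(out)
```

    With this table, the bytes 63 61 66 C2 65 decode to the four
    characters of "café", the last one being a single precomposed
    character. The pair C2 71 ("acute" + 'q') is rejected, because the
    lead byte is not a character that can be applied to an arbitrary
    base; it only selects among the pairs that are actually encoded.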

    So, table 4 is the key here. Not the somewhat clumsy explanation of
    the overall design (sometimes sidestepped) of the multibyte encoding.

    > Kent Karlsson also has written:
    > > But [in ISO/IEC 6937] the lead byte NEVER encodes any combining
    > > character.
    >
    > I cannot understand the distinction Kent draws between a "non-spacing
    > diacritical mark" (cf. quote from ISO 6937/2, supra), and a "combining
    > character". It is just a technical detail, whether the base character
    > is encoded first (as in Unicode), or last (as in ISO 6937).

    Look at table 4.

    > > [ISO/IEC 6937] is a multibyte encoding, where lead bytes (with the
    > > 8th bit set) sort of indicate the accent of the character (but that
    > > does not always hold true) and the trail byte (if a double-byte
    > > code) indicates the base character (except when the trail byte is
    > > the one for space).
    >
    > The essential difference between ISO 6937 and Unicode is that
    > ISO 6937 defines a closed inventory of combined characters, while

    There are no "combined" characters in 6937. There are quite a number
    of what Unicode calls *precomposed* characters, except that there is no
    composition in 6937.

    > Unicode allows arbitrary combinations. (This reflects the display
    > technology available at the respective times of origin.)
    >
    > Now it just so happens that all compositions in ISO 6937/2 comprise
    > only one diacritic (plus one base character, of course), which lets
    > ISO 6937/2 appear similar to a multibyte coded character set; however,
    > the intent apparently was a composition of one, or several, diacritics
    > with a base character (cf. definition 3.19, quoted supra) -- only
    > the original plans to encode characters for more languages (that may
    > carry more than one diacritical mark) never have been realized.

    6937 *is* a multibyte coded character encoding. If you don't look
    closely enough, it appears similar to an encoding with combining
    characters (given before the base), but that it definitely is not.
    Look at table 4 again.
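
    For contrast, this is roughly what real combining characters look
    like on the Unicode side, sketched with Python's standard
    unicodedata module. The variable names are only illustrative.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# U+0301 is a combining character in its own right: it has a non-zero
# canonical combining class and comes AFTER the base character.
is_combining = unicodedata.combining("\u0301") != 0

# Normalization converts between the two equivalent representations.
to_nfd = unicodedata.normalize("NFD", precomposed)
to_nfc = unicodedata.normalize("NFC", decomposed)

# The combining mark can follow any base. Unicode has no precomposed
# "q with acute", so the sequence stays two code points even after NFC.
q_acute = unicodedata.normalize("NFC", "q\u0301")

print(is_combining, to_nfd == decomposed, to_nfc == precomposed, len(q_acute))
```

    The last case is the one that has no counterpart in 6937: an
    arbitrary base plus a separately encoded mark.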

                    /kent k


