Re: Decomposed vs Composed accented characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 11 2006 - 18:34:54 CST

    Otto Stolz responded to Kent:

    > Among other codes, I had mentioned ISO 6937:
    > > ISO 6937 has been an approach to large character sets by heavy
    > > use of composition. Quote from ISO 6937/2-1983:
    > > > Each accented letter or umlaut is represented by a sequence
    > > > of bit combinations consisting of the coded representation
    > > > of the relevant non-spacing diacritical mark [...], followed
    > > > by the coded representation of the relevant basic Latin letter
    > > > [...]
    > More specifically, this was from section 4.4 "Coded representations",
    > subsection a "Accented letters and umlauts".

    That is now in section 8.3 "Coded representations of the graphic
    characters of the repertoire", subsection a) "Accented letters"
    in ISO 6937:2001, and reads essentially the same:

    "Each accented letter is represented by a sequence of bit
    combinations consisting of the coded representation of
    the relevant non-spacing diacritical mark (an element of
    the supplementary set), followed by the coded representation
    of the relevant basic Latin letter (an element of the primary
    set)."

    > Now, Kent Karlsson has written:
    > > That text is at best misleading; I'd say it's completely wrong.
    > > In actual fact, ISO/IEC 6937 does not encode any combining
    > > characters, absolutely NONE whatsoever. Nor does it rely at all
    > > on any kind of composition.
    >
    > I have quoted from the 1983 version of that standard. I have no
    > easy access to its 1994, and 2001, versions. So, the parts that
    > I have quoted may, or may not, have been superseded.

    Slightly edited (removing "umlaut" as a special case), but
    otherwise unchanged.

    > If Kent
    > Karlsson can quote the essential clauses from the current (2001)
    > version that invalidate my old version, I will be glad to learn
    > that the gist of that standard has completely been changed within
    > two revisions.

    It hasn't been, but... Kent is still correct.

    >
    > Definition from ISO 6937/1-1983:
    > > 3.19 composite graphic symbol: A graphic symbol consisting of a
    > > combination of two or more other graphic symbols in a single
    > > character position, such as a diacritical mark and a basic letter,
    > > for example ä.

    That definition has been removed, as the definitions of ISO 6937
    have been aligned with 10646 and 8859 as much as possible. Now you
    have (of relevance), just:

    4.13 graphic character: a character, other than a control function,
    that has a visual representation normally handwritten, printed
    or displayed, and that has a coded representation consisting of
    one or more bit combinations.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    and

    4.15 repertoire: a specified set of characters that are represented
    by one or more bit combinations of a coded character set
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       
    > So, that version clearly conveys the notion of combining diacritic
    > marks and base characters.

    It conveys this notion, but in the sense that Kent explained,
    rather than what you seem to be driving at here.

    > This is exactly what William Tay had asked
    > about; so I think it was important to mention that standard. Kent,
    > thank you for reminding us of ISO 646, as well, which I had forgotten
    > to mention.
    >
    > Kent Karlsson also has written:
    > > But [in ISO/IEC 6937] the lead byte NEVER encodes any combining
    > > character.
    >
    > I cannot understand the distinction Kent draws between a "non-spacing
    > diacritical mark" (cf. quote from ISO 6937/2, supra), and a "combining
    > character". It is just a technical detail, whether the base character
    > is encoded first (as in Unicode), or last (as in ISO 6937).

    Actually, it is not "just a technical detail". It is built into
    the definition of ISO 6937 now. Clause 7:

    <quote>
    7 Composition of the character repertoire

    The repertoire of the graphic characters defined in this International
    Standard consists of

    a) SPACE (SP)

    and of 332 characters as follows

    b) Latin alphabetic characters comprising

       1) the 52 capital and small letters of the basic Latin alphabet,
       
       2) accented letters, the graphic representations of which consist
          of combinations of basic Latin letters with diacritical marks,
          
       ...
       
    The repertoire, excluding SPACE, is specified in Table 4. In each
    table entry, the first column specifies the name of the character.
    The second column specifies its coded representation...
    </quote>

    > > [ISO/IEC 6937] is a multibyte encoding, where lead bytes (with the
    > > 8th bit set) sort of indicate the accent of the character (but that
    > > does not always hold true) and the trail byte (if a double-byte code)
    > > indicates the base character (except when the trail byte is the one
    > > for space).
    >
    > The essential difference between ISO 6937 and Unicode is that
    > ISO 6937 defines a closed inventory of combined characters,

    This is true... and is what is specified as the "character repertoire"
    listed exhaustively in Table 4 of the standard. So, for example:

    Name Coded representation

    LATIN SMALL LETTER A WITH ACUTE 12/02 06/01

    (or translated to hex for modern users: <C2 61>)

    What you are missing here is that C2, while being the "bit
    combination" representing the "diacritical mark" "non-spacing
    acute accent", is *not* a member of the encoded character
    repertoire. <C2 61> *is*. <C2> is *not*.
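
    The point can be illustrated with a minimal Python sketch of a
    decoder that treats the repertoire as a closed lookup table keyed
    by whole coded representations. This is not taken from the
    standard: the function name decode_6937 and both tables are
    hypothetical, the table holds only the single entry quoted above
    (a real one would enumerate all of Table 4), and single-octet
    handling is restricted to SPACE and the 52 basic Latin letters
    on the assumption that their codes coincide with ASCII.

```python
# Hypothetical excerpt of the repertoire, keyed by coded representation.
# Only <C2 61> comes from the message; a real table would list Table 4 in full.
REPERTOIRE = {
    (0xC2, 0x61): "\u00E1",  # LATIN SMALL LETTER A WITH ACUTE
}

# Bit combinations standing for non-spacing diacritical marks.
# Only 0xC2 (acute) is attested in the message; the rest are omitted here.
DIACRITIC_OCTETS = {0xC2}


def is_basic_letter_or_space(octet: int) -> bool:
    """Assumption: SPACE and the 52 basic Latin letters use their ASCII codes."""
    return octet == 0x20 or 0x41 <= octet <= 0x5A or 0x61 <= octet <= 0x7A


def decode_6937(data: bytes) -> str:
    """Decode a byte string using the closed table above (illustrative only)."""
    out = []
    i = 0
    while i < len(data):
        octet = data[i]
        if octet in DIACRITIC_OCTETS:
            # The diacritic octet is not a character by itself: it is only
            # meaningful as the first half of a listed two-octet sequence.
            if i + 1 >= len(data):
                raise ValueError("diacritic octet with no following letter")
            key = (octet, data[i + 1])
            if key not in REPERTOIRE:
                raise ValueError(f"<{key[0]:02X} {key[1]:02X}> is not in the repertoire")
            out.append(REPERTOIRE[key])
            i += 2
        elif is_basic_letter_or_space(octet):
            out.append(chr(octet))
            i += 1
        else:
            raise ValueError(f"octet {octet:02X} is not handled by this sketch")
    return "".join(out)


print(decode_6937(b"\xC2\x61") == "\u00E1")  # True: <C2 61> is one character
```

    In this sketch a lone <C2>, or <C2> followed by a letter that has
    no table entry, raises ValueError rather than producing a
    free-standing mark: the lookup is on the whole sequence, never on
    the diacritic octet alone.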

    To drive home this point, the standard says:

    <quote>
    The names of the characters and non-spacing diacritical marks
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    are specified in Table 3. In order to stress that non-spacing
    diacritical marks are not characters, the names given to them
                      ^^^^^^^^^^^^^^^^^^
    are printed in lower case italics.
    </quote>

    [ emphasis added by me ]

    And, in Annex C:

    <quote>
      NOTE: The term "non-spacing diacritical mark" is used in
      this International Standard in a metaphorical sense only.
      The use of non-spacing diacritical marks is limited to
      combinations implied by the following table:
      
      Table C.1 - Combinations of diacritical marks and basic letters
    </quote>

    This kind of clarification, added in the wake of Unicode
    and 10646 with their willingness to encode non-spacing marks
    *as* productive characters, was insisted upon by the editor
    of ISO 6937 and by the principal users of ISO 6937 (including
    the then Netherlands NB representative), in part to make it
    crystal clear that ISO 6937 did *not* encode non-spacing
    combining mark *characters*, but only bit combinations
    representing diacritics that in combination with basic
    Latin letters created 2-octet sequences that encoded a
    single graphic character.

    O.k., get it?

    > while
    > Unicode allows arbitrary combinations. (This reflects the display
    > technology available at the respective times of origin.)
    >
    > Now it just so happens that all compositions in ISO 6937/2 comprise
    > only one diacritic (plus one base character, of course), which lets
    > ISO 6937/2 appear similar to a multibyte coded character set; however,
    > the intent apparently was a composition of one, or several, diacritics
    > with a base character (cf. definition 3.19, quoted supra) -- only
    > the original plans to encode characters for more languages (that may
    > carry more than one diacritical mark) never have been realized.

    The scope of ISO 6937 states, among other things, that it:

     b) specifies a repertoire of the Latin alphabetic and non-alphabetic
        characters for the communication of text in many European
        languages using the Latin script;
        
    I don't believe it ever was scoped to have universal intent for
    its coverage, nor for it to go beyond the list of enumerated
    accented characters now listed in Table C.1.

    Sure, if ISO 6937 had seen wider implementation and been more
    successful, at some point, making use of the limited set of
    bit combinations representing diacritics, it might eventually
    have been extended by adding a few combinations like g-hacek
    and k-hacek for Skolt Sami, o-ogonek for Sami, h-hacek for
    Finnish Romany, and so on. But I seriously doubt it would
    have attempted to include a mechanism for two diacritics
    on a letter, nor have any kind of productivity. The *characters*
    were those listed in the repertoire, not the diacritics.
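
    For contrast, here is a small Python sketch of what "productive"
    means on the Unicode side, using only the standard unicodedata
    module. The helper names describe and compose are hypothetical
    and chosen for illustration. In Unicode the combining acute
    accent is an encoded character in its own right, it follows the
    base letter instead of preceding it, and it may be applied to any
    base letter whether or not a precomposed form exists.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, an encoded character in Unicode


def describe(ch: str):
    """Return (name, general category, canonical combining class) for ch."""
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch)


def compose(base: str, mark: str = ACUTE) -> str:
    """Append the mark after the base (Unicode order) and normalize to NFC."""
    return unicodedata.normalize("NFC", base + mark)


# The mark has its own name and properties, independent of any base letter.
print(describe(ACUTE))    # ('COMBINING ACUTE ACCENT', 'Mn', 230)

# a + acute has a precomposed form, so NFC yields a single code point.
print(len(compose("a")))  # 1

# q + acute has no precomposed form, yet the sequence is still valid text.
print(len(compose("q")))  # 2
```

    The last case is the one with no counterpart in a closed
    repertoire: the combination is accepted even though nobody
    enumerated it in advance.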

    --Ken


