RE: Decomposed vs Composed accented characters

From: Keutgen, Walter (walter.keutgen@be.unisys.com)
Date: Wed Apr 12 2006 - 07:17:21 CST


    Kent,

    Having read the *draft* standard to which you kindly provided the link, I can only conclude that Otto's reading is correct. See the following quote (copied and pasted):

    "8.3 Coded representations of the graphic characters of the repertoire
    "
    "The coded representations of the graphic characters of the repertoire defined in this International Standard are
    "specified in table 4. The formats of the coded representations are as follows:
    "
    "a) Accented letters
    "Each accented letter is represented by a sequence of bit combinations consisting of the coded
    "representation of the relevant non-spacing diacritical mark (an element of the supplementary set),
    "followed by the coded representation of the relevant basic Latin letter (an element of the primary
    "set).
    "
    "b) Diacritical marks as separate graphic characters
    "The diacritical marks that are elements of the primary set (GRAVE ACCENT, CIRCUMFLEX ACCENT and
    "TILDE) are represented as separate graphic characters by the corresponding single bit combination in the
    "range 02/01 to 07/14.
    "The other ten of the diacritical marks of column 12 are represented as separate graphic characters by a
    "sequence of bit combinations consisting of the coded representation of the relevant non-spacing diacritical
    "mark (an element of the supplementary set), followed by the coded representation of the character SPACE,
    "i.e. the bit combination 02/00.
    "
    "c) All other graphic characters of the repertoire
    "Any graphic character of the repertoire, other than an accented letter or a diacritical mark as a
    "separate graphic character that is not an element of the primary set, is an element of either the
    "primary set or the supplementary set and is represented by the corresponding single bit
    "combination in the range 02/01 to 07/14 or 10/00 to 15/15.
    "Depending of the code extension techniques used, a bit combination, representing an element of either the primary
    "or the supplementary set may have to be preceded by a code extension function invoking the character set
    "concerned.

    The standard distinguishes two coded character SETS: the PRIMARY one (tables 1 and 2) and the SUPPLEMENTARY one
    (tables 1 and 3). The latter includes the 13 non-spacing diacritical MARKS, which are not characters in their
    own right: their coded representation may never stand alone, but must be followed by a base letter or by SPACE,
    as restricted by the 'repertoire'.

    Table 4 defines the character REPERTOIRE i.e. the valid combinations.
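
    As a minimal sketch of how I read clause 8.3 (a) and (b), the following Python decodes "mark + base letter"
    and "mark + SPACE" and optionally checks the result against a repertoire. The bit combinations for the marks
    are my assumption and should be checked against table 3; treating the primary set as plain ASCII is a
    simplification; the function name and the sample repertoire are hypothetical, not taken from the standard.

```python
import unicodedata

# ASSUMPTION: bit combination -> (Unicode combining mark, spacing form or None).
# None marks the three diacritics that the quoted text places in the primary set.
MARKS = {
    0xC1: ("\u0300", None),      # grave
    0xC2: ("\u0301", "\u00b4"),  # acute
    0xC3: ("\u0302", None),      # circumflex
    0xC4: ("\u0303", None),      # tilde
    0xC5: ("\u0304", "\u00af"),  # macron
    0xC6: ("\u0306", "\u02d8"),  # breve
    0xC7: ("\u0307", "\u02d9"),  # dot above
    0xC8: ("\u0308", "\u00a8"),  # diaeresis
    0xCA: ("\u030a", "\u02da"),  # ring above
    0xCB: ("\u0327", "\u00b8"),  # cedilla
    0xCD: ("\u030b", "\u02dd"),  # double acute
    0xCE: ("\u0328", "\u02db"),  # ogonek
    0xCF: ("\u030c", "\u02c7"),  # caron
}


def decode_by_composition(data, repertoire=None):
    """Decode 8-bit data, reading 'mark + letter' as a composition (hypothetical helper)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in MARKS:
            if i + 1 >= len(data):
                raise ValueError("non-spacing mark may not stand alone")
            combining, spacing = MARKS[b]
            follower = chr(data[i + 1])
            if follower == " ":
                # 8.3 (b): mark + SPACE is the diacritic as a separate character.
                if spacing is None:
                    raise ValueError("this mark is a primary-set character already")
                ch = spacing
            elif follower.isascii() and follower.isalpha():
                # 8.3 (a): mark first, then the basic Latin letter.
                ch = unicodedata.normalize("NFC", follower + combining)
                if repertoire is not None and ch not in repertoire:
                    raise ValueError(f"{ch!r} is not in the repertoire")
            else:
                raise ValueError("mark must be followed by a letter or SPACE")
            out.append(ch)
            i += 2
        elif b < 0x80:
            out.append(chr(b))  # simplification: primary set treated as ASCII
            i += 1
        else:
            raise ValueError(f"byte 0x{b:02X} is not handled in this sketch")
    return "".join(out)
```

    Without a repertoire argument the sketch accepts any letter after any mark, which is the open combining
    mechanism; passing a set such as {"\u00e9", "\u00fc"} imitates the restriction that table 4 imposes.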

    But there are contradictions, at least from the usability point of view:

    In Annex D:

    "NOTE 19
    "For spelling the Welsh language correctly, some more letters are in fact required. They are not
    "included in the repertoire, but are only identified here:
    "LATIN CAPITAL LETTER W WITH ACUTE
    "LATIN SMALL LETTER W WITH ACUTE
    "LATIN CAPITAL LETTER W WITH GRAVE
    "LATIN SMALL LETTER W WITH GRAVE
    "LATIN CAPITAL LETTER W WITH DIAERESIS
    "LATIN SMALL LETTER W WITH DIAERESIS
    "LATIN CAPITAL LETTER Y WITH GRAVE
    "LATIN SMALL LETTER Y WITH GRAVE

    No Welsh representative in the committee or a fee not paid by Wales? :-)
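
    For comparison, all eight letters of NOTE 19 exist in Unicode, and each one decomposes canonically into a
    basic Latin letter plus a single combining mark, so a mark-plus-letter mechanism would have had no technical
    difficulty with them. A small Python sketch (the function name is mine; the character names are those quoted
    above):

```python
import unicodedata

WELSH_NAMES = [
    "LATIN CAPITAL LETTER W WITH ACUTE",
    "LATIN SMALL LETTER W WITH ACUTE",
    "LATIN CAPITAL LETTER W WITH GRAVE",
    "LATIN SMALL LETTER W WITH GRAVE",
    "LATIN CAPITAL LETTER W WITH DIAERESIS",
    "LATIN SMALL LETTER W WITH DIAERESIS",
    "LATIN CAPITAL LETTER Y WITH GRAVE",
    "LATIN SMALL LETTER Y WITH GRAVE",
]


def welsh_decompositions():
    """Return (code point, base letter, combining mark name) for each letter."""
    rows = []
    for name in WELSH_NAMES:
        ch = unicodedata.lookup(name)
        # Canonical decomposition: exactly one base letter and one mark here.
        base, mark = unicodedata.normalize("NFD", ch)
        rows.append((ord(ch), base, unicodedata.name(mark)))
    return rows


if __name__ == "__main__":
    for code_point, base, mark_name in welsh_decompositions():
        print(f"U+{code_point:04X} = {base} + {mark_name}")
```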

    In 7 bit encoding, escape sequences are necessary, which will separate the 'lead byte' from the 'base letter'.
    In my opinion this is a strange property for a precomposed encoding.
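
    A rough illustration of what I mean, in Python. I assume here that the supplementary set is invoked with the
    ISO 2022 locking shifts SO and SI; the standard may well prescribe a different code extension function (a
    single shift would behave differently), so take this only as a sketch of the effect, with a hypothetical
    function name.

```python
SO = 0x0E  # SHIFT OUT: invoke the (assumed) supplementary set
SI = 0x0F  # SHIFT IN: return to the primary set


def to_7bit_with_locking_shifts(data):
    """Re-encode 8-bit data for a 7-bit environment using SO/SI (sketch only)."""
    out = bytearray()
    shifted = False
    for b in data:
        needs_shift = b >= 0x80
        if needs_shift != shifted:
            # Emit a shift function whenever the active set has to change.
            out.append(SO if needs_shift else SI)
            shifted = needs_shift
        out.append(b & 0x7F)
    if shifted:
        out.append(SI)  # leave the stream in the primary set
    return bytes(out)


if __name__ == "__main__":
    # 0xC2 is assumed to be the non-spacing acute accent.
    print(to_7bit_with_locking_shifts(b"\xc2e").hex(" "))
```

    Under these assumptions the shift function SI ends up between the accent and its base letter.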

    The letter sequence 'lead', as in 'lead byte', does not appear in the text. The sequence 'compo' does, but
    composition is used at a higher level i.e. a repertoire composed of characters.

    Searching for 'combin' yields many hits, most of them 'bit combination', among which:

    "4.15 repertoire: A specified set of characters that are represented by one or more bit combinations of a coded
    "character set.

    Why 'or more bit combinations'?

    The standard begins with a clear, not clumsy, combining mechanism and ends by allowing only some combinations,
    admitting that at least one language, Welsh, has been omitted. That, in my opinion, is where the clumsiness lies.

    The only explanation I can find lies in the sub-repertoires. The possibility of defining sub-repertoires seems
    intended for a sub-application that supports even less, with the main application expected to adapt, e.g. by
    choosing another sub-application. In any case, the standard does not seem to have been released.

    'Annex C' rather reflects your opinion, but it is marked 'informative'.

    ---------------------------------------------------------------------------------
    Annex D of this standard is interesting.

    It could be used for the exemplar character sets in the CLDR project.

    Best regards
              
    Walter Keutgen
    International Engineering Centre
    Unisys Belgium nv-sa

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Kent Karlsson
    Sent: Monday, 10 April 2006 21:34
    To: 'Otto Stolz'; unicode@unicode.org
    Cc: 'Tay, William'
    Subject: RE: Decomposed vs Composed accented characters

    Otto Stolz wrote:
    > Among other codes, I had mentioned ISO 6937:
    ...
    > More specifically, this was from section 4.4 "Coded representations",
    > subsection a "Accented letters and umlauts".
    >
    > Now, Kent Karlsson has written:
    > > That text is at best misleading; I'd say it's completely wrong.
    > > In actual fact, ISO/IEC 6937 does not encode any combining
    > > characters, absolutely NONE whatsoever. Nor does it rely at all
    > > on any kind of composition.
    >
    > I have quoted from the 1983 version of that standard. I have no
    > easy access to its 1994, and 2001, versions. So, the parts that
    > I have quoted may, or may not, have been superseded. If Kent
    > Karlsson can quote the essential clauses from the current (2001)
    > version that invalidate my old version, I will be glad to learn
    > that the gist of that standard has completely been changed within
    > two revisions.

    No, no change. That misleading explanation of the design approach
    is still there. See http://std.dkuug.dk/jtc1/sc2/open/02n3239.pdf
    for a 1998 'Committee Draft' text.

    > Definition from ISO 6937/1-1983:
    > > 3.19 composite graphic symbol: A graphic symbol consisting of a
    > > combination of two or more other graphic symbols in a single
    > > character position, such as a diacritical mark and a basic letter,
    > > for example .
    >
    > So, that version clearly conveys the notion of combining diacritic
    > marks and base characters. This is exactly what William Tay had asked
    > about; so I think it was important to mention that standard. Kent,
    > thank you for reminding us of ISO 646, as well, which I had forgotten
    > to mention.

    Still misleading. The actual technical construction is that of lead
    bytes (in the range C0-CF) that *indicate* the accents in the
    *precomposed* characters encoded in 6937.

    Look at the table of encoded characters in table 4. There is not a
    single COMBINING character encoded, whether to be before or after
    a base character. This multibyte encoding is constructed to look
    like there is an "accent + base", but in actual fact that is not the
    case.

    So, table 4 is the key here. Not the somewhat clumsy explanation of
    the overall design (sometimes sidestepped) of the multibyte encoding.
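
    To make the point concrete, here is a minimal Python sketch of that reading: each two-byte code is looked up
    as a whole in a table, and nothing is composed. The table below is only a small illustrative subset chosen
    for this example, not a transcription of table 4, and the byte values are assumptions to be checked against
    the standard; the names are hypothetical.

```python
import unicodedata

# ASSUMPTION: illustrative subset only; the real list is table 4.
TABLE4_SUBSET = {
    b"\xc1a": "\u00e0",  # LATIN SMALL LETTER A WITH GRAVE
    b"\xc2e": "\u00e9",  # LATIN SMALL LETTER E WITH ACUTE
    b"\xc8u": "\u00fc",  # LATIN SMALL LETTER U WITH DIAERESIS
    b"\xcfs": "\u0161",  # LATIN SMALL LETTER S WITH CARON
    b"\xc8 ": "\u00a8",  # DIAERESIS as a separate graphic character
}


def decode_multibyte(data):
    """Decode by whole-code lookup: a lead byte plus a trail byte is one code."""
    out = []
    i = 0
    while i < len(data):
        if 0xC0 <= data[i] <= 0xCF:
            code = data[i:i + 2]  # lead byte and trail byte together
            if code not in TABLE4_SUBSET:
                raise ValueError(f"{code!r} is not an encoded character")
            out.append(TABLE4_SUBSET[code])
            i += 2
        elif data[i] < 0x80:
            out.append(chr(data[i]))  # simplification: treated as ASCII
            i += 1
        else:
            raise ValueError(f"byte 0x{data[i]:02X} is not handled in this sketch")
    return "".join(out)


if __name__ == "__main__":
    text = decode_multibyte(b"M\xc8uller")
    # No combining character appears anywhere in the decoded result.
    print(text, all(unicodedata.combining(c) == 0 for c in text))
```

    A pair that is missing from the table is simply not a character in this model, whatever the lead byte
    appears to suggest.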

    > Kent Karlsson also has written:
    > > But [in ISO/IEC 6937] the lead byte NEVER encodes any combining
    > > character.
    >
    > I cannot understand the distinction Kent draws between a "non-spacing
    > diacritical mark" (cf. quote from ISO 6937/2, supra), and a "combining
    > character". It is just a technical detail, whether the base character
    > is encoded first (as in Unicode), or last (as in ISO 6937).

    Look at table 4.

    > > [ISO/IEC 6937] is a multibyte encoding, where lead bytes (with the
    > > 8th bit set) sort of indicate the accent of the character (but that
    > > does not always hold true) and the trail byte (if a double-byte code)
    > > indicates the base character (except when the trail byte is the one
    > > for space).
    >
    > The essential difference between ISO 6937 and Unicode is that
    > ISO 6937 defines a closed inventory of combined characters, while

    There are no "combined" characters in 6937. There are quite a number
    of what Unicode calls *precomposed* characters, except that there is no
    composition in 6937.

    > Unicode allows arbitrary combinations. (This reflects the display
    > technology available at the respective times of origin.)
    >
    > Now it just so happens that all compositions in ISO 6937/2 comprise
    > only one diacritic (plus one base character, of course), which lets
    > ISO 6937/2 appear similar to a multibyte coded character set; however,
    > the intent apparently was a composition of one, or several, diacritics
    > with a base character (cf. definition 3.19, quoted supra) -- only
    > the original plans to encode characters for more languages (that may
    > carry more than one diacritical mark) never have been realized.

    6937 *is* a multibyte coded character encoding. If you don't look
    closely enough, it appears similar to an encoding with combining
    characters (given before the base), but that is definitely not what
    it is. Look at table 4 again.

                    /kent k


