Re: Decomposed vs Composed accented characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 11 2006 - 18:34:54 CST

Next message: Kenneth Whistler: "RE: Decomposed vs Composed accented characters"

Previous message: Kent Karlsson: "RE: Decomposed vs Composed accented characters"
Maybe in reply to: Tay, William: "Decomposed vs Composed accented characters"
Next in thread: Kenneth Whistler: "RE: Decomposed vs Composed accented characters"

Otto Stolz responded to Kent:

> Among other codes, I had mentioned ISO 6937:
> > ISO 6937 has been an approach to large character sets by heavy
> > use of composition. Quote from ISO 6937/2-1983:
> > > Each accented letter or umlaut is represented by a sequence
> > > of bit combinations consisting of the coded representation
> > > of the relevant non-spacing diacritical mark [...], followed
> > > by the coded representation of the relevant basic Latin letter
> > > [...]
> More specifically, this was from section 4.4 "Coded representations",
> subsection a "Accented letters and umlauts".

That is now in section 8.3 "Coded representations of the graphic
characters of the repertoire", subsection a) "Accented letters"
in ISO 6937:2001, and reads essentially the same:

"Each accented letter is represented by a sequence of bit
combinations consisting of the coded representation of
the relevant non-spacing diacritical mark (an element of
the supplementary set), followed by the coded representation
of the relevant basic Latin letter (an element of the primary
set)."

> Now, Kent Karlsson has written:
> > That text is at best misleading; I'd say it's completely wrong.
> > In actual fact, ISO/IEC 6937 does not encode any combining
> > characters, absolutely NONE whatsoever. Nor does it rely at all
> > on any kind of composition.
>
> I have quoted from the 1983 version of that standard. I have no
> easy access to its 1994, and 2001, versions. So, the parts that
> I have quoted may, or may not, have been superseded.

Slightly edited (removing "umlaut" as a special case), but
otherwise unchanged.

> If Kent
> Karlsson can quote the essential clauses from the current (2001)
> version that invalidate my old version, I will be glad to learn
> that the gist of that standard has completely been changed within
> two revisions.

It hasn't been, but... Kent is still correct.

>
> Definition from ISO 6937/1-1983:
> > 3.19 composite graphic symbol: A graphic symbol consisting of a
> > combination of two or more other graphic symbols in a single
> > character position, such as a diacritical mark and a basic letter,
> > for example ä.

That definition has been removed, as the definitions of ISO 6937
have been aligned with 10646 and 8859 as much as possible. Now you
have (of relevance), just:

4.13 graphic character: a character, other than a control function,
that has a visual representation normally handwritten, printed
or displayed, and that has a coded representation consisting of
one or more bit combinations.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

and

4.15 repertoire: a specified set of characters that are represented
by one or more bit combinations of a coded character set
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> So, that version clearly conveys the notion of combining diacritic
> marks and base characters.

It conveys this notion, but in the sense that Kent explained,
rather than what you seem to be driving at here.

> This is exactly what William Tay had asked
> about; so I think it was important to mention that standard. Kent,
> thank you for reminding us of ISO 646, as well, which I had forgotten
> to mention.
>
> Kent Karlsson also has written:
> > But [in ISO/IEC 6937] the lead byte NEVER encodes any combining
> > character.
>
> I cannot understand the distinction Kent draws between a "non-spacing
> diacritical mark" (cf. quote from ISO 6937/2, supra), and a "combining
> character". It is just a technical detail, whether the base character
> is encoded first (as in Unicode), or last (as in ISO 6937).

Actually, it is not "just a technical detail". It is built into
the definition of ISO 6937 now. Clause 7:

<quote>
7 Composition of the character repertoire

The repertoire of the graphic characters defined in this International
Standard consists of

a) SPACE (SP)

and of 332 characters as follows

b) Latin alphabetic characters comprising

   1) the 52 capital and small letters of the basic Latin alphabet,

   2) accented letters, the graphic representations of which consist
      of combinations of basic Latin letters with diacritical marks,

   ...

The repertoire, excluding SPACE, is specified in Table 4. In each
table entry, the first column specifies the name of the character.
The second column specifies its coded representation...
</quote>

> > [ISO/IEC 6937] is a multibyte encoding, where lead bytes (with the
> > 8th bit set) sort of indicate the accent of the character (but that
> > does not always hold true) and the trail byte (if a double-byte code)
> > indicates the base character (except when the trail byte is the one
> > for space).
>
> The essential difference between ISO 6937 and Unicode is that
> ISO 6937 defines a closed inventory of combined characters,

This is true... and is what is specified as the "character repertoire"
listed exhaustively in Table 4 of the standard. So, for example:

Name                               Coded representation

LATIN SMALL LETTER A WITH ACUTE    12/02 06/01

(or translated to hex for modern users: <C2 61>)
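For anyone unfamiliar with the column/row notation, the translation to hex is just column * 16 + row. A minimal sketch in Python, where the helper name colrow_to_byte is my own invention for illustration and not anything from the standard:

```python
def colrow_to_byte(pos: str) -> int:
    """Convert ISO-style 'column/row' notation (e.g. '12/02') to a byte value."""
    col, row = (int(part) for part in pos.split("/"))
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must each be in 0..15: {pos!r}")
    # The column is the high nibble, the row is the low nibble.
    return col * 16 + row


# The coded representation quoted from Table 4 above.
coded = bytes(colrow_to_byte(p) for p in "12/02 06/01".split())
print(coded.hex(" ").upper())  # C2 61
```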

What you are missing here is that C2, while being the "bit
combination" representing the "diacritical mark" "non-spacing
acute accent", is *not* a member of the encoded character
repertoire. <C2 61> *is*. <C2> is *not*.
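One way to picture the distinction is a decoder that only ever looks up whole coded representations. The following is a rough sketch under stated assumptions, not a real ISO 6937 codec: the table holds a single entry (the one quoted above), the names REPERTOIRE_2OCTET and decode_6937_sketch are hypothetical, octets below 0x80 are simply treated as ASCII, and the one-octet characters of the supplementary set are left out entirely.

```python
# Hypothetical one-entry excerpt of the repertoire; only <C2 61> comes from
# the message above. A real table would list every 2-octet entry of Table 4.
REPERTOIRE_2OCTET = {
    b"\xC2\x61": "\u00E1",  # LATIN SMALL LETTER A WITH ACUTE
}


def decode_6937_sketch(data: bytes) -> str:
    """Decode by looking up whole coded representations, never lone accents."""
    out = []
    i = 0
    while i < len(data):
        octet = data[i]
        if octet < 0x80:
            # Simplifying assumption: treat the primary set as plain ASCII.
            out.append(chr(octet))
            i += 1
            continue
        # The unit of lookup is the 2-octet sequence. The first octet alone
        # has no entry, because it does not encode a character by itself.
        pair = data[i:i + 2]
        if pair not in REPERTOIRE_2OCTET:
            raise ValueError(f"not in the (sketched) repertoire: {pair.hex(' ')}")
        out.append(REPERTOIRE_2OCTET[pair])
        i += 2
    return "".join(out)


print(decode_6937_sketch(b"\xC2\x61"))  # one character out of two octets
```

In this sketch a trailing lone <C2>, or <C2> followed by a letter that has no table entry, is an error rather than a "floating accent", which is the closed-repertoire behavior described above.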

To drive home this point, the standard says:

<quote>
The names of the characters and non-spacing diacritical marks
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
are specified in Table 3. In order to stress that non-spacing
diacritical marks are not characters, the names given to them
                  ^^^^^^^^^^^^^^^^^^
are printed in lower case italics.
</quote>

[ emphasis added by me ]

And, in Annex C:

<quote>
  NOTE: The term "non-spacing diacritical mark" is used in
  this International Standard in a metaphorical sense only.
  The use of non-spacing diacritical marks is limited to
  combinations implied by the following table:

  Table C.1 - Combinations of diacritical marks and basic letters
</quote>

This kind of clarification, added in the wake of Unicode
and 10646 with their willingness to encode non-spacing marks
*as* productive characters, was insisted upon by the editor
of ISO 6937 and by the principal users of ISO 6937 (including
the then Netherlands NB representative), in part to make it
crystal clear that ISO 6937 did *not* encode non-spacing
combining mark *characters*, but only bit combinations
representing diacritics that in combination with basic
Latin letters created 2-octet sequences that encoded a
single graphic character.
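For contrast, here is roughly what "productive" looks like on the Unicode side, sketched with Python's standard unicodedata module. The choice of "q" as a base letter with no precomposed acute form is my own example, not something from the standard:

```python
import unicodedata

ACUTE = "\u0301"  # a character in its own right in Unicode

# The mark has a character name and a general category (non-spacing mark).
print(unicodedata.name(ACUTE))      # COMBINING ACUTE ACCENT
print(unicodedata.category(ACUTE))  # Mn

# Base letter first, then the mark. Where a precomposed character exists,
# NFC normalization folds the sequence into that single code point.
composed = unicodedata.normalize("NFC", "a" + ACUTE)
print(len(composed), hex(ord(composed)))  # 1 0xe1

# The same mark applies to a base letter with no precomposed form; the
# sequence simply stays as two code points. There is no closed list.
novel = unicodedata.normalize("NFC", "q" + ACUTE)
print(len(novel))  # 2
```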

O.k., get it?

> while
> Unicode allows arbitrary combinations. (This reflects the display
> technology available at the respective times of origin.)
>
> Now it just so happens that all compositions in ISO 6937/2 comprise
> only one diacritic (plus one base character, of course), which lets
> ISO 6937/2 appear similar to a multibyte coded character set; however,
> the intent apparently was a composition of one, or several, diacritics
> with a base character (cf. definition 3.19, quoted supra) -- only
> the original plans to encode characters for more languages (that may
> carry more than one diacritical mark) never have been realized.

The scope of ISO 6937 states, among other things, that it:

b) specifies a repertoire of the Latin alphabetic and non-alphabetic
    characters for the communication of text in many European
    languages using the Latin script;

I don't believe it ever was scoped to have universal intent for
its coverage, nor for it to go beyond the list of enumerated
accented characters now listed in Table C.1.

Sure, if ISO 6937 had seen wider implementation and been more
successful, at some point, making use of the limited set of
bit combinations representing diacritics, it might eventually
have been extended by adding a few combinations like g-hacek
and k-hacek for Skolt Sami, o-ogonek for Sami, h-hacek for
Finnish Romany, and so on. But I seriously doubt it would
have attempted to include a mechanism for two diacritics
on a letter, or to have any kind of productivity. The *characters*
were those listed in the repertoire, not the diacritics.

--Ken


This archive was generated by hypermail 2.1.5 : Tue Apr 11 2006 - 18:36:37 CST