RE: Decomposed vs Composed accented characters

From: Kent Karlsson (kent.karlsson14@comhem.se)
Date: Mon Apr 10 2006 - 13:33:48 CST

Next message: Kenneth Whistler: "Re: Decomposed vs Composed accented characters"

Previous message: Tay, William: "RE: Decomposed vs Composed accented characters"
In reply to: Otto Stolz: "Re: Decomposed vs Composed accented characters"
Next in thread: Tay, William: "RE: Decomposed vs Composed accented characters"

Otto Stolz wrote:
> Among other codes, I had mentioned ISO 6937:
...
> More specifically, this was from section 4.4 "Coded representations",
> subsection a "Accented letters and umlauts".
>
> Now, Kent Karlsson has written:
> > That text is at best misleading; I'd say it's completely wrong.
> > In actual fact, ISO/IEC 6937 does not encode any combining
> > characters, absolutely NONE whatsoever. Nor does it rely at all
> > on any kind of composition.
>
> I have quoted from the 1983 version of that standard. I have no
> easy access to its 1994, and 2001, versions. So, the parts that
> I have quoted may, or may not, have been superseded. If Kent
> Karlsson can quote the essential clauses from the current (2001)
> version that invalidate my old version, I will be glad to learn
> that the gist of that standard has completely been changed within
> two revisions.

No, no change. That misleading explanation of the design approach
is still there. See http://std.dkuug.dk/jtc1/sc2/open/02n3239.pdf
for a 1998 'Committee Draft' text.

> Definition from ISO 6937/1-1983:
> > 3.19 composite graphic symbol: A graphic symbol consisting of a
> > combination of two or more other graphic symbols in a single
> > character position, such as a diacritical mark and a basic letter,
> > for example ä.
>
> So, that version clearly conveys the notion of combining diacritic
> marks and base characters. This is exactly what William Tay had asked
> about; so I think it was important to mention that standard. Kent,
> thank you for reminding us of ISO 646, as well, which I had forgotten
> to mention.

Still misleading. The actual technical construction is that of lead
bytes (in the range C0-CF) that *indicate* the accents in the
*precomposed* characters encoded in 6937.

Look at the table of encoded characters in table 4. There is not a
single COMBINING character encoded, whether to be before or after
a base character. This multibyte encoding is constructed to look
like there is an "accent + base", but in actual fact that is not the
case.
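
To make the point concrete, here is a minimal Python sketch of how a
decoder for such an encoding behaves. It is an illustration only: the
name decode_6937_sketch is hypothetical, and the handful of byte pairs
in PAIR_TABLE are assumptions chosen for the example, not a copy of
table 4. What matters is the shape: the lead byte plus the trail byte
together select one entry from a closed table, and a pair that is not
in the table is simply an error, not an accent applied to a letter.

```python
# Illustrative subset only. The byte values below are assumptions for
# this sketch and should be checked against table 4 of the standard.
PAIR_TABLE = {
    (0xC1, 0x61): "\u00E0",  # a with grave
    (0xC2, 0x65): "\u00E9",  # e with acute
    (0xC8, 0x61): "\u00E4",  # a with diaeresis
    (0xC8, 0x6F): "\u00F6",  # o with diaeresis
    (0xC8, 0x20): "\u00A8",  # trail byte SPACE: the spacing diaeresis
}


def decode_6937_sketch(data: bytes) -> str:
    """Decode a byte string using only the pairs listed in PAIR_TABLE."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if 0xC0 <= byte <= 0xCF:
            # Lead byte: it must be followed by a trail byte, and the
            # pair as a whole must be one of the encoded characters.
            if i + 1 >= len(data):
                raise ValueError("lead byte without trail byte")
            pair = (byte, data[i + 1])
            if pair not in PAIR_TABLE:
                raise ValueError("no such double-byte character: %02X %02X" % pair)
            out.append(PAIR_TABLE[pair])
            i += 2
        elif byte < 0x80:
            # Single-byte character, treated as ASCII in this sketch.
            out.append(chr(byte))
            i += 1
        else:
            raise ValueError("byte %02X is outside the scope of this sketch" % byte)
    return "".join(out)
```

With this table, decode_6937_sketch(b"B\xc8ar") yields "Bär", while
the pair C8 71 ("diaeresis" lead byte followed by q) is rejected,
because no such character exists in the table.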

So, table 4 is the key here. Not the somewhat clumsy explanation of
the overall design (sometimes sidestepped) of the multibyte encoding.

> Kent Karlsson also has written:
> > But [in ISO/IEC 6937] the lead byte NEVER encodes any combining
> > character.
>
> I cannot understand the distinction Kent draws between a "non-spacing
> diacritical mark" (cf. quote from ISO 6937/2, supra), and a "combining
> character". It is just a technical detail, whether the base character
> is encoded first (as in Unicode), or last (as in ISO 6937).

Look at table 4.

> > [ISO/IEC 6937] is a multibyte encoding, where lead bytes (with the
> > 8th bit set) sort of indicate the accent of the character (but that
> > does not always hold true) and the trail byte (if a double-byte code)
> > indicates the base character (except when the trail byte is the one
> > for space).
>
> The essential difference between ISO 6937 and Unicode is that
> ISO 6937 defines a closed inventory of combined characters, while

There are no "combined" characters in 6937. There are quite a number
of what Unicode calls *precomposed* characters, except that there is no
composition in 6937.

> Unicode allows arbitrary combinations. (This reflects the display
> technology available at the respective times of origin.)
>
> Now it just so happens that all compositions in ISO 6937/2 comprise
> only one diacritic (plus one base character, of course), which lets
> ISO 6937/2 appear similar to a multibyte coded character set; however,
> the intent apparently was a composition of one, or several, diacritics
> with a base character (cf. definition 3.19, quoted supra) -- only
> the original plans to encode characters for more languages (that may
> carry more than one diacritical mark) never have been realized.

6937 *is* a multibyte coded character encoding. If you don't look
closely enough, it appears similar to an encoding with combining
characters (given before the base), but it is definitely not one. Look
at table 4 again.
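
For contrast, this is roughly what real combining characters look like
on the Unicode side. The sketch below uses only Python's standard
unicodedata module; the variable names are just for illustration. A
combining mark is a character in its own right, it follows the base,
and it can be attached to any base, whether or not a precomposed
character exists for the result.

```python
import unicodedata

# U+0308 is an encoded character with a nonzero combining class.
diaeresis = "\u0308"
mark_name = unicodedata.name(diaeresis)
mark_class = unicodedata.combining(diaeresis)

# Base + combining mark; NFC replaces the sequence with the
# precomposed character because one exists.
composed = unicodedata.normalize("NFC", "a" + diaeresis)

# No precomposed "q with diaeresis" exists, so the sequence stays as
# two code points, and it is still perfectly valid text.
arbitrary = unicodedata.normalize("NFC", "q" + diaeresis)

print(mark_name, mark_class)
print(len(composed), len(arbitrary))
```

Nothing corresponding to the second case can be written in an encoding
whose double-byte codes are entries in a fixed table.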

/kent k


This archive was generated by hypermail 2.1.5 : Tue Apr 11 2006 - 18:32:25 CST