From: Kenneth Whistler (email@example.com)
Date: Mon Mar 29 2004 - 19:28:08 EST
Peter Kirk said:
> I will say again as I have said before - but the above (and what I
> snipped) is extra evidence for it - that what is broke ... is
> the rule that the isolated (generally spacing) form of a combining mark
> should be formed by SPACE or NBSP followed by the combining mark.
This has been the *intent* of the standard since its inception in
> are many good reasons for not using SPACE for this, including default
> behaviour like inserting line breaks immediately after SPACE.
Nope. UAX #14 specifies the following regarding SPACE followed by
a combining mark:
"If U+0020 SPACE is used as a base character, it is treated as AL
instead of SP."
This means that a combining character sequence of this type is treated
as a unit for the purposes of line breaking, and this overrides the
behavior SPACE would otherwise have, namely to be treated as a line break
opportunity. Of course UAX #14 only spells out default behavior,
but then "default behaviour" is what was claimed just above.
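
The quoted rule can be illustrated with a minimal sketch. It models only
the one rule quoted above (a SPACE that serves as the base of a combining
mark is not a break opportunity) and is in no way an implementation of
UAX #14; the function names are hypothetical, and the actual rules in any
given revision of UAX #14 may differ.

```python
import unicodedata


def is_combining_mark(ch):
    # General_Category Mn, Mc or Me all start with "M".
    return unicodedata.category(ch).startswith("M")


def break_opportunities(text):
    """Return the indices of SPACE characters after which a line break
    would be allowed, under the simplified rule sketched here:
    a SPACE normally allows a break, unless it is the base of a
    combining mark, in which case it behaves like an ordinary letter (AL).
    """
    result = []
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        if i + 1 >= len(text):
            continue  # nothing follows, so no break is needed here
        if is_combining_mark(text[i + 1]):
            continue  # SPACE is a base character: treated as AL, not SP
        result.append(i)
    return result


plain = "a b"             # ordinary SPACE between two letters
isolated = "a \u0301b"    # SPACE + COMBINING ACUTE ACCENT as a unit
print(break_opportunities(plain))
print(break_opportunities(isolated))
```

Under these assumptions, the ordinary SPACE yields a break opportunity,
while the SPACE carrying the combining mark does not.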
> Using NBSP rather than SPACE has several advantages, and has long been
> specified in Unicode, although not widely implemented. It is less likely
> to occur accidentally. But it has disadvantages, especially that it will
> always be a spacing character, whereas for display of isolated Indic
> vowels no extra spacing is required.
NBSP is not a fixed-width space.
> I would like to repeat my earlier proposal for a new character ISOLATED
> COMBINING MARK BASE. This character would have no glyph, and the general
> properties of a letter. Its spacing would be just as much as required
> for proper display of the combining mark - which would be zero for
> combining marks which have their own width.
And after 15 years presence in the standard (or its earlier drafts)
of the SP + CM recommendation, what makes you think that introduction
of a *new* convention using a *new*, special purpose format control
character sorta like a space only different, would lead to any
better situation in actual practice? Use of such a character would
*NOT* resolve the differences regarding how to display such a
combination, by the way.
> I realise that for backward compatibility reasons the old encoding
> cannot be made illegal. But it can be deprecated, and a note can be
> added that this sequence may not always be displayed as preferred.
This is a recipe for prolonging the confusion and inconsistency in
implementations of this feature.
> You can't get away with it that easily. If the standard specifies that
> <space, combining mark> should be displayed as an isolated combining
> mark, then it would be conformant for a partial implementation to
> display this sequence as nothing or as an illegal sequence. But if the
> system attempts to display the sequence in a meaningful manner, it must
> do so according to the standard, i.e. not as dotted circle plus
> combining mark.
The standard does not *require* this rendering or anything else. For
the most part, the Unicode Standard is *NOT* a text rendering
standard -- it is a character encoding standard. All kinds of
recommendations are put in regarding how to handle one kind or another
of rendering problem, precisely so that every implementer doesn't
start from scratch to reinvent the wheel here, and so as to provide
some basis for people to represent the same text content with the
same "spellings" for complex scripts.
There are reasons why such recommendations are found in Chapters 7
(and 5 and 2) of the standard, and are not nailed down with
conformance clauses in Chapter 3. The UTC has, over the years, not
found it appropriate to try to make normative requirements on the
details of text display, except insofar (as in the Bidirectional
Algorithm) as they have a direct bearing on the interpretation of
the logical content of the text itself.
> Well, as I understand it NBSP is often expected to be a fixed-width
> space, and it is in many implementations. In fact I think it ought to
> be, whether or not this is actually specified. But there ought to be a
> character which is explicitly NOT fixed width to carry NSMs.
There are *two* such characters: SPACE and NBSP.
John Cowan noted:
> Well, it depends on what the equivoque "combining marks" in the title of
> Section 7.7 means.
and then quoted the relevant text from p. 187. By the way, the first
part of that text has survived almost verbatim from Unicode 1.0, where
it was printed on p. 40 in what was then Chapter 3, Character Blocks.
It was written there as part of the section "Generic Diacritical
Marks U+0300 --> U+036F", as that was the most obviously apropos
point in the text at the time. The text of the standard has since
been morphed, restructured, and extensively added to, but some of
its quirks result from the fact that the text has a *history*, and
it isn't completely rewritten every time a new book is published.
The intent of the UTC and the editors has always seemed clear to
me on this particular point -- and the fact that the text in
question has survived 3 major reeditings of the entire standard
without significant change indicates to me that this has not been
a problematical part of the standard for the UTC.
> So assuming that "combining mark" means "combining character" rather than
> "non-spacing mark" (the term does not appear in the Glossary), it seems that
> combining vowels should work fine with SP or NBSP.
This, however, is a textual problem which should be addressed.
As it stands, Section 7.7, Combining Marks deals with various
types of combining characters, including non-spacing combining
marks and enclosing combining characters. It does not say
anything explicit about Indic dependent vowels, in part because
of its textual history.
Peter Kirk continued:
> But it is a source of great confusion to
> everyone when a widely used application does something clearly different
> from what the standard intends, and yet claims "conformance" even if
> technically this is correct.
What the standard intends is that the textual representation (encoding)
of an isolated combining mark be done via the sequence <SP, CM>.
It does not *require* or *not require* that the visual rendering
of such a sequence be done with or without a dotted circle indicating
the absence of an expected normal base letter. In fact, the standard
itself makes widespread and explicit use of the convention to display
such combinations *with* a dotted circle.
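
As a concrete illustration of the distinction between encoding and
rendering, here is a minimal sketch of how such sequences would be
constructed. The helper name `isolated` is hypothetical; the code only
builds the character sequences and says nothing about how any renderer
will display them.

```python
import unicodedata

SPACE = "\u0020"
NBSP = "\u00A0"


def isolated(mark, base=SPACE):
    """Build the <SP, CM> (or <NBSP, CM>) sequence for an isolated
    combining mark. Raises ValueError if `mark` is not a combining mark
    (General_Category Mn, Mc or Me)."""
    if len(mark) != 1 or not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a single combining mark")
    return base + mark


# COMBINING ACUTE ACCENT (a non-spacing mark, Mn) on SPACE.
acute = isolated("\u0301")

# TAMIL VOWEL SIGN E (a spacing combining mark, Mc) on NBSP.
tamil_e = isolated("\u0BC6", base=NBSP)

for seq in (acute, tamil_e):
    print([f"U+{ord(c):04X} {unicodedata.category(c)}" for c in seq])
```

Whether a given renderer shows these sequences with or without a dotted
circle is outside what the encoded text itself determines.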
> It seems, from what Srivas (Avarangal) wrote, to be part of the
> requirement for correct display of Tamil, and perhaps other Indic
> languages, to be able to display isolated forms of such characters as
> U+0BC6. If Uniscribe does not support this, even if it is technically
> Unicode conformant, Microsoft cannot claim to support Tamil and other
> languages.
It is a *meta*requirement, required for text *about* the writing
system. That may be an important requirement, but it is a specialized
requirement, and it is silly to turn that into a claim that
"Microsoft cannot claim to support Tamil and other languages."
That's as silly as claiming that a JIS X 0208 conformant computer
system does not support Japanese because it doesn't have a specified
way to write stroke-order writing learning books that show
Japanese characters written one stroke at a time. Yes, you can
show a genuine need to produce such publications in Japan, but
that doesn't mean that the character encoding standard has to spell
out how to produce them.
> But a claim to support particular scripts or languages
> surely implies that all characters in that script (or at least in its
> modern form) are supported. That is not perhaps a Unicode requirement,
> but at least in the UK a failure here might be a breach of laws on
> truthful advertising and description of products.