From: Peter Kirk (email@example.com)
Date: Tue Mar 30 2004 - 06:25:04 EST
On 29/03/2004 16:28, Kenneth Whistler wrote:
>>Using NBSP rather than SPACE has several advantages, and has long been
>>specified in Unicode, although not widely implemented. It is less likely
>>to occur accidentally. But it has disadvantages, especially that it will
>>always be a spacing character, whereas for display of isolated Indic
>>vowels no extra spacing is required.
>NBSP is not a fixed-width space.
Yes it is, in Unicode 4.0.0. Ernest quoted from UAX #14 "All other space
characters have fixed width." This may be in the standard by mistake,
but it is in the standard. Asmus says that this will be changed in
4.0.1, but that has not yet been released. If a statement is written in
a standard, even in the introduction to a different section, that is
>>I would like to repeat my earlier proposal for a new character ISOLATED
>>COMBINING MARK BASE. This character would have no glyph, and the general
>>properties of a letter. Its spacing would be just as much as required
>>for proper display of the combining mark - which would be zero for
>>combining marks which have their own width.
>And after 15 years presence in the standard (or its earlier drafts)
>of the SP + CM recommendation, what makes you think that introduction
>of a *new* convention using a *new*, special purpose format control
>character sorta like a space only different, would lead to any
>better situation in actual practice? Use of such a character would
>*NOT* resolve the differences regarding how to display such a
>combination, by the way.
I would be happy for NBSP to be used in this way, now that it has been
clarified that this should not be considered fixed width when followed
by a combining mark. I would like to see a clear recommendation (not a
conformance requirement, I agree) that the sequence <NBSP, non-spacing
combining mark> should be rendered as a spacing version of the mark with
just enough space for the mark and no added glyph. My reason for
preferring NBSP to SPACE is that it is unambiguously non-breaking and (I
think) not a word boundary.
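
To make the two candidate encodings concrete, here is a minimal Python sketch that builds <SPACE, mark> and <NBSP, mark> and lists each code point with its name and general category. The helper names isolated_mark and describe are hypothetical, chosen only for this illustration, and nothing here says anything about how a renderer will draw the sequence.

```python
import unicodedata

NBSP = "\u00a0"   # NO-BREAK SPACE
SPACE = "\u0020"  # SPACE


def isolated_mark(mark: str, base: str = NBSP) -> str:
    """Return the sequence <base, mark> for an isolated combining mark.

    Hypothetical helper: it only concatenates the two code points.
    """
    return base + mark


def describe(seq: str) -> list[tuple[str, str, str]]:
    """List (code point, character name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in seq
    ]


acute = "\u0301"  # COMBINING ACUTE ACCENT, a non-spacing mark (Mn)
for base in (SPACE, NBSP):
    print(describe(isolated_mark(acute, base)))
```

Both bases report general category Zs, so the difference I care about lies in the line-breaking behaviour, which Python's unicodedata module does not expose.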
But this doesn't solve the problem for Tamil etc., as what is needed there is
a non-spacing non-breaking base character which can allow the vowel to
display without the dotted circle. Perhaps ZWJ would be suitable.
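
As a rough illustration of the Tamil case, the sketch below classifies U+0BC6 by general category and builds the three base sequences discussed in this thread. The names mark_kind, candidates and CANDIDATE_BASES are hypothetical, and the ZWJ entry is only my tentative suggestion above, not an established convention; whether any renderer suppresses the dotted circle for these sequences is not something this code can show.

```python
import unicodedata

TAMIL_VOWEL_SIGN_E = "\u0bc6"

# Candidate base characters from the discussion (ZWJ is a tentative idea only).
CANDIDATE_BASES = {
    "SPACE": "\u0020",
    "NBSP": "\u00a0",
    "ZWJ": "\u200d",
}


def mark_kind(ch: str) -> str:
    """Classify a character as a non-spacing, spacing or enclosing mark."""
    kinds = {"Mn": "non-spacing", "Mc": "spacing", "Me": "enclosing"}
    return kinds.get(unicodedata.category(ch), "not a mark")


def candidates(mark: str) -> dict[str, str]:
    """Return each candidate <base, mark> sequence, keyed by the base's label."""
    return {label: base + mark for label, base in CANDIDATE_BASES.items()}


print(unicodedata.name(TAMIL_VOWEL_SIGN_E), mark_kind(TAMIL_VOWEL_SIGN_E))
for label, seq in candidates(TAMIL_VOWEL_SIGN_E).items():
    print(label, [f"U+{ord(ch):04X}" for ch in seq])
```

U+0BC6 is a spacing mark (Mc), so it already has its own width; that is why a base which adds no width of its own is what is wanted here.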
>>Well, as I understand it NBSP is often expected to be a fixed-width
>>space, and it is in many implementations. In fact I think it ought to
>>be, whether or not this is actually specified. But there ought to be a
>>character which is explicitly NOT fixed width to carry NSMs.
>There are *two* such characters: SPACE and NBSP.
You mean, there will be in 4.0.1. The problem with SPACE is a different one.
>The intent of the UTC and the editors has always seemed clear to
>me on this particular point -- and the fact that the text in
>question has survived 3 major reeditings of the entire standard
>without significant change indicates to me that this has not been
>a problematical part of the standard for the UTC.
Well, a text needs to be clear to its readers, not just to its authors.
Obviously this text is not clear to readers, even ones as experienced as
John Cowan, and so needs clarification.
>>So assuming that "combining mark" means "combining character" rather than
>>"non-spacing mark" (the term does not appear in the Glossary), it seems that
>>combining vowels should work fine with SP or NBSP.
>This, however, is a textual problem which should be addressed.
>As it stands, Section 7.7, Combining Marks deals with various
>types of combining characters, including non-spacing combining
>marks and enclosing combining characters. It does not say
>anything explicit about Indic dependent vowels, in part because
>of its textual history.
In that case something clear and sensible needs to be added about Indic
>Peter Kirk continued:
>>But it is a source of great confusion to
>>everyone when a widely used application does something clearly different
>>from what the standard intends, and yet claims "conformance" even if
>>technically this is correct.
>What the standard intends is that the textual representation (encoding)
>of an isolated combining mark be done via the sequence <SP, CM>.
>It does not *require* or *not require* that the visual rendering
>of such a sequence be done with or without a dotted circle indicating
>the absence of an expected normal base letter. In fact, the standard
>itself makes widespread and explicit use of the convention to display
>such combinations *with* a dotted circle.
Well, the standard clearly intends that the character for "a" is
rendered with the glyph "a" and not the glyph "b". It may not formally
require this, but a system which breaks this rule, while possibly
formally conformant, can hardly claim to support Unicode properly.
One convention for display of isolated combining marks is to use a
dotted circle. But this convention is far from universal across all
writing systems. It is wrong to impose it on all systems - except
perhaps in such a context as the Unicode standard text and character
charts where different systems are compared. It is clear that there is
sometimes (and even in Latin script) a requirement to display isolated
combining marks without dotted circles.
>>It seems, from what Srivas (Avarangal) wrote, to be part of the
>>requirement for correct display of Tamil, and perhaps other Indic
>>languages, to be able to display isolated forms of such characters as
>>U+0BC6. If Uniscribe does not support this, even if it is technically
>>Unicode conformant, Microsoft cannot claim to support Tamil and other
>>languages.
>It is a *meta*requirement, required for text *about* the writing
>system. That may be an important requirement, but it is a specialized
>requirement, and it is silly to turn that into a claim that
>"Microsoft cannot claim to support Tamil and other languages."
I don't accept that this is a specialised requirement or
"*meta*requirement". Potentially, any text which includes a list of
characters in the language or script is likely, at least for certain
scripts, to include isolated dependent vowels. Such texts include all
dictionaries, encyclopedias, language learning and literacy materials
etc etc, and even all books with indexes. There are also cases of
isolated dependent vowels being used in variant spellings, abbreviations
etc in other texts. Such texts counted together are likely to constitute
a high proportion of the total corpus in many languages.
I would say that if specific products do not support dictionaries,
indexes or literacy primers in Tamil, they cannot claim to support Tamil.
--
Peter Kirk
firstname.lastname@example.org (personal)
email@example.com (work)
http://www.qaya.org/