Re: Printing and Displaying Dependent Vowels

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Mar 29 2004 - 19:28:08 EST

    Peter Kirk said:

    > I will say again as I have said before - but the above (and what I
    > snipped) is extra evidence for it - that what is broke ... is
    > the rule that the isolated (generally spacing) form of a combining mark
    > should be formed by SPACE or NBSP followed by the combining mark.

    This has been the *intent* of the standard since its inception in
    1989.

    > There
    > are many good reasons for not using SPACE for this, including default
    > behaviour like inserting line breaks immediately after SPACE.

    Nope. UAX #14 specifies the following regarding SPACE followed by
    combining marks:

    "If U+0020 SPACE is used as a base character, it is treated as AL
    instead of SP."

    This means that a combining character sequence of this type is treated
    as a unit for the purposes of line breaking, and this overrides the
    usual treatment of SPACE as a line break opportunity. Of course
    UAX #14 only spells out default behavior, but then "default
    behaviour" is what was claimed just above.
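
    As a minimal sketch of the sentence quoted from UAX #14 above, the
    following classifies a SPACE as AL when a combining mark follows it
    and as SP otherwise. The helper names (is_combining, space_class) are
    hypothetical, and this models only that one quoted sentence, not the
    full line breaking algorithm or any particular revision of UAX #14.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # General categories Mn, Mc and Me together cover the combining marks.
    return unicodedata.category(ch).startswith("M")


def space_class(text: str, i: int) -> str:
    """Return a simplified line-break class for the SPACE at text[i].

    Assumption: only the quoted rule is modelled. A SPACE used as a base
    character (i.e. immediately followed by a combining mark) is AL,
    any other SPACE is SP.
    """
    if text[i] != " ":
        raise ValueError("text[i] is not U+0020 SPACE")
    nxt = text[i + 1] if i + 1 < len(text) else ""
    return "AL" if nxt and is_combining(nxt) else "SP"


if __name__ == "__main__":
    print(space_class("a \u0301b", 1))  # SPACE followed by COMBINING ACUTE ACCENT
    print(space_class("a b", 1))        # ordinary SPACE between letters
```

    Under this reading, no break opportunity arises from the SPACE in
    <SP, CM>, because the sequence behaves like an ordinary alphabetic
    unit.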

    > Using NBSP rather than SPACE has several advantages, and has long been
    > specified in Unicode, although not widely implemented. It is less likely
    > to occur accidentally. But it has disadvantages, especially that it will
    > always be a spacing character, whereas for display of isolated Indic
    > vowels no extra spacing is required.

    NBSP is not a fixed-width space.

    > I would like to repeat my earlier proposal for a new character ISOLATED
    > COMBINING MARK BASE. This character would have no glyph, and the general
    > properties of a letter. Its spacing would be just as much as required
    > for proper display of the combining mark - which would be zero for
    > combining marks which have their own width.

    And after 15 years' presence in the standard (or its earlier drafts)
    of the SP + CM recommendation, what makes you think that introduction
    of a *new* convention using a *new*, special purpose format control
    character sorta like a space only different, would lead to any
    better situation in actual practice? Use of such a character would
    *NOT* resolve the differences regarding how to display such a
    combination, by the way.

    > I realise that for backward compatibility reasons the old encoding
    > cannot be made illegal. But it can be deprecated, and a note can be
    > added that this sequence may not always be displayed as preferred.

    This is a recipe for prolonging the confusion and inconsistency in
    implementations of this feature.

    > You can't get away with it that easily. If the standard specifies that
    > <space, combining mark> should be displayed as an isolated combining
    > mark, then it would be conformant for a partial implementation to
    > display this sequence as nothing or as an illegal sequence. But if the
    > system attempts to display the sequence in a meaningful manner, it must
    > do so according to the standard, i.e. not as dotted circle plus
    > combining mark.

    The standard does not *require* this rendering or anything else. For
    the most part, the Unicode Standard is *NOT* a text rendering
    standard -- it is a character encoding standard. All kinds of
    recommendations are put in regarding how to handle one kind or another
    of rendering problem, precisely so that every implementer doesn't
    start from scratch to reinvent the wheel here, and so as to provide
    some basis for people to represent the same text content with the
    same "spellings" for complex scripts.

    There are reasons why such recommendations are found in Chapters 7
    (and 5 and 2) of the standard, and are not nailed down with
    conformance clauses in Chapter 3. The UTC has, over the years, not
    found it appropriate to try to make normative requirements on the
    details of text display, except insofar (as in the Bidirectional
    Algorithm) as they have a direct bearing on the interpretation of
    the logical content of the text itself.

    > Well, as I understand it NBSP is often expected to be a fixed-width
    > space, and it is in many implementations. In fact I think it ought to
    > be, whether or not this is actually specified. But there ought to be a
    > character which is explicitly NOT fixed width to carry NSMs.

    There are *two* such characters: SPACE and NBSP.
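
    One way to see how closely the two are related is to look at the
    character properties of NBSP. The sketch below uses Python's standard
    unicodedata module. It shows only encoded properties and says nothing
    about how any particular renderer chooses the width of either
    character.

```python
import unicodedata

NBSP = "\u00A0"

# NBSP is defined as a no-break variant of U+0020 SPACE.
print(unicodedata.name(NBSP))
print(unicodedata.decomposition(NBSP))

# Compatibility normalization folds NBSP to a plain SPACE.
FOLDED = unicodedata.normalize("NFKC", NBSP)
print(FOLDED == " ")
```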

    John Cowan noted:

    > Well, it depends on what the equivoque "combining marks" in the title of
    > Section 7.7 means.

    and then quoted the relevant text from p. 187. By the way, the first
    part of that text has survived almost verbatim from Unicode 1.0, where
    it was printed on p. 40 in what was then Chapter 3, Character Blocks.
    It was written there as part of the section "Generic Diacritical
    Marks U+0300 --> U+036F", as that was the most obviously apropos
    point in the text at the time. The text of the standard has since
    been morphed, restructured, and extensively added to, but some of
    its quirks result from the fact that the text has a *history*, and
    it isn't completely rewritten every time a new book is published.

    The intent of the UTC and the editors has always seemed clear to
    me on this particular point -- and the fact that the text in
    question has survived 3 major reeditings of the entire standard
    without significant change indicates to me that this has not been
    a problematical part of the standard for the UTC.

    > So assuming that "combining mark" means "combining character" rather than
    > "non-spacing mark" (the term does not appear in the Glossary), it seems that
    > combining vowels should work fine with SP or NBSP.

    This, however, is a textual problem which should be addressed.
    As it stands, Section 7.7, Combining Marks deals with various
    types of combining characters, including non-spacing combining
    marks and enclosing combining characters. It does not say
    anything explicit about Indic dependent vowels, in part because
    of its textual history.

    Peter Kirk continued:

    > But it is a source of great confusion to
    > everyone when a widely used application does something clearly different
    > from what the standard intends, and yet claims "conformance" even if
    > technically this is correct.

    What the standard intends is that the textual representation (encoding)
    of an isolated combining mark be done via the sequence <SP, CM>.
    It does not *require* or *not require* that the visual rendering
    of such a sequence be done with or without a dotted circle indicating
    the absence of an expected normal base letter. In fact, the standard
    itself makes widespread and explicit use of the convention to display
    such combinations *with* a dotted circle.
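
    A minimal sketch of that encoding, assuming Python and its standard
    unicodedata module: the hypothetical helper isolated() builds the
    <SP, CM> (or <NBSP, CM>) sequence for a single combining mark. Note
    that a test on the canonical combining class alone would miss marks
    such as U+0BC6, whose class is 0, so the sketch checks the general
    category instead. How the resulting sequence is drawn, with or
    without a dotted circle, is up to the renderer and is not modelled
    here.

```python
import unicodedata

SPACE, NBSP = "\u0020", "\u00A0"


def isolated(mark: str, base: str = SPACE) -> str:
    """Build the sequence <base, mark> for an isolated combining mark."""
    if len(mark) != 1 or not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a single combining mark (Mn, Mc or Me)")
    if base not in (SPACE, NBSP):
        raise ValueError("base must be U+0020 SPACE or U+00A0 NO-BREAK SPACE")
    return base + mark


if __name__ == "__main__":
    # A non-spacing mark, an enclosing mark and an Indic dependent vowel.
    for mark in ("\u0301", "\u20DD", "\u0BC6"):
        seq = isolated(mark)
        print(
            unicodedata.name(mark),
            unicodedata.category(mark),
            unicodedata.combining(mark),
            [f"U+{ord(c):04X}" for c in seq],
        )
```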

    > It seems, from what Srivas (Avarangal) wrote, to be part of the
    > requirement for correct display of Tamil, and perhaps other Indic
    > languages, to be able to display isolated forms of such characters as
    > U+0BC6. If Uniscribe does not support this, even if it is technically
    > Unicode conformant, Microsoft cannot claim to support Tamil and other
    > languages.

    It is a *meta*requirement, required for text *about* the writing
    system. That may be an important requirement, but it is a specialized
    requirement, and it is silly to turn that into a claim that
    "Microsoft cannot claim to support Tamil and other languages."

    That's as silly as claiming that a JIS X 0208 conformant computer
    system does not support Japanese because it doesn't have a specified
    way to produce stroke-order books for learning to write, which show
    Japanese characters written one stroke at a time. Yes, you can
    show a genuine need to produce such publications in Japan, but
    that doesn't mean that the character encoding standard has to spell
    out how to produce them.

    > But a claim to support particular scripts or languages
    > surely implies that all characters in that script (or at least in its
    > modern form) are supported. That is not perhaps a Unicode requirement,
    > but at least in the UK a failure here might be a breach of laws on
    > truthful advertising and description of products.

    Puh-leeez.

    --Ken


