Re: Questions on ZWNBS - for line initial holam plus alef

From: Peter Kirk
Date: Mon Aug 11 2003 - 19:28:19 EDT

    On 11/08/2003 12:26, Kenneth Whistler wrote:

    >Peter Kirk wrote:
    >>I think this may be a "Peter mistake". I meant to refer to spacing
    >>diacritics. Sorry.
    >>It is certainly highly inappropriate for spacing diacritics to
    >>be considered word boundaries.
    >Why? It is entirely dependent on the orthography and conventions
    >involved. ...
    Well, agreed, there may be orthographic conventions in which a spacing
    diacritic is considered a word boundary or a break opportunity e.g. if
    used like a hyphen. But there are other mechanisms for forcing a word
    boundary where otherwise there would not be one. Are there any to suppress a
    word boundary? Perhaps I need to encode <WJ, space, diacritic, WJ> to
    avoid the word boundary implication? Would this work?
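
    To make the proposed sequence concrete, here is a minimal Python sketch
    that builds <WJ, space, diacritic, WJ> and lists each code point with its
    name and general category. The choice of COMBINING ACUTE ACCENT as the
    diacritic and the helper name describe() are assumptions for illustration
    only. The sketch does not answer whether a given implementation would
    actually suppress the word boundary.

```python
import unicodedata

WJ = "\u2060"        # WORD JOINER
DIACRITIC = "\u0301" # COMBINING ACUTE ACCENT (illustrative choice, not from the thread)

# The sequence asked about above: <WJ, space, diacritic, WJ>
sequence = WJ + " " + DIACRITIC + WJ

def describe(text):
    """Return (code point, character name, general category) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
            for c in text]

for row in describe(sequence):
    print(row)
```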

    >... There is probably as much (or more) bad ASCII usage
    >of spacing diacritics like `this', where a grave accent character
    >is being misapplied to make a directional quotation mark, as
    >there is actual, linguistically appropriate use of spacing
    But this is an abuse of the spacing diacritic as punctuation. Proper,
    linguistically appropriate use of spacing diacritics should not be
    broken in order to support abuse. Or, if the standard wants to support
    such abuse, we can reserve <space, diacritic> for the abuse and define
    a new character XXX such that <XXX, diacritic> has the properties for
    the linguistically appropriate use.

    >Also, everyone should consider carefully the status of UAX #29,
    >Text Boundaries.
    >2 Conformance
    >This is informative material. There are many different ways to
    >divide text elements corresponding to grapheme clusters, words
    >and sentences, and the Unicode Standard and this document do not
    >restrict the ways in which implementations can do this.
    >This specification is a *default* mechanism;
    >more sophisticated engines can and should tailor it for particular
    >locales or environments. ...
    >The whole UAX is informative. ...
    Then let it be correctly informative and not full of misinformation. And
    let its default mechanism and recommendations be appropriate for the
    majority of uses, including such cases as lists of diacritics which may
    occur in any orthography.

    Ken, it seems to me all the more clearly from looking at the latest
    batch of postings on this list that the <space, diacritic> mechanism
    defined by Unicode is fundamentally flawed. It works, but it creates a
    serious and needless complication for all kinds of other processes,
    including rendering and higher level processes. These processes cannot
    simply take a space as a space and process it as such. Every time they
    come across a space (which is very often!) they have to test whether it
    is followed by a combining character, and if it is they have to treat
    that space specially. This has created a serious problem for
    implementers, which is why they have produced non-conforming
    implementations - and we are not talking about small companies which
    have rushed into the market recently, we are talking about Microsoft,
    among others, which has been sponsoring Unicode from the start, I
    understand. Surely the UTC should not create difficulties for
    implementers and then just shout at them for getting things wrong. The
    UTC should try to produce a standard which is workable without
    unnecessary complications.
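
    To illustrate the extra test I am describing, here is a rough Python
    sketch of the lookahead a process has to perform at every space. The
    function name and the bases parameter are hypothetical, and treating
    "general category starts with M" as the definition of a combining
    character is a simplifying assumption, not something taken from the
    standard's text.

```python
import unicodedata

def spaces_needing_special_handling(text, bases=(" ", "\u00A0")):
    """Return the indices of SPACE/NBSP characters that are immediately
    followed by a combining mark (general category Mn, Mc or Me)."""
    hits = []
    # Stop one short of the end: the last character has nothing after it.
    for i, ch in enumerate(text[:-1]):
        if ch in bases and unicodedata.category(text[i + 1]).startswith("M"):
            hits.append(i)
    return hits

# Three spaces, but only the first one carries a diacritic.
sample = "a \u0301 b c"
print(spaces_needing_special_handling(sample))
```

    The point is not that the check is hard to write, but that it has to be
    made at every single space before the space can be handled as an
    ordinary space.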

    I agree that it works better to use NBSP here. There are fewer such
    problems, but they have not gone away entirely. And NBSP is more likely
    to be treated by implementers (in the absence of other guidelines from
    Unicode) as fixed width, not trimmed to the width needed for the diacritic.

    Peter Kirk

    This archive was generated by hypermail 2.1.5 : Mon Aug 11 2003 - 21:14:14 EDT