Re: Questions on ZWNBS - for line initial holam plus alef

From: Philippe Verdy
Date: Mon Aug 11 2003 - 05:54:25 EDT

    On Monday, August 11, 2003 2:05 AM, Kenneth Whistler wrote:

    > Um, no. Precisely because it would introduce *another* way
    > to do what is already specified in the standard. It would, I
    > predict, lead to nothing but more trouble.
    > You might, perhaps, find it satisfying, but I can guarantee
    > that there would then be a future critic complaining about
    > an unnecessary distinction introduced into the standard. And
    > then there would be *more* text in different places of the
    > standard to try to correct and change, in an attempt to
    > try to make consistent distinctions between the behavior
    > of <SPACE, NSM> and <ACCENT_ANCHOR, NSM>.

    I don't think so: texts that are already coded with SPACE+NSM
    would not need to change, as long as the applications using
    them are satisfied with the existing behavior, even if it is ambiguous
    or causes problems in other applications. The rule would be not to
    change existing things, but to offer writers a way to create new texts
    without those ambiguities and problems, and to correct old texts if
    their authors wish.

    For me, the "ACCENT ANCHOR", if you call it that, solves the
    use of isolated diacritics as plain letters (such as the implied missing
    y in Hebrew Yerushala(y)im), and so it would behave like an alphabetic
    character (whose directionality is still to be defined...)

    Existing coded spacing diacritics are coded as symbols (Sk), mostly
    for accents used in LTR scripts, so the fact that some UAXes treat
    these symbols like letters by giving them the AL property (including
    in one case of SPACE+NSM) is not a problem.
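
    A minimal Python sketch of the distinction above, using only the
    standard unicodedata module. It prints the name, general category and
    bidirectional class of a few spacing diacritics and of their combining
    (NSM) counterparts. The choice of sample characters and the helper
    name describe() are assumptions made for illustration; the values
    come from whatever Unicode version the local Python ships with.

```python
import unicodedata

# Spacing diacritics (coded as symbols) paired with their combining forms.
# The sample pairs are illustrative, not an exhaustive list.
PAIRS = [
    ("\u00b4", "\u0301"),  # ACUTE ACCENT / COMBINING ACUTE ACCENT
    ("\u0060", "\u0300"),  # GRAVE ACCENT / COMBINING GRAVE ACCENT
    ("\u00a8", "\u0308"),  # DIAERESIS / COMBINING DIAERESIS
]


def describe(ch):
    """Return (name, general category, bidi class) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


for spacing, combining in PAIRS:
    print(describe(spacing))    # category Sk: a modifier symbol
    print(describe(combining))  # category Mn, bidi class NSM
```

    The spacing forms report the symbol category Sk, while the combining
    forms report Mn with bidirectional class NSM, which is why an isolated
    NSM needs some base character in front of it.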

    The usage as symbols is mostly correct for the case where a text is
    speaking about a diacritic as an isolated symbol and not within words
    (this is correct for most languages).

    The usage within words (for an implied missing base letter, including
    when this missing letter is an initial) leaves a distinct hole (for example
    if one were trying to encode a word like "(Y)erushala(y)im", where the
    missing base letter is the initial). For Arabic and South Asian
    scripts, there's no problem, as there already is a base letter to hold
    initial combining vowel signs, which also works for the case of multiple
    combining vowels which should not stack but be written on this base
    letter. In fact, in those scripts the missing consonantal base letter
    is actually written with a visible glyph.

    But for Latin, Cyrillic, Greek, Hebrew, and probably other scripts,
    isolated diacritics lack an explicit coded form. And even for Arabic
    and Brahmic scripts there is still a need to be able to speak about
    the diacritic itself, without an explicit base letter, so the SPACE+NSM
    combining sequence is for now the only solution, with the problems
    caused by its undocumented properties.

    Reread some UAXes to see the problematic impact of SPACE+NSM in
    areas which are NOT related to rendering, notably extracting word
    sequences (for search and indexing), managing keyboard selections,
    computing line breaks, and handling directionality. Now consider the
    even greater impact of the legacy use of SPACE as a normalizable
    padding whitespace (a key feature of SGML, HTML and XML): the
    legacy use of SPACE+NSM causes too many problems to satisfy
    authors, who in some cases cannot use it at all because it does
    not work as expected. Because of these problems, authors resort to
    even worse hacks, such as putting a control before the NSM, even though
    this creates a "defective" combining sequence, the dotted circle is
    sometimes displayed, and the text is parsed with an invisible but still
    additional grapheme cluster for the control itself, whose presence is
    pollution.
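
    The whitespace problem described above can be sketched in a few lines
    of Python. The function normalize_space() is a hypothetical stand-in
    for the kind of whitespace collapsing and trimming that SGML, HTML and
    XML tools apply; it is an assumption made for illustration, not the
    exact algorithm of any particular parser. The naive word split is
    likewise only a rough model of a search indexer.

```python
import re
import unicodedata


def normalize_space(text):
    """Collapse runs of space, tab, CR and LF to one SPACE, then trim."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")


def starts_with_combining_mark(text):
    """True if the text begins with a combining mark (category M*)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")


# SPACE + COMBINING ACUTE ACCENT used as an "isolated" diacritic.
isolated_text = " \u0301 is the acute accent"

# Trimming removes the SPACE that served as the base character, so the
# combining mark is left at the start of the text with no base at all.
normalized = normalize_space(isolated_text)
print(starts_with_combining_mark(normalized))

# A naive word extractor separates the mark from its SPACE base and
# yields the bare combining mark as a "word" of its own.
words = "the \u0301 accent".split(" ")
print([w.encode("unicode_escape").decode("ascii") for w in words])
```

    In both cases the SPACE is consumed as ordinary whitespace and the
    combining mark ends up without any base character.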

    Instead of forcing authors to use defective combining sequences like
    control+NSM, which is an even worse hack, why not designate
    a clean and pure invisible base character with the required properties,
    so that it creates a non-defective combining sequence for the isolated
    diacritic(s)?

    So the question is which invisible base character(s) to define, and with
    which properties:
    - An invisible symbolic base character (Sk) with neutral directionality (I
    called it an INVISIBLE SYMBOL);
    - An invisible letter base character (Lo) with neutral directionality (you call
    it an ACCENT ANCHOR, and I called it an INVISIBLE LETTER); or
    - An invisible letter base character (Lo) with LTR directionality, and
    - An invisible letter base character (Lo) with RTL directionality.

    Personally, I find the term ACCENT ANCHOR ambiguous: it does not
    indicate the usage precisely (it fits better the current ambiguous
    use of SPACE as an anchor for accents), and it seems to restrict
    the kind of diacritic or other combining mark that may (should?) be
    applied to it. In addition, nothing would forbid combining several
    diacritics or marks on this base character.

    Consider, then, that these new characters are better base characters
    than SPACE, and define them with only a compatibility decomposition to
    SPACE, to match the previous encoding. If those new base characters
    are used without diacritics, they will be shown like the glyph for NBSP,
    and not necessarily as zero-width (there's no requirement for these
    invisible symbols to be zero-width in all cases, as they are a more precise
    substitute for the legacy SPACE, but without the associated whitespace
    properties). With these new characters, there is no need to change the
    rules in the various UAXes and other Unicode algorithms.
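
    For comparison, a short Python sketch of how characters that already
    exist carry a compatibility decomposition to SPACE, which is the
    mechanism suggested above for the new base characters. It only
    inspects NO-BREAK SPACE and the spacing ACUTE ACCENT through the
    standard unicodedata module; the proposed invisible base characters
    are not encoded, so nothing here describes them directly.

```python
import unicodedata

NBSP = "\u00a0"          # NO-BREAK SPACE
ACUTE = "\u00b4"         # ACUTE ACCENT (spacing)
COMBINING_ACUTE = "\u0301"

# Raw decomposition fields from the Unicode Character Database.
nbsp_mapping = unicodedata.decomposition(NBSP)
acute_mapping = unicodedata.decomposition(ACUTE)
print(nbsp_mapping)
print(acute_mapping)

# Canonical normalization (NFC) leaves the spacing accent untouched,
# while compatibility normalization (NFKD) maps it to SPACE + NSM.
canonical = unicodedata.normalize("NFC", ACUTE)
compat = unicodedata.normalize("NFKD", ACUTE)
print(canonical == ACUTE)
print(compat == " " + COMBINING_ACUTE)

# NBSP used as a base folds to the legacy SPACE + NSM sequence as well.
folded = unicodedata.normalize("NFKC", NBSP + COMBINING_ACUTE)
print(folded == " " + COMBINING_ACUTE)
```

    A new base character defined the same way would stay distinct under
    canonical normalization and would fold to the legacy SPACE+NSM
    sequence only under compatibility normalization.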

    Spam not tolerated: any unsolicited message will be
    reported to your Internet service providers.