Re: Questions on ZWNBS - for line initial holam plus alef

From: Peter Kirk (peter.r.kirk@ntlworld.com)
Date: Wed Aug 13 2003 - 17:42:19 EDT


    On 13/08/2003 14:07, Philippe Verdy wrote:

    >I did not notice that the discussion about Hebrew holam male was
    >related.
    >In fact I don't know anything about the Hebrew alphabet so I could not
    >understand the semantics discussed, and so did not notice that <holam, vav>
    >was a "defective" encoding (in terms of combining sequences).
    >
    >
    Well, it wasn't very related - although the subject line here "line
    initial holam plus alef" reminds me that it is very near to where we
    started this thread.

    >When using the term "forbidden", it was only related to possible security
    >problems with XML, but the term was certainly too hasty.
    >However, given that possible security and parsing issues do exist, the
    >case of <holam, vav> used to encode "holam-male" may be another
    >argument to propose a neutral/invisible base character for combining
    >characters. For the case of Hebrew, it then needs to have a "letter"
    >behavior, but for the case of other isolated diacritics in Latin, Greek,
    >Cyrillic, and probably also Hiragana, Katakana (voice marks) it would
    >better be handled as a symbol.
    >
    >I suggested several semantics for this invisible character(s) in an
    >earlier message:
    >- An invisible symbol
    >- An invisible LTR letter
    >- An invisible RTL letter
    >all of them having a *compatibility* decomposition (or NFKD form) as
    >a SPACE like other existing spacing combining marks, but not being
    >canonical equivalent of SPACE (to keep separately the legacy semantics,
    >properties, behavior and known caveats unchanged and
    >implementation/usage-dependent, as they are now with SPACE+NSM
    >which could then be discouraged in Unicode and strongly deprecated
    >in SGML/HTML/XML)
    >
    >
    >
    >
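    For reference, the existing spacing marks that the quoted proposal
    takes as its model can be inspected with Python's standard
    unicodedata module. This is only a sketch: the four characters
    listed are my own choice of examples, and the helper name describe
    is made up for illustration. It shows that each has a compatibility
    (not canonical) decomposition to SPACE plus a combining mark, so NFD
    leaves it alone while NFKD rewrites it.

```python
import unicodedata

# Spacing clones of diacritics that already exist in Unicode
# (illustrative selection: DIAERESIS, MACRON, ACUTE ACCENT, CEDILLA).
SPACING_MARKS = ["\u00A8", "\u00AF", "\u00B4", "\u00B8"]


def describe(ch):
    """Return (name, raw decomposition field, NFD form, NFKD form) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.decomposition(ch),
        unicodedata.normalize("NFD", ch),
        unicodedata.normalize("NFKD", ch),
    )


for ch in SPACING_MARKS:
    name, decomp, nfd, nfkd = describe(ch)
    # nfd == ch means the character is not canonically equivalent to anything else.
    print(name, "|", decomp, "|", nfd == ch, "|", [f"U+{ord(c):04X}" for c in nfkd])
```

    A proposed invisible base character would presumably behave the same
    way: untouched by canonical normalization, reduced to SPACE only
    under compatibility normalization.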
    My latest idea is to use RLM as in effect your "invisible RTL letter".
    So I would encode word or line initial holam male as <RLM, holam, vav>.
    This is technically a defective combining sequence (is that correct?),
    as RLM is a format control character, but the RLM has the double effect
    of keeping the holam separate from any spaces which a higher level
    protocol might put there and ensuring RTL directionality. And I suppose
    the same technique would be legal with any combining character. But of
    course it would all be spoiled if XML were to forbid defective combining
    sequences, which fortunately is unlikely. Actually I suppose you could
    use <RLM, space, combining character> or <LRM...> for your spacing
    diacritics as the RLM or LRM would protect the space from combination
    with any previous space etc. Or perhaps <RLM, NBSP, combining
    character>. As RLM effectively disappears in searches etc, in effect
    you have your compatibility decomposition.
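
    As a rough check of the <RLM, holam, vav> idea, here is a minimal
    Python sketch using only the standard unicodedata module. The
    helper names (is_base, defective_marks, strip_format_controls) are
    hypothetical, and the base-character test is a simplification of
    the Unicode definition as I understand it: a graphic character
    (letter, number, punctuation, symbol or space separator) that is
    not itself a mark. It ignores special cases such as ZWJ and ZWNJ.

```python
import unicodedata

RLM = "\u200F"    # RIGHT-TO-LEFT MARK (format control, category Cf)
HOLAM = "\u05B9"  # HEBREW POINT HOLAM (nonspacing mark, category Mn)
VAV = "\u05D5"    # HEBREW LETTER VAV (letter, category Lo)


def is_base(ch):
    """Simplified base-character test: a graphic character that is not a mark."""
    cat = unicodedata.category(ch)
    return cat == "Zs" or cat[0] in "LNPS"


def defective_marks(text):
    """Return indices of combining marks that have no base character to attach to."""
    found = []
    has_base = False
    for i, ch in enumerate(text):
        if unicodedata.category(ch).startswith("M"):
            if not has_base:
                found.append(i)
        else:
            # Any non-mark starts a new context; only a base can carry later marks.
            has_base = is_base(ch)
    return found


def strip_format_controls(text):
    """Drop Cf characters, roughly what a search that ignores RLM/LRM would see."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


holam_male = RLM + HOLAM + VAV
for ch in holam_male:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          unicodedata.bidirectional(ch), unicodedata.name(ch))
print("defective marks at:", defective_marks(holam_male))
print("space + holam defective:", defective_marks(" " + HOLAM))
print("without format controls:",
      [f"U+{ord(c):04X}" for c in strip_format_controls(holam_male)])
```

    Under these assumptions the holam after RLM is reported as having
    no base, whereas a holam after a plain space is not, which is the
    distinction the paragraph above relies on.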

    I note that there is no line break opportunity in <space, NBSP>. But is
    there one after the space in <space, RLM, NBSP>? If so, <RLM, NBSP,
    combining character> has a third advantage, that it gives the right line
    break opportunity when this sequence is word initial, which it wouldn't
    do without the RLM.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

