Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

From: Philippe Verdy
Date: Fri Aug 08 2003 - 06:12:40 EDT


    On Thursday, August 07, 2003 8:06 PM, Peter Kirk wrote:

    > On 06/08/2003 15:47, Philippe Verdy wrote:
    > > On Wednesday, August 06, 2003 11:48 PM, Peter Kirk wrote:
    > >
    > >
    > >
    > > > OK, what kind of markup should I use, in any well-known markup
    > > > language, to ensure that an isolated diacritic is centred in the
    > > > space between the words before and after it?
    > > >
    > > >
    > >
    > > In plain text, I think that this encoding:
    > > ...endOfWord1, SPACE, SPACE, diacritic, SPACE,
    > > startOfWord2...
    > > is what you need, as it creates the following combining sequences:
    > > <...endOfWord1>, <SPACE>, <SPACE, diacritic>, <SPACE>,
    > > <startOfWord2...>
    > >
    > >
    > Thank you, Philippe. This is where we started. But I noted that some
    > current implementations render the space diacritic combination as a
    > full-width space with the diacritic not centred over it. I suggested
    > that this was wrong, that the diacritic should be centred. Doug
    > suggested I use markup outside the scope of Unicode.
    > > ...
    > >
    > > Another similar case would be the use of an isolated nukta (which
    > > normally modifies a following base character): the sequence
    > > <nukta, SPACE> is a single combining sequence with a break
    > > opportunity. So a sequence like <nukta, SPACE, acute accent>
    > > would be unbreakable but would include a break opportunity at its
    > > end, unless it is followed by a NBSP.
    > > And the sequence <nukta, NBSP, acute accent> would also be
    > > unbreakable either in the middle or on both ends.
    > >
    > >
    > >
    > Tell me more about these nuktas which modify a FOLLOWING base
    > character.
    > This is just what I have been told is illegal, non-conformant or
    > something. But if this is allowed for nuktas, why shouldn't it be
    > allowed for Hebrew holam?
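
    As an aside, the combining sequences in the quoted proposal above
    can be checked with a minimal Python sketch. It assumes a simplified
    rule (one character followed by any marks of category Mn, Mc or Me),
    not the full grapheme cluster rules, and it uses U+0301 as a stand-in
    for "diacritic". The function name combining_sequences is
    hypothetical.

```python
import unicodedata

def combining_sequences(text):
    """Split text into simple combining sequences: one character
    followed by any combining marks (general category Mn, Mc or Me).
    This is a simplification, not the full grapheme cluster rules."""
    sequences = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            # A mark attaches to whatever character precedes it.
            sequences[-1] += ch
        else:
            sequences.append(ch)
    return sequences

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, standing in for "diacritic"

# ...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2...
sample = "ab" + "  " + ACUTE + " " + "cd"
parts = combining_sequences(sample)
print([ascii(p) for p in parts])
```

    Under that assumption the second SPACE and the diacritic form one
    sequence, with a plain SPACE on each side of it.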

    Sorry, I should have checked my code to see exactly which character
    combines with the following base character. In fact there is already
    a special treatment for the nukta, which gets swapped internally in
    front of its base character for glyph processing, and this was a
    source of confusion for me. Yes, nuktas have CC=7 and combine with
    the previous base character, but only in the standard Unicode
    encoding sequence; this is not so in all legacy codepages, nor in
    some other text processes that put the nukta in front.
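
    A minimal Python sketch of that distinction follows. The property
    lookup uses the standard unicodedata module; the two helper
    functions, nukta_first and nukta_last, are hypothetical and only
    illustrate an internal "nukta in front" order of the kind mentioned
    above. They are not taken from any actual renderer.

```python
import unicodedata

NUKTA = "\u093C"  # DEVANAGARI SIGN NUKTA
KA = "\u0915"     # DEVANAGARI LETTER KA

# In Unicode the nukta is a nonspacing mark with combining class 7.
print(unicodedata.category(NUKTA), unicodedata.combining(NUKTA))

def nukta_first(text):
    """Hypothetical internal form: move each nukta in front of the
    character it follows (not a valid Unicode interchange order)."""
    out = []
    for ch in text:
        if ch == NUKTA and out:
            out.insert(len(out) - 1, ch)
        else:
            out.append(ch)
    return "".join(out)

def nukta_last(text):
    """Inverse of nukta_first: restore the base-then-nukta order."""
    out = []
    pending = False
    for ch in text:
        if ch == NUKTA and not pending:
            pending = True      # hold the nukta until its base arrives
        else:
            out.append(ch)
            if pending:
                out.append(NUKTA)
                pending = False
    if pending:
        out.append(NUKTA)       # trailing nukta with no base after it
    return "".join(out)

standard = KA + NUKTA            # standard Unicode order
internal = nukta_first(standard)
print(ascii(standard), ascii(internal), ascii(nukta_last(internal)))
```

    Only the first order is what Unicode text should contain; the
    swapped order is assumed here to exist purely inside a processing
    step.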

    In fact, I may have been talking about the Candrabindu, which is a
    combining mark with CC=230 (above?), except in the Devanagari,
    Bengali, Gujarati and Oriya scripts, where it is combining but is
    treated like a base character (CC=0), and in Telugu and Gurmukhi
    (Adak Bindi), where it is Mc instead of Mn and is not combining.

    This reflects a different usage of the Candrabindu in ISCII, and it is
    a source of difficulty when transcoding from ISCII to Unicode...
    And I'm not sure whether CC=230 for the Tibetan Candrabindu is really
    accurate, given its specific combining model.

    The treatment of Anusvara (or Tibetan JeSuNgaRo, Gurmukhi Bindi
    or Sinhala Anusvaraya) as a combining character with CC=0 is also
    script-specific, as it is either Mc or Mn. The same may be said
    of the Visarga signs (or Sinhala Visargaya).
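
    The general category and combining class of these signs can be
    listed with Python's unicodedata module, as in the sketch below. The
    selection of code points is only an illustrative sample, the helper
    name describe is hypothetical, and the values printed come from the
    Unicode version bundled with the Python in use, which may differ
    from the data current when this was written.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical combining class)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

# An illustrative sample of candrabindu, anusvara and visarga signs.
SIGNS = [
    "\u0310",  # generic combining candrabindu
    "\u0901",  # Devanagari candrabindu
    "\u0902",  # Devanagari anusvara
    "\u0903",  # Devanagari visarga
    "\u0981",  # Bengali candrabindu
    "\u0C01",  # Telugu candrabindu
    "\u0F7E",  # Tibetan rjes su nga ro
    "\u0F83",  # Tibetan sna ldan
]

for ch in SIGNS:
    name, category, ccc = describe(ch)
    print(f"U+{ord(ch):04X}  {category}  ccc={ccc:<3d}  {name}")
```

    Comparing the category column (Mn or Mc) with the ccc column shows
    which signs are marks with class 0 and which have a positional
    class such as 230.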

    Such special treatment is not needed for the Viramas (CC=9), as they
    behave more or less like a standard vowel sign, i.e. a regular diacritic.
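
    One consequence of the non-zero classes can be sketched in Python:
    because the nukta (CC=7) and the virama (CC=9) both have non-zero
    combining classes, normalization reorders them canonically, as it
    does for ordinary diacritics. The sample sequence is an assumption
    chosen for illustration, not text from a real corpus.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
NUKTA = "\u093C"   # DEVANAGARI SIGN NUKTA, combining class 7
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, combining class 9

print(unicodedata.combining(NUKTA), unicodedata.combining(VIRAMA))

# Marks typed in the "wrong" order: virama before nukta.
scrambled = KA + VIRAMA + NUKTA

# Canonical ordering sorts adjacent marks by combining class,
# so the nukta (7) moves ahead of the virama (9).
ordered = unicodedata.normalize("NFD", scrambled)
print(ascii(scrambled), "->", ascii(ordered))
```

    Marks with CC=0, such as the signs discussed earlier, are never
    moved by this reordering, which is why they need separate handling.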

    The original encoding model for Indian scripts has left a lot of legacy
    text resources coded in ISCII, with a unified model that Unicode treats
    more or less specially, but with its own difficulties (we can ignore the
    ISCII font controls, or we can consider other ISCII control signs to
    manage it like ISO 2022, with script switch controls, which are not
    encoded in Unicode). Despite what the Unicode reference documents in
    its specific chapter on Brahmic scripts, there's little help here to
    avoid confusion, notably because the same chapter covers scripts that
    have been encoded with distinct character models (notably Thai and
    Lao).

    For now the current text in Unicode 3 does not seem very helpful for
    disambiguating things, and I hope that this chapter about Indic
    scripts will be greatly enhanced to cover the actual usages, and
    that Thai and Lao will be discussed separately from the other
    Indic scripts. For now, I think that the ISCII and TIS620 standards
    are much more precise and helpful than the Unicode reference
    for the scripts they cover, which they encode in a different way,
    with lots of conversion caveats left unexplained.

    At first read this chapter seems to make prominent reference to
    ISCII and TIS620, but there are some "quirks" where both references
    seem to contradict the actual usage of combining sequences. For
    these, new Unicode properties should be added and specified, even
    if, for stability reasons, neither the combining classes nor the
    normalized forms can be changed, whether those forms are considered
    canonically equivalent, or distinct when in reality they combine the
    same way and one form is considered "normal" while the others are
    non-standard or defective according to the original ISCII or TIS620
    standard.

    Spam not tolerated: any unsolicited message will be
    reported to your Internet service providers.

    This archive was generated by hypermail 2.1.5 : Fri Aug 08 2003 - 06:57:40 EDT