From: Philippe Verdy (email@example.com)
Date: Fri Aug 08 2003 - 06:12:40 EDT
On Thursday, August 07, 2003 8:06 PM, Peter Kirk <firstname.lastname@example.org> wrote:
> On 06/08/2003 15:47, Philippe Verdy wrote:
> > On Wednesday, August 06, 2003 11:48 PM, Peter Kirk
> > <email@example.com> wrote:
> > > OK, what kind of markup should I use, in any well-known markup
> > > language, to ensure that an isolated diacritic is centred in the
> > > space between the words before and after it?
> > >
> > >
> > In plain text, I think that this encoding:
> > ...endOfWord1, SPACE, SPACE, diacritic, SPACE,
> > startOfWord2...
> > is what you need, as it creates the following combining sequences:
> > <...endOfWord1>, <SPACE>, <SPACE, diacritic>, <SPACE>,
> > <startOfWord2...>
> Thank you, Philippe. This is where we started. But I noted that some
> current implementations render the space diacritic combination as a
> width space with the diacritic not centred over it. I suggested that
> this was wrong, that the diacritic should be centred. Doug suggested I
> use markup outside the scope of Unicode.
> > ...
> > Another similar case would be the use of an isolated nukta (which
> > normally modifies a following base character): the sequence
> > <nukta, SPACE> is a single combining sequence with a break
> > opportunity. So a sequence like <nukta, SPACE, acute accent>
> > would be unbreakable but would include a break opportunity at its
> > end, unless it is followed by a NBSP.
> > And the sequence <nukta, NBSP, acute accent> would also be
> > unbreakable either in the middle or on both ends.
> Tell me more about these nuktas which modify a FOLLOWING base
> character.
> This is just what I have been told is illegal, non-conformant or
> something. But if this is allowed for nuktas, why shouldn't it be
> allowed for Hebrew holam?
Sorry, I should have checked my code to see exactly which character
combines with the following base character. In fact there is already
special treatment for the nukta, which is internally swapped in front
of its base character for glyph processing, and this was a source of
confusion for me. (Yes, nuktas have CC=7 and combine with the
previous base character, but only in the standard Unicode encoding
sequence, not in all legacy codepages, and not in some other text
processes that put the nukta in front.)
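
To check the encoded order and the combining class mentioned above,
here is a minimal sketch, assuming Python and its standard unicodedata
module (this is not the code referred to above, and the values come
from whichever Unicode version ships with the interpreter). It reads
the combining class of the Devanagari nukta and decomposes the
precomposed letter QA to see where the nukta sits relative to its base.

```python
import unicodedata

KA = "\u0915"     # DEVANAGARI LETTER KA
NUKTA = "\u093C"  # DEVANAGARI SIGN NUKTA
QA = "\u0958"     # DEVANAGARI LETTER QA (precomposed form)

# Canonical combining class of the nukta.
nukta_class = unicodedata.combining(NUKTA)

# Canonical decomposition of QA: the base letter comes first and the
# nukta follows it in the encoded sequence.
qa_decomposed = unicodedata.normalize("NFD", QA)

print(nukta_class)
print([f"U+{ord(c):04X}" for c in qa_decomposed])
```

Any reordering that puts the nukta in front of its base, as described
above for glyph processing, would then be an internal step applied
after this encoded order, not something visible in the text itself.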
In fact, I may have been thinking of the Candrabindu, which is
combining with CC=230 (above?), except in the Devanagari, Bengali,
Gujarati and Oriya scripts, where it is combining but has CC=0 like a
base character, and in Telugu and Gurmukhi (Adak Bindi), where it is
Mc instead of Mn and is not combining.
This reflects a different usage of the Candrabindu in ISCII, and this is
a source of difficulty when transcoding from ISCII to Unicode...
And I'm not sure if the CC=230 for the Tibetan Candrabindu is really
consistent with its specific combining model.
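
The per-script differences can be inspected directly. The following is
a minimal sketch, assuming Python's standard unicodedata module; the
choice of code points (in particular U+0F83 for the Tibetan sign) is an
assumption made for illustration. The values printed come from the
Unicode data bundled with the interpreter, which is newer than
Unicode 3, so they may not match every statement above.

```python
import unicodedata

# Candrabindu-like signs from the scripts discussed above. This list is
# an illustrative selection, not an authoritative inventory.
CANDRABINDUS = [
    "\u0310",  # COMBINING CANDRABINDU (generic)
    "\u0901",  # DEVANAGARI SIGN CANDRABINDU
    "\u0981",  # BENGALI SIGN CANDRABINDU
    "\u0A81",  # GUJARATI SIGN CANDRABINDU
    "\u0B01",  # ORIYA SIGN CANDRABINDU
    "\u0C01",  # TELUGU SIGN CANDRABINDU
    "\u0A01",  # GURMUKHI SIGN ADAK BINDI
    "\u0F83",  # Tibetan sign (assumed to be the one meant above)
]


def describe(ch):
    """Return (code point, name, general category, combining class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


for row in map(describe, CANDRABINDUS):
    print(*row, sep="\t")
```

The third column (Mn or Mc) and the fourth (the combining class) are
the two properties being compared in this discussion.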
The treatment of Anusvara (or Tibetan JeSuNgaRo or Gurmukhi Bindi
or Sinhala Anusvaraya) as a combining character with CC=0 is also
script-specific, as it is either Mc or Mn. The same may be said of
the Visarga signs (or Sinhala Visargaya).
Such special treatment is not needed for the Viramas (CC=9), as they
more or less behave like a standard vowel sign, i.e. a regular diacritic.
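
One visible consequence of a non-zero combining class is canonical
reordering. Here is a minimal sketch, assuming Python's standard
unicodedata module and using Devanagari code points chosen only for
illustration: a virama (CC=9) typed before a nukta (CC=7) is moved
behind it by normalization, while an anusvara (CC=0) in the same
position blocks any reordering.

```python
import unicodedata

KA = "\u0915"        # DEVANAGARI LETTER KA
NUKTA = "\u093C"     # DEVANAGARI SIGN NUKTA, CC=7
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA, CC=9
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA, CC=0


def nfd(text):
    """Apply canonical decomposition, including canonical reordering."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in text]


# Marks with non-zero classes are sorted by class: 7 before 9.
virama_then_nukta = nfd(KA + VIRAMA + NUKTA)

# A mark with class 0 is a barrier, so the order is left untouched.
anusvara_then_nukta = nfd(KA + ANUSVARA + NUKTA)

print(code_points(virama_then_nukta))
print(code_points(anusvara_then_nukta))
```

This only shows how the combining classes act under normalization; it
says nothing about how a renderer shapes either sequence.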
The Indian scripts have a lot of legacy text resources coded in
ISCII, whose unified model Unicode treats more or less specially,
but with its own difficulties (we can ignore the ISCII font controls,
or we can handle the other ISCII control signs the way ISO 2022
handles script-switch controls, which are not encoded in Unicode).
Despite what the Unicode reference documents in its chapter on
Brahmic scripts, there is little help there to avoid confusion,
notably because the same chapter covers scripts that have been
encoded with distinct character models (notably Thai and Lao).
For now the current text in Unicode 3 does not seem very helpful for
disambiguating things, and I hope that this chapter about Indic
scripts will be greatly enhanced to cover actual usage, and that
Thai and Lao will be discussed separately from the other Indic
scripts. For now, I think that the ISCII and TIS620 standards are
much more precise and helpful than the Unicode reference for the
scripts they cover, which they cover in a different way, with lots
of conversion caveats left unexplained. At first read this chapter
seems to make prominent reference to ISCII and TIS620, but there are
some "quirks" where both references seem to contradict the actual
usage of combining sequences. New Unicode properties should be added
and specified for these, even if combining classes cannot be changed
for stability reasons, and neither can normalized forms that are
considered canonically equivalent, or distinct, when in reality they
combine the same way and one form is considered "normal" and the
others non-standard or defective according to the original ISCII or
TIS620 standard.
-- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
This archive was generated by hypermail 2.1.5 : Fri Aug 08 2003 - 06:57:40 EDT