From: Philippe Verdy (firstname.lastname@example.org)
Date: Wed Aug 06 2003 - 09:32:37 EDT
On Wednesday, August 06, 2003 12:36 PM, Kent Karlsson <email@example.com> wrote:
> > The NFD decompositions of spacing marks are already defined as a SPACE
> > plus a non-spacing combining character.
> Philippe, please! Those are *compatibility* decompositions. The
> normal form NFD only uses *canonical* decompositions. And there is no
> such thing as "NFD decompositions".
Sorry for the confusion. Still, even with an NFKD decomposition, it is clear
that these mappings already define combining sequences with SPACE used as
the base character. The important point is that SPACE is already the base
character used as a combining mark holder, and Unicode processing should
never break in the middle of a combining sequence, even when the base
character is a SPACE.
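
To make the distinction concrete, here is a minimal sketch using Python's
standard unicodedata module (the choice of Python and the helper name
decompositions are mine, purely for illustration). It shows that NFD leaves
a spacing diacritic untouched, while NFKD maps it to SPACE followed by the
non-spacing combining mark.

```python
import unicodedata

# A few spacing diacritics whose UCD entry carries a <compat> mapping
# to SPACE + combining mark.
SPACING_MARKS = ["\u00A8", "\u00AF", "\u00B4", "\u00B8"]


def decompositions(ch):
    """Return (NFD, NFKD) forms of a single character (illustrative helper)."""
    return (unicodedata.normalize("NFD", ch),
            unicodedata.normalize("NFKD", ch))


def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


for ch in SPACING_MARKS:
    nfd, nfkd = decompositions(ch)
    print(f"{unicodedata.name(ch):<14} NFD: {code_points(nfd):<8} "
          f"NFKD: {code_points(nfkd)}")
```

For U+00B4 ACUTE ACCENT, for instance, the NFD form is the character itself
and the NFKD form is U+0020 U+0301.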
It's true that not all (only most) non-spacing combining characters have a
spacing, non-combining counterpart. But where one exists, the decomposition
given in the UCD already indicates that the SPACE character should be
preserved and not treated as a break opportunity when it is followed by a
combining character. The specification of the break properties, where
sequences of spaces are often unified, is not entirely clear on this, but
there are already rules that make it clear: a SPACE is a word separator
only if it is not used in a combining sequence, and break opportunities are
computed between grapheme clusters, so they cannot fall inside a combining
sequence.
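
The rule can be sketched as follows. This is only a simplified model, not
the full grapheme cluster algorithm of the Unicode standard: it attaches
every character of general category Mn or Me to the preceding character,
and then treats a SPACE as a word separator only when it stands alone. The
function names are hypothetical.

```python
import unicodedata


def combining_sequences(text):
    """Group each base character with the combining marks that follow it.

    Simplification: only general categories Mn and Me count as combining.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch  # extend the current combining sequence
        else:
            clusters.append(ch)  # start a new sequence with this base
    return clusters


def split_words(text):
    """Split on SPACE, but only where the SPACE is not a combining base."""
    words, current = [], ""
    for cluster in combining_sequences(text):
        if cluster == " ":  # a lone SPACE: a real word separator
            if current:
                words.append(current)
            current = ""
        else:  # includes SPACE + combining mark, kept intact
            current += cluster
    if current:
        words.append(current)
    return words


# a, separator SPACE, <SPACE, COMBINING ACUTE ACCENT>, separator SPACE, b
sample = "a  \u0301 b"
print(combining_sequences(sample))
print(split_words(sample))
```

With this input the SPACE carrying the accent survives as its own "word",
while the two lone spaces act as separators.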
OK, there's a problem with HTML, where sequences of whitespace characters
are collapsed to a single space. This creates a problem if a combining
character follows two spaces, the first being a word separator or
indenting space and the second being the base of the combining sequence.
For now, most text can be written using spacing diacritics instead of
combining sequences starting with SPACE, and this will work in HTML.
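
A small sketch of the problem, assuming a deliberately simplified model of
HTML whitespace collapsing (a regular expression over SPACE, TAB, CR and
LF only; real processors apply more context-dependent rules). The function
name collapse_whitespace is hypothetical.

```python
import re


def collapse_whitespace(text):
    """Collapse runs of SPACE, TAB, CR, LF to one SPACE (simplified model)."""
    return re.sub(r"[ \t\r\n]+", " ", text)


# a, separator SPACE, <SPACE, COMBINING ACUTE ACCENT>, b
with_combining = "a  \u0301b"
# a, separator SPACE, spacing ACUTE ACCENT (U+00B4), b
with_spacing = "a \u00B4b"

collapsed = collapse_whitespace(with_combining)
# Only one SPACE is left, so separator and base can no longer be told apart.
print(len(with_combining), "->", len(collapsed))

# The spacing diacritic contains no run of spaces and passes through as is.
print(collapse_whitespace(with_spacing) == with_spacing)
```

After collapsing, the first string keeps a single SPACE that now has to
serve both as separator and as base, which is exactly the ambiguity
described above.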
For those diacritics which do not have a spacing counterpart already
defined, there remains a problem which can only be solved by placing a
separating format control between the first (separating) space and the
second (base) space. I think this could be a ZWSP:
...<SPACE>, <ZWSP>, <SPACE, COMBINING-ACUTE-ACCENT>...
This works provided that the whitespace normalization algorithm does not
include <ZWSP> in the whitespace sequence but treats it separately. A
conforming HTML or XML processor should not include it, as it should
unify only sequences of <SPACE>, <TAB>, <CR>, <LF>, and only according to
the whitespace properties of the containing element that control the
normalization of XML whitespace sequences (leading, trailing, line break
preservation, tabulator).
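
As a rough check of the idea, the sketch below applies a simplified
collapsing step (a regular expression over SPACE, TAB, CR and LF only,
which is an assumption about how a processor behaves, not a description of
any particular one) to a sequence protected by ZWSP. The names
collapse_whitespace and ZWSP are mine.

```python
import re

ZWSP = "\u200B"  # ZERO WIDTH SPACE


def collapse_whitespace(text):
    """Collapse runs of SPACE, TAB, CR, LF to one SPACE (simplified model)."""
    return re.sub(r"[ \t\r\n]+", " ", text)


# a, <SPACE>, <ZWSP>, <SPACE, COMBINING ACUTE ACCENT>, b
protected = "a " + ZWSP + " \u0301b"

result = collapse_whitespace(protected)

# ZWSP is not in the collapsed set, so it splits the run of spaces in two
# and both the separating SPACE and the base SPACE survive.
print(result == protected)
print(result.count(" "))
```

Under this model the protected sequence is left unchanged. If a processor
did count ZWSP as whitespace, the two spaces would merge and the trick
would fail.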
I did not verify this completely in XSLT, but it should hold there too
for this kind of processing (hoping that ZWSP will not be treated as
part of whitespace sequences).
-- Philippe. Spam is not tolerated: any unsolicited message will be reported to your Internet service providers.
This archive was generated by hypermail 2.1.5 : Wed Aug 06 2003 - 10:27:49 EDT