Re: XML and Unicode interoperability comes before HTML or even SGML (was: Combining across markup?)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Aug 14 2004 - 05:35:28 CDT

Next message: Anto'nio Martins-Tuva'lkin: "Re: [mo/mol] and [ro/ron/rum]"

Previous message: Philippe Verdy: "Re: [hebrew] ZW(N)J usage, why D17 changed the character model (was: Holam discussion and decision at UTC)"
In reply to: Anto'nio Martins-Tuva'lkin: "Re: Combining across markup?"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: XML and Unicode interoperability comes before HTML or even SGML (was: Combining across markup?)"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: XML and Unicode interoperability comes before HTML or even SGML (was: Combining across markup?)"
Reply: Doug Ewell: "Re: XML and Unicode interoperability comes before HTML or even SGML"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

From: "Anto'nio Martins-Tuva'lkin" <antonio@tuvalkin.web.pt>
> On 2004.08.11, 18:58, Mike Ayers <mike.ayers@tumbleweed.com> wrote:
>
> > Better yet, have a generic mechanism which allows you to build
>
> Even better yet: Have the W3C rephrase their demand that no element
> should start with a defective sequence (when considered separately)
> as that no *block-level* element should etc., and leave things like
> <span>, <i> and other in-line elements free to start with a combining
> character (provided that the said in-line container is not the first
> within a block-level element, of course).

Unfortunately, the idea of "block-level" and "inline" elements is specific
to HTML, but HTML today is an application of XML, and the problem must be
solved at the XML level.
The only safe way to solve it at the XML level is to make the use of NCRs
or named character references highly recommended (if not mandatory)
for the first character of a defective combining sequence. A solution
based on a closed list of exceptions will not keep working as the
Unicode/ISO/IEC 10646 repertoire of combining characters evolves.

Simply because, for both Unicode and ISO/IEC 10646, the character
model includes the fact that ANY base character forms a combining
character sequence with ANY following combining character or ZW(N)J
character.
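
To make the NCR rule concrete, here is a minimal Python sketch of what a
serializer could do with the text content of an element. The function names
are mine, not from any XML library, and the test for "combining" (general
category M*, plus ZWNJ/ZWJ) is an assumption based on the model described
above. It escapes the whole leading run of combining characters rather than
only the first one, a conservative choice so that no mark ends up right
after the ";" of a reference. It does not escape "&" or "<"; a real
serializer would do that separately.

```python
import unicodedata

# ZWNJ and ZWJ: treated as combining here, following the discussion above.
ZWJ_ZWNJ = {"\u200c", "\u200d"}


def is_combining(ch: str) -> bool:
    """True if ch would attach to a preceding base character."""
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M") or ch in ZWJ_ZWNJ


def escape_leading_combining(text: str) -> str:
    """Write the leading run of combining characters as hexadecimal NCRs."""
    i = 0
    while i < len(text) and is_combining(text[i]):
        i += 1
    escaped = "".join(f"&#x{ord(c):X};" for c in text[:i])
    return escaped + text[i:]


# U+0301 COMBINING ACUTE ACCENT would otherwise follow the ">" of <i>.
print("<i>" + escape_leading_combining("\u0301bc") + "</i>")
# prints: <i>&#x301;bc</i>
```

Text that starts with a base character is returned unchanged, so the
function can be applied to every text node and attribute value on output.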

The change in D17 of Unicode has also extended the risk of conflicts,
because ZW(N)J can now *also* combine with and follow:

  - a quote mark (used to delimit the start and end of an attribute value),
  - the second minus/hyphen of an XML comment's leading mark,
  - the second opening square bracket of a CDATA leading mark.

The D17 change in Unicode is already creating another interoperability
problem with XML, because in the past ZW(N)J were treated only as
combining sequence *starters* (now they also occur in the middle
of a non-defective combining sequence, or at end of a defective
combining sequence).

One suggestion for this problem is that the change of D17 should be
ignored at the XML parsing level (yes it breaks the new Unicode
character model, but this is not critical to preserve the document
parsed tree data).

The other suggestion for Unicode is to include in the stability policy
some words saying that there will be NO compatibility characters added
with a canonical equivalence to a combining sequence ended by a
ZW(N)J character. (I think or hope this will never happen, as this
should never be needed for roundtrip compatibility...)

All these cases must be first solved at the XML document syntax level,
because these restrictions DO NOT apply at the DOM-tree level where
all combining sequences, defective or not, are VALID in *both*
XML and Unicode!

Using NCRs can solve some of these problems when they occur within
element attribute values or in the text content of elements. This won't
work within CDATA sections (the CDATA section needs to be closed first,
and the NCR coded outside of this section...)
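
The CDATA case can be sketched the same way. The Python function below is
hypothetical (not part of any XML library) and only handles the situation
where the section would start with combining characters, which would
otherwise follow the second "[" of the leading mark: those characters are
written as NCRs before the section instead of inside it. The "combining"
test (category M*, plus ZWNJ/ZWJ) is again an assumption.

```python
import unicodedata

ZWJ_ZWNJ = {"\u200c", "\u200d"}


def is_combining(ch: str) -> bool:
    """True for combining marks (Mn, Mc, Me) and for ZWNJ/ZWJ."""
    return unicodedata.category(ch).startswith("M") or ch in ZWJ_ZWNJ


def cdata_with_safe_start(text: str) -> str:
    """Serialize text as CDATA, moving leading combining characters out as NCRs."""
    if "]]>" in text:
        # Splitting around "]]>" is a separate problem, not handled here.
        raise ValueError("text contains the CDATA end mark")
    i = 0
    while i < len(text) and is_combining(text[i]):
        i += 1
    # NCRs are not recognized inside CDATA, so they go before the section.
    prefix = "".join(f"&#x{ord(c):X};" for c in text[:i])
    rest = text[i:]
    return prefix + (f"<![CDATA[{rest}]]>" if rest else "")


print(cdata_with_safe_start("\u0301a<b"))
# prints: &#x301;<![CDATA[a<b]]>
```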

Once all this is solved, the recommendation will apply to
XHTML, and can further be applied retroactively to HTML 4.01 and
earlier.

The only safeguard of XML regarding these cases of ambiguity is that
an XML document generator MUST NOT create documents that contain
unassigned Unicode characters. This means that an XML document
generator (or editor) MUST know which version of Unicode it
works with, so that it will accept only characters that are known to be
or not to be combining characters. This way, XML document
generators or editors will know exactly how to delimit combining
sequences, and which ones are defective or not, so that they will
use NCRs appropriately.
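
As an illustration of "knowing the version", Python's unicodedata module
exposes the version of its character database as unidata_version, and
reports unassigned code points with general category "Cn". The checking
function below is a hypothetical sketch of a generator-side guard, not a
rule taken from the XML specification.

```python
import unicodedata

# The Unicode version this generator "knows", e.g. "15.0.0".
KNOWN_UNICODE_VERSION = unicodedata.unidata_version


def check_assigned(text: str) -> str:
    """Return text unchanged, or raise if it holds code points unknown to us."""
    # Category "Cn" covers unassigned code points (and noncharacters).
    unknown = [f"U+{ord(c):04X}" for c in text if unicodedata.category(c) == "Cn"]
    if unknown:
        raise ValueError(
            f"not assigned in Unicode {KNOWN_UNICODE_VERSION}: {', '.join(unknown)}"
        )
    return text


print(check_assigned("e\u0301"))  # assigned base + combining mark: accepted
```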

This also means that an XML editor that loads a valid XML document
in which NCRs are present must not transform those NCRs back to plain-text
characters without knowing the characters. An XML generator or editor
that does not know the "combining" property of a character found in
an input document, according to its own known version of Unicode, will then
need to save the edited or regenerated document using NCRs for
every character that it does not know!
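
A rough Python sketch of that editor behaviour, with hypothetical function
names: NCRs are decoded on load only when the character is known, and
unknown characters are written back as NCRs on save. Two extra rules are my
own assumptions: references to markup-significant characters are kept as
they are, and so are references to combining characters, since they may be
the escaped start of a defective sequence.

```python
import re
import unicodedata

# Decimal or hexadecimal numeric character reference.
NCR = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def decode_known_ncrs(source: str) -> str:
    """Decode NCRs only for characters this Unicode version knows and can show safely."""

    def repl(m: re.Match) -> str:
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if cp > 0x10FFFF:
            return m.group(0)  # not a code point at all: leave untouched
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat in ("Cn", "Cs"):
            return m.group(0)  # unassigned (unknown to us) or surrogate
        if cat.startswith("M") or ch in "\u200c\u200d":
            return m.group(0)  # combining: may start a defective sequence
        if ch in "<>&\"'":
            return m.group(0)  # decoding would change the markup
        return ch

    return NCR.sub(repl, source)


def encode_unknown_as_ncrs(text: str) -> str:
    """On save, write every unassigned code point as a hexadecimal NCR."""
    return "".join(
        f"&#x{ord(c):X};" if unicodedata.category(c) == "Cn" else c for c in text
    )


print(decode_known_ncrs("&#x41;&#38;&#x301;"))
# prints: A&#38;&#x301;
```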

> That would, IIUC, address W3C's "fears" and would OTOH satisfy the
> need to mark up differently any part of a *text stream*.
>
> (Yes, of course CSS allows block elements to behave as in-line and
> vice-versa, but that would be the user's responsibility...)

You also point out here the problem that block/inline distinctions will
NOT work, notably because these distinctions can also be changed
dynamically using DOM scripting!

This is especially true now with XHTML, which has become completely
modular: a valid XHTML document can completely redefine the
role of nearly all block or inline elements, using a custom DTD or XML
schema definition, rendered with an XSLT transformation engine and
a custom XSLT behavior file that will bind XHTML elements to the
browser's renderer!


This archive was generated by hypermail 2.1.5 : Sat Aug 14 2004 - 05:37:25 CDT