Re: about P1 part of BIDI algorithm

From: Philippe Verdy <>
Date: Tue, 11 Oct 2011 01:00:19 +0200

2011/10/10 Eli Zaretskii <>:
>>  what's the meaning of 'appropriate Newline Functions' and 'higher-level
>> protocol paragraph determination'?
> Newline Function (NLF) is described in Section 5.8 of Unicode.
> Higher-level protocols are described in section 4.3 of UAX#9.  In a
> nutshell, your application can have its own ideas of what begins and
> what ends a paragraph, and you are allowed to use those rules instead
> of what P3 says.

I interpret the sentence as covering all the non-plain-text
mechanisms available in various file formats or interchange protocols,
such as HTML.
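
For contrast, here is a minimal sketch (in Python) of the plain-text
default that such a higher-level protocol replaces: rule P1 splits the
text after every character whose bidi class is B, and the separator
stays with the paragraph it ends. The function name split_paragraphs
is only illustrative, and handling CR LF as a single separator follows
my reading of UAX#9.

```python
import unicodedata

def split_paragraphs(text):
    """Split plain text after each character of bidi class B (rule P1).

    The separator is kept with the paragraph it terminates, and a CR
    immediately followed by LF is counted as one separator.
    """
    paragraphs = []
    start = 0
    i = 0
    while i < len(text):
        if unicodedata.bidirectional(text[i]) == "B":
            # Treat CR LF as a single paragraph separator.
            if text[i] == "\r" and text[i + 1 : i + 2] == "\n":
                i += 1
            paragraphs.append(text[start : i + 1])
            start = i + 1
        i += 1
    # Trailing text without a final separator is still a paragraph.
    if start < len(text):
        paragraphs.append(text[start:])
    return paragraphs
```

Note that U+2028 LINE SEPARATOR has bidi class WS, not B, so it does
not start a new paragraph under this default rule.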

But even with HTML (and XML as well), you also have to consider how
whitespace behaves: all those newlines and whitespace characters are
collapsed by default, unless an XML attribute (in XHTML) says
otherwise, or a default style is associated with some HTML elements
(for example "pre" elements, where collapsing whitespace is not the
default). Add to this the additional protocol implied by CSS (which
allows a stylesheet to change the whitespace behavior), and the
classification of whitespace cannot be resolved at the level of the
encoded document, but only after it has been parsed, and even been
contextually styled (and this behavior can even be changed

In other words: the rich-text protocol applies its own interpretation
first, and then exposes the document through its internal temporary
state, in which paragraphs (or "blocks") are separated from "inline"
elements and plain-text elements. The Unicode algorithms then apply
only to the many small fragments that each make up just a part of the
document. In many cases, in those formats, you will never see any
newline or paragraph separator in those plain-text elements or
plain-text attribute values. Instead, you will have to deal with the
other out-of-band information exposed by the dynamic DOM, for which
the Unicode standard cannot set a standard, only some guidelines.
There are other specifics that cannot be represented in plain text.
For example, the "<br/>" element does not convert exactly to any
newline or paragraph separator, because the rich-text document has a
more complex structure in which blocks can be nested inside larger
blocks. Plain text cannot clearly indicate how a newline or paragraph
separator restructures the document (the block embedding level, for
instance), so a conversion from rich text to plain text loses that
information completely.
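
To make this concrete, here is a rough sketch (in Python, using only
the standard html.parser module) of a rich-text layer deciding where
paragraphs begin and end before any Unicode algorithm sees the text.
The names BlockSplitter and html_paragraphs and the short BLOCK_TAGS
list are made up for illustration. Treating "br" as a break is a
choice of this sketch, not a rule from any standard, and CSS is
ignored entirely.

```python
from html.parser import HTMLParser
import re

# Assumed, deliberately incomplete list of block-level elements.
BLOCK_TAGS = {"p", "div", "li", "pre", "blockquote", "h1", "h2", "h3"}

class BlockSplitter(HTMLParser):
    """Collect the text of each block as one candidate bidi paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs
        self._buffer = []      # text runs of the paragraph being built
        self._pre_depth = 0    # > 0 while inside a "pre" element

    def _flush(self):
        text = "".join(self._buffer)
        self._buffer = []
        if self._pre_depth == 0:
            # Default HTML behavior: collapse whitespace runs, newlines included.
            text = re.sub(r"\s+", " ", text).strip()
        if text:
            self.paragraphs.append(text)

    def handle_starttag(self, tag, attrs):
        # A block boundary (or "br", by assumption) ends the current paragraph.
        if tag in BLOCK_TAGS or tag == "br":
            self._flush()
        if tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        if tag == "pre" and self._pre_depth > 0:
            self._pre_depth -= 1

    def handle_data(self, data):
        self._buffer.append(data)

def html_paragraphs(markup):
    """Return the text fragments that would each be handed to the bidi algorithm."""
    parser = BlockSplitter()
    parser.feed(markup)
    parser.close()
    parser._flush()  # text after the last tag, if any
    return parser.paragraphs
```

The newline inside the "div" text disappears, while the one inside
"pre" survives, and the nesting of "p" inside "div" is lost in the
flat list that comes out: none of the paragraph boundaries come from a
newline or paragraph separator character in the text itself.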

With the richer set of HTML5 "semantic elements", this is even more
evident: the interpretation can be fully specified without fixing any
presentation (which remains fully stylable, so that all elements may
be visually reordered and repositioned on the rendered page, or
contextually hidden or reorganized according to user preferences, or
selectively displayed and used when reimporting parts of the document
into another one).

My opinion is that the Unicode standard should avoid adding
constraints on those rich-text formats. It should only focus on the
content of plain-text elements, if they are exposed by a mechanism
like the DOM in XML and HTML, and those standards do not define any
other specific behavior. The TUS should only be there to define the
default interpretation and nothing else (if it says something it
should just be informational, to help maintain some level of limited
interoperability, but not normative as there will be lots of
reasonable exceptions).
Received on Mon Oct 10 2011 - 18:03:54 CDT
