Re: Line Separator Character

From: Martin J. Duerst
Date: Fri May 16 1997 - 16:16:48 EDT

On Fri, 16 May 1997, Pierre Lewis wrote:

> Context: plain text unicode file.

There are basically two models of plain text. The first is line-oriented,
the second is paragraph-oriented. Email and program code are the
traditional examples of line-oriented plain text. Descriptive text as it
appears in word processors, minus formatting, is the typical example of
paragraph-oriented plain text.

In traditional encoding (using CR/LF/CRLF) and in "official" Unicode
encoding (using PS), the two models are made compatible by treating
each line in the line-oriented plain text as a paragraph. On the other
hand, the paragraph-oriented model can be reduced to the line-oriented
model by splitting lines in a particular layout of the paragraph.
This splitting is again done by paragraph separators (CR/LF/CRLF/PS),
and not by LS.
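
As a rough illustration of the paragraph splitting described above, here
is a minimal Python sketch. The function name split_paragraphs is made up
for this example. It treats CR, LF, CRLF and PS as paragraph separators
and leaves LS inside the paragraph. A regular expression is used rather
than str.splitlines(), because splitlines() also breaks at U+2028.

```python
import re

# CRLF is listed first so that it is consumed as a single separator.
PARAGRAPH_BREAK = re.compile("\r\n|\r|\n|\u2029")

def split_paragraphs(text):
    """Split text at CR, LF, CRLF or PS; LS (U+2028) is left untouched."""
    return PARAGRAPH_BREAK.split(text)

if __name__ == "__main__":
    sample = "first\r\nsecond, line one\u2028second, line two\u2029third"
    for paragraph in split_paragraphs(sample):
        print(repr(paragraph))
```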

LS is only used for certain effects in the paragraph-oriented model
that occur inside a paragraph. For example, I use it in some word
processors to start a new line without having the last line aligned left
in a justified paragraph and/or without having the new line indented
like the first line of a paragraph. Its use to avoid inter-paragraph
spacing has also been mentioned. In summary, LS is an advanced
device for paragraph-oriented plain text, and is not to be used for
line-oriented plain text.
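
To make the role of LS inside a paragraph concrete, the following Python
sketch lays out one paragraph at a given width. It is only an
illustration: layout_paragraph is a hypothetical helper, and the greedy
wrapping done by textwrap stands in for whatever line breaking a real
word processor would use. The point is that LS forces a break, while all
other breaks are recomputed from the width.

```python
import textwrap

LS = "\u2028"  # LINE SEPARATOR

def layout_paragraph(paragraph, width=40):
    """Return display lines for one paragraph.

    LS forces a line break; every other break comes from wrapping.
    """
    lines = []
    for segment in paragraph.split(LS):
        wrapped = textwrap.wrap(segment, width=width)
        # An empty segment (two LS in a row) still yields one empty line.
        lines.extend(wrapped or [""])
    return lines

if __name__ == "__main__":
    text = "Dear reader, this paragraph is reflowed freely." + LS + "Regards"
    for line in layout_paragraph(text, width=24):
        print(line)
```

Changing the width reflows the text around the forced break, but the
paragraph itself stays a single unit.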

That said, let's now look at BIDI:

> Assuming we use LS to separate lines (I guess there's no answer to the
> question "what should I use"), then doesn't that interact negatively
> with bidi markup, in particular embedding markups? Ie. I have to
> reestablish the proper embedding level at each line.
> Say I have two lines, some English with embedded Yiddish (levels shown
> here, in logical order):
> 000 0000 00 00000 RLE 11 1111 NL | English RLE Yiddish NL
> 11 11111 1 11111 PDF 00 0000 ... | Yiddish PDF English ...
> Now if the newline (NL in above) is indicated by a LS (\u2028), the
> bidi state is reset between the lines. If I now start the second line
> with RLE (so as to say I'm reestablishing an embedding level), I can no
> longer tell whether I have one embedded segment or two (with a 0-level
> space between, where the LS is). Could be an issue if I later reformat
> (reflow) this text (as I might want to do in an editor).
> As a matter of fact, if the second line (after LS) starts with a strong
> R2L character and I don't reissue RLE, won't the base level be set to 1?
> This would put the following English at level 2 (not intended as the
> English isn't embedded in the Yiddish here, but the other way around).

LS is defined as a block separator, so you are right. When you
insert an LS to split the lines, your application could insert
arbitrary additional codepoints such as RLE. What it does insert
(or not) is outside of the Unicode BIDI spec, which only describes
static behaviour (what has to happen once the insertions are done),
and not dynamic interactive behaviour. The latter can be a lot more
complex if you want it to follow users' expectations, and given
that static BIDI is already difficult, I hope you get the point :-).
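
The effect in question can be sketched in a few lines of Python. This is
a deliberately simplified model and not the Unicode BIDI algorithm: it
only tracks RLE and PDF, assumes a base level of 0 in embedding_levels,
and takes the set of block separators as a parameter so that the
behaviour with and without LS as a separator can be compared. The names
embedding_levels and base_level are invented for this example.

```python
import unicodedata

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
LS = "\u2028"   # LINE SEPARATOR
PS = "\u2029"   # PARAGRAPH SEPARATOR

def embedding_levels(text, block_separators=(LS, PS, "\r", "\n")):
    """Explicit embedding level per character (None for RLE and PDF).

    Simplification: base level is always 0, and only RLE/PDF are handled.
    """
    levels = []
    stack = []
    level = 0
    for ch in text:
        if ch in block_separators:
            # A block separator discards all pending embeddings.
            stack.clear()
            level = 0
            levels.append(0)
        elif ch == RLE:
            stack.append(level)
            # Right-to-left embeddings use the next higher odd level.
            level = level + 1 if level % 2 == 0 else level + 2
            levels.append(None)
        elif ch == PDF:
            if stack:  # a PDF with nothing to pop is ignored
                level = stack.pop()
            levels.append(None)
        else:
            levels.append(level)
    return levels

def base_level(block):
    """Base level from the first strong character (0 if there is none)."""
    for ch in block:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return 0

if __name__ == "__main__":
    # Latin letters stand in for both languages; only the levels matter.
    text = "ab" + RLE + "cd" + LS + "ef" + PDF + "gh"
    print(embedding_levels(text))                       # LS resets the state
    print(embedding_levels(text, block_separators=()))  # LS is ordinary text
```

With LS in the separator set, the text after it falls back to level 0 and
the later PDF has nothing to pop. With an empty separator set the
embedding continues across the line break, which is the single embedded
segment the question had in mind.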

But when you edit BIDI text, you really should work with
paragraph-oriented plain text, without additional LSs. Then
everything will run more or less smoothly. Reformatting (reflow)
is done automatically and correctly. In those cases where you
indeed insert LSs, they will in most cases not be in the middle
of text, but at some logical interruption point, without the
need for frequent reflow.

> These problems go away if I use any combinations of CR/LF to indicate
> newline.

This might be a solution for some very special cases. But in general,
for BIDI you should use paragraph-oriented plain text, with CR/LF/
CRLF/PS as paragraph separators. I'm pretty sure that when Microsoft
implements BIDI (or the way they already do it), they will treat
CR (what they use internally) as a block separator in the BIDI

Regards, Martin.
