Doug Felt here at Taligent was kind enough to take a pass at answering
your questions. His comments are marked with "**". I have added on in a
few places, marked with "@@", but haven't looked at the examples as
carefully as Doug.
I've been trying to get a clear picture of what a "plain-text unicode
file" should look like (wrt control chars, bidi markup, &c.).
By "plain-text unicode file" I mean something that would be output by a
plain-text editor, eg. a Unicode-capable vi (Unix) or brief (DOS). No
or Web implications (altho such an editor could certainly be used to
prepare multi-lingual Web pages).
I have prepared a short text (not semantically very meaningful) with
directionalities so I can ask some concrete questions. I took the liberty
to attach the GIF to this message (about same size as the text).
Postscript and GIF versions of this text can also be seen at URL
Below, the text is shown in logical order (and all in English), with an
indication of the language in the postscript page (A=Arabic, E=English,
F=French, G=German, Y=Yiddish), and what I believe the levels should be.
Some examples of dates. In Yiddish, "Monday, the 24th February 1997".
1 E................................E Y............................Y
In German, "Monday, the 24th February 1997".
2 E.......E G...........................G
In Arabic, "Saturday March 90\3\10" (March 10, 1990)
3 E.......E A....................A E............E
"Schindler's List", so is called my favorite film. The Jew has in the
4 E.............E Y...............................................Y
ring written: "All who preserve one soul of Israel the book makes up
him as if he preserved a whole world.".
The guest has been in Berlin. He has said: "I am 49 years
7 Y........................................Y G...........G
old and am called Boutros". This means in Yiddish: "I am old 49 years
8 G...............G A.....A Y...................Y
am called Boutros" (Pierre in French).
9 Y.......Y A.....A F....F Y.......Y
o Translations are fairly literal (and not always very accurate): just
for general orientation. And there are surely imperfections in all
but the French (with just my name, I'm pretty safe here).
o line 3: I'm not too sure what the logical order of the date in Arabic
is. Could be 10\3\90 (levels 2212122 -- three level-2 numbers
level-1 backslashes) or 90\3\10 (all at level 2). Not too sure of the
exact translation of words either.
** The logical order is, in general, the spoken order. The fields of
** would probably appear in the order the putative speaker would say
** however this is one place where writing and speaking can diverge.
** it depends on the order in which the putative speaker would type
** My description of what follows assumes the order you present is
** and the desired appearance is what you present on your web site.
** Now as to the levels: This is very long, bear with me.
** Solidus (Slash) U+002F is a European Number Separator (ES).
** Reverse Solidus (Backslash) U+005C is Other Neutral (ON). You use
** reverse solidus but I'm not sure if this is to represent mirroring
** character is mirrored). Either way, neither is a strong directional
** If the digits are Roman, by rule P0 all these numbers are treated as
** Arabic Numerals because the preceding strong directional character
** is Arabic text (the 'h' in March). You may have intended them to be
** Arabic-Indic digits from the start. Either way, the digits are AN.
** If you intended Solidus (ES) this is converted to ON by rule P3. So
** either solidus or reverse solidus is ON.
** ON between AN is converted to R by rule N3(c).
** The quoted string on line 3 is thus "L R... AN AN R AN R AN AN L"
** the L characters are the quote marks surrounding the text. The
** base line direction is LTR because of the initial L (Roman 'I'), so
** the base level is 0. In rule I1 the levels thus become
** "0 1... 2 2 1 2 1 2 2 0". By application of rule L2 this first
** "Saturday March 09\3\01" as the level 2 runs are reversed, then
** "10\3\90 hcraM yadrutaS" as the levels 1&2 run is reversed.
** This is not consistent with the output on your web page. To force
** date to be formatted left to right assuming this logical order, you'd
** need to force all date characters to L. This can be done either
** before the first Roman digit, if the digits are roman, or by
** the date with LRO..PDF, if the digits are arabic-indic. Note that
** won't work because the reverse solidus, being between two AN, would
** still convert to R, instead of L as desired.
** For example, using "Saturday March [LRE]90\3\10[PDF]",
** assuming Arabic-indic digits, would resolve the levels to
** 01111111111111112443434420, progressively resulting in
** "Saturday March 09\3\01" -- level 4 reversed
** "Saturday March 10\3\90" -- levels 3 and above reversed
** "Saturday March 09\3\01" -- levels 2 and above reversed
** "10\3\90 hcraM yadrutaS" -- levels 1 and above reversed
** This is a direct result of the fact that the date is not a
** solid run of left-to-right text, because the solidus is still R.
** "Saturday March [LRO]90\3\10[PDF]" however would resolve to
** 01111111111111112222222220, progressively resulting in
** "Saturday March 01\3\09" -- level 2 reversed
** "90\3\10 hcraM yadrutaS" -- level 1 reversed.
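
The progressive reversals in the three examples above can be reproduced
mechanically. What follows is a minimal Python sketch of the reordering
step only (rule L2 as it is applied above), not of the whole algorithm.
It assumes the levels have already been resolved and takes them as input.
The name `reorder` and the level lists are illustrative; the quote marks
and the invisible LRE/LRO/PDF codes are left out, and English letters
stand in for the Arabic text, as in the examples.

```python
def reorder(text, levels):
    """Reverse runs from the highest level down to level 1.

    At each pass, every maximal run of characters whose level is at
    least the current level is reversed in place.
    """
    chars = list(text)
    for level in range(max(levels), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


text = "Saturday March 90\\3\\10"                 # logical order
implicit = [1] * 15 + [2, 2, 1, 2, 1, 2, 2]       # no explicit codes
with_lre = [1] * 15 + [4, 4, 3, 4, 3, 4, 4]       # date inside LRE..PDF
with_lro = [1] * 15 + [2] * 7                     # date inside LRO..PDF

print(reorder(text, implicit))   # 10\3\90 hcraM yadrutaS
print(reorder(text, with_lre))   # 10\3\90 hcraM yadrutaS
print(reorder(text, with_lro))   # 90\3\10 hcraM yadrutaS
```

Only the LRO case keeps the date in its typed order, because only there
is the whole date a single run at one level.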
o Quotes aren't the right ones (some should be low quotes, ...).
1) Do the levels in the above make sense (plus/minus some punctuation)?
It may be that I've totally misunderstood levels.
** Generally, they make sense, see my discussion above. Text does not
** necessarily change level simply because of a quotation, or because of
** a change in language. So in line 2, the level wouldn't change simply
** because of a switch from English to German, since the German
** characters would be L. Only LRE or LRO would do that. Since you
** don't indicate strong formatting characters, I'd have to assume they
** were present to force the levels you indicate.
2) When embedding L2R in L2R (eg German in English, line 2) or R2L in
(eg. Arabic in Yiddish, line 9, or Hebrew in Yiddish, line 5), should
I use LRE/PDF and RLE/PDF (even though the direction doesn't change)?
** Generally, you wouldn't need to.
3) The second and third paragraphs are right-aligned (R2L main
How do I indicate this? I thought of making each paragraph a block
(separating them with PS, paragraph separator), and starting each
with a strong char of the appropriate directionality. In the second
paragraph, this would mean starting the block with RLM (since the
letters are English). Ie. if base level is odd, main directionality
and the text is right aligned.
Or, other possibility, starting a right-adjusted paragraph with RLE?
But then what about a left-adjusted paragraph that starts with R2L
** Either way would work. Alignment depends on the base line direction,
** which is determined by the first strong character in the block. The
** explicit directional formatting codes LRE, RLE, LRO, RLO as well as
** RLM and LRM are all strong directional characters. LTR text within
** a RLE embedding will still format LTR, but the overall run of text
** within the embedding will be RTL.
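
As an illustration of the first-strong-character rule described above,
here is a small Python sketch that picks the base level of a block. It is
an assumption-laden simplification: the function name `base_level` is made
up, and it counts only characters whose bidirectional class is L, R, or AL
(which includes LRM and RLM). Treating the embedding and override codes as
strong, as the answer above describes, would be an extension of this
sketch.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def base_level(block):
    """Return 0 (left-to-right) or 1 (right-to-left) for a block.

    The first character with a strong class decides; a block with no
    strong character defaults to 0 in this sketch.
    """
    for ch in block:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return 0
        if cls in ("R", "AL"):
            return 1
    return 0


print(base_level("The guest has been in Berlin."))        # 0
print(base_level(RLM + "The guest has been in Berlin."))  # 1
```

Prefixing RLM to a paragraph that begins with Latin letters is the first
approach proposed in the question: the base level becomes odd, so the
paragraph is right-aligned.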
4) What should I use to separate lines? LS or CR or LF or CR/LF? If I
LS, which is a block separator, doesn't that interact negatively with
markup (control chars), in particular embedding markups? Ie. I have
reestablish the proper level at each line. And what happens with
Couldn't this cause confusion? If I have two lines (in logical order)
000 0000 00 00000 RLE 11 1111 LS | English RLE Yiddish LS
11 11111 1 11111 00 0000 ... | Yiddish English ...
and reissue an RLE at start of second, I can no longer tell whether
I have one embedded segment or two (with a 0-level space between,
the LS is). Could be an issue if I later reformat (reflow) this text
I might want to do in an editor).
As a matter of fact, if the second line (after LS) starts with a
R2L character and I don't reissue RLE, won't the base level be set to
This would put the following English at level 2 (not intended as the
English isn't embedded in the Yiddish here, but the other way
(I haven't read the recent thread on LS very carefully yet, but it's
not too reassuring: lots of opinions)
@@ The standard is pretty clear. Most of those opinions are from people
@@ who have not read it. Think of these characters in terms of what you
@@ use in a word processor.
@@ For Microsoft word or FrontPage, think of LS as the
@@ character that you get with shift-Return
@@ (causing no paragraph spacing or indent),
@@ and PS as what you get with Return.
@@ (on the Mac, this would be option-Return).
** This is a good observation! We believe the current standard is in
** error and should categorize LS as whitespace instead of as a block
** This would allow LS characters to be inserted wherever whitespace
** appears and not interfere with explicit formatting codes.
** That said, the explicit formatting codes are basically intended for
** text interchange only. They pose several problems for editing. One
** is that it is easy to radically alter the text by inserting, copying, or
** one of these codes. This can reorder the text within the block and
** completely change the text on several lines. Similarly, the default
** base line direction rule can be problematic, as changes to the text
** the start of a block can change the base line direction. Users might
** have difficulty editing unless the editor provides some support (such
** as assisting the user to insert/delete explicit formatting codes and
** their matching PDFs as a unit).
@@ For actual editing of text with different directions, it is far
@@ easier to have out-of-band style information with explicit embedding
@@ levels,
@@ as mentioned briefly on page 3-22.
** Additionally, text reordering after levels are computed is done on a
** line by line basis. Depending on where line breaks occur, different
** text may appear on a line, and in different orders. This is
** of the issue of how to represent line breaks-- if they are
** external to the text (a line break table, based on wrapping to some
** width or character count, say) this still happens. This makes
** lines somewhat more of an issue than it is with ASCII text.
5) Does PS imply LS? Or would I end a paragraph with LS PS?
** Yes, use only PS to separate paragraphs.
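
A minimal Python sketch of that convention, assuming PS (U+2029) alone
separates paragraphs and LS (U+2028) only breaks lines inside a paragraph,
as in the word-processor analogy above. The function name
`paragraphs_and_lines` is hypothetical, and the sketch does nothing about
how LS interacts with the bidi formatting codes.

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def paragraphs_and_lines(text):
    """Split text into paragraphs on PS, then each paragraph into lines on LS."""
    return [para.split(LS) for para in text.split(PS)]


sample = "first line" + LS + "second line" + PS + "next paragraph"
print(paragraphs_and_lines(sample))
# [['first line', 'second line'], ['next paragraph']]
```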
6) Imagine I want to start the third paragraph on a new page. Where do I
put the FF (wrt the LS/CR/LF/ and bidi markup in the vicinity)?
** FF is higher-level formatting, you'd have to interpret it separately.
@@ In particular, you would definitely interpret it as a block
7) Any specific bidi markup required around the numerals?
In the Arabic date: if levels intended are 2212122, would I need
markup? I would think I would need:
LRO number PDF \ LRO number PDF \ LRO number PDF
(so that the \s, which are "other neutral", stay at level 1)?
** Almost, see my example above. In your example, the separate runs
** of LTR text would occur in RTL order, reversing the year and day of
** the date from what your example shows.
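
To make the two markups concrete, here is a short Python sketch that
builds both strings. The helper name `force_ltr` is made up for
illustration; the code points are those of LRO (U+202D) and PDF (U+202C).
It only constructs the text and does not run the bidi algorithm.

```python
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def force_ltr(s):
    """Wrap s in LRO..PDF so its characters are all treated as L."""
    return LRO + s + PDF


date = "90\\3\\10"

# One override around the whole date: a single left-to-right run.
whole = force_ltr(date)

# One override per number, backslashes left outside (the markup
# proposed in the question): three separate left-to-right runs.
per_number = "\\".join(force_ltr(n) for n in date.split("\\"))

print(repr(whole))
print(repr(per_number))
```

With the per-number form, the backslashes stay at level 1, so the three
runs are laid out right to left, which is the swap of year and day that
the answer describes.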
8) What is the intent (as opposed to the effect which the algo surely
clear) of RLE and LRE? When are they useful? (Relates to question
** Quoted text where the text itself contains mixed directions is a
** case. You can see it (implicitly) in the examples for rule L2. The
** logically belong to the surrounding text, and the embedding codes are
** just inside the quotes.
@@ In the vast majority of cases, it is not necessary. The important
@@ those that Doug mentioned.
@@ RLO and LRO are even more infrequent, and are designed to allow for
@@ such part numbers with mixed numbers and letters, where the character
@@ order is forced.
9) A typesetting question. Where do quotes belong in
texts (eg. in line 7)? Should they be at the same level as the text
introducing the quote? Or at the level of the text being quoted. On
line 7, should the quote be at the end of the line instead of where I
put it (in the PS file)? Can't say I'm comfortable with either
And what style of quotes does one use? That of the quoting or of the
** Quotes are at the same level as the text introducing the quote.
@@ In general, you expect the style of the quotes to be the same as the
@@ text, not the embedded text. However, that is up to the user's
Thanks in advance for any clarifications.