Re: Unicode plain-text file

From: Pierre Lewis (lew@nortel.ca)
Date: Wed May 21 1997 - 13:53:00 EDT


Doug/Mark,
Thanks a lot for your answers. They clarify a lot of things.

> ** This is not consistent with the output on your web page. To force the
> ** date to be formatted left to right assuming this logical order, you'd
> ** need to force all date characters to L. This can be done either using an LRM
> ** before the first Roman digit, if the digits are roman, or by surrounding
> ** the date with LRO..PDF, if the digits are arabic-indic. Note that LRE
> ** won't work because the reverse solidus, being between two AN, would
> ** still convert to R, instead of L as desired.

I finally had a chance to chat with my Arab friend to whom I owe this
short fragment. It is visually correct (on GIF/PS), but my logical
ordering was wrong. The logical order is 10\3\90. So it seems that
things should automatically fall into place with no extra markup.

It is a reverse solidus. The digits are arabic-indic (U+066x).
So the reverse solidus, an ON, stays R as needed by virtue of the ANs
being treated as Rs for the purpose of resolving neutrals. Not simple,
but effective. That section of the standard really requires careful
reading and exploring :-).
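
For anyone who wants to check this, here is a small Python sketch. The bidi
classes come from the standard unicodedata module. The helper
resolve_neutrals_simplified is hypothetical and deliberately incomplete: it
only handles ON, only the "same direction on both sides" case, and ignores
embedding levels and every other rule of the algorithm.

```python
import unicodedata

def bidi_classes(text):
    """Return the bidi class of each character, as recorded in unicodedata."""
    return [unicodedata.bidirectional(ch) for ch in text]

def resolve_neutrals_simplified(classes):
    """Simplified neutral resolution (an assumption-laden sketch, not the
    full algorithm): a run of ON between two sides of the same direction
    takes that direction, with numbers (AN, EN) counting as R."""
    def side(c):
        if c == "L":
            return "L"
        if c in ("R", "AL", "AN", "EN"):
            return "R"
        return None

    out = list(classes)
    i, n = 0, len(out)
    while i < n:
        if out[i] != "ON":
            i += 1
            continue
        j = i
        while j < n and out[j] == "ON":
            j += 1
        before = side(out[i - 1]) if i > 0 else None
        after = side(out[j]) if j < n else None
        if before is not None and before == after:
            for k in range(i, j):
                out[k] = before
        i = j
    return out

# The date 10\3\90 in logical order, written with Arabic-Indic digits.
date = "\\".join(["\u0661\u0660", "\u0663", "\u0669\u0660"])
print(bidi_classes(date))
print(resolve_neutrals_simplified(bidi_classes(date)))
```

With the digits classed as AN on both sides, each reverse solidus resolves
to R in this sketch, which matches the reading above.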

> ... So in line 2, the level wouldn't change simply
> ** because of a switch from English to German, since the German
> ** characters would be L. Only LRE or LRO would do that. Since you
> ** don't indicate strong formatting characters, I'd have to assume they
> ** were present to force the levels you indicate.

The levels as shown are what I believe(d) they should be. I didn't
include the required BIDI markup, but would assume that the application
that outputs the file for this text would include whatever is necessary
to achieve this result. So you assumed correctly.

> @@ The standard is pretty clear. Most of those opinions are from people
> @@ who have not read it. Think of these characters in terms of what you
> @@ use in a word processor.
> @@ For Microsoft word or FrontPage, think of LS as the
> @@ character that you get with shift-Return
> @@ (causing no paragraph spacing or indent),
> @@ and PS as what you get with Return.
> @@ (on the Mac, this would be option-Return).

Thinking in terms of a word processor is what I'm trying to get away
from, because it's not really open. (And I live on Unix :-))

When I open up a file using vi on Unix, I can't tell if this file was
created with vi, emacs, pine, ed, sed, awk or whatever. There are still
issues (CR/LF/CRLF, TAB, FF placement, top 128 codes) with plain-text
ASCII files, but still, it is a very useful concept. Imagine if I had
to open mail from user A with vi, from user B with emacs, from user C
with pine because that's what each used to write to me. It would be
chaos.

Unfortunately, if we can't agree on some conventions for plain-text
Unicode files, we're going to get into this situation to some extent.
Right now, if I want to be as flexible as possible (in an editor, say),
I have to deal with 4 new-line conventions (maybe 5): CR, LF, CRLF, LS,
maybe NL. I have to deal with various placements of FFs. And I may have
to deal with various uses and misuses of some of the new codes.
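
To make the cost concrete, here is a minimal Python sketch of a splitter
that accepts all five conventions. The name split_lines is hypothetical,
and treating every one of these as an equivalent line break is itself an
assumption, not something the standard settles for me.

```python
import re

# The five conventions: CRLF, CR, LF, NL (U+0085), LS (U+2028).
# CRLF is listed first so that it is consumed as one break, not two.
_LINE_BREAK = re.compile("\r\n|[\r\n\u0085\u2028]")

def split_lines(text):
    """Split text on any of the five line-break conventions."""
    return _LINE_BREAK.split(text)

print(split_lines("a\r\nb\rc\nd\u2028e\u0085f"))
```

Python's built-in str.splitlines is not used here because it also breaks
on FF, VT, PS and a few other codes, which is a policy decision the editor
should make explicitly.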

> ** This is a good observation! We believe the current standard is in
> ** error and should categorize LS as whitespace instead of as a block
> ** separator.

I'll consider it changed.

> ** That said, the explicit formatting codes are basically intended for static
> ** text interchange only. They pose several problems for editing. One is that it
> ** is easy to radically alter the text by inserting, copying, or deleting

I wouldn't let a user directly input/modify BIDI markup! Rather I'd have
him/her tell the editor what a piece of text should look like, then let
the editor issue whatever markup is required to achieve this at the time
the file is written out.
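
A rough Python sketch of what I mean, under heavy assumptions: the editor
keeps runs of text with a "force left-to-right" flag in its own model (the
(text, force_ltr) pairs and the name serialize are made up for
illustration), and only emits LRO..PDF when writing the file. A real
editor would have to choose among LRM, LRE, LRO and so on, which this
does not attempt.

```python
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def serialize(runs):
    """Write out editor runs, wrapping flagged runs in LRO..PDF.

    runs is a list of (text, force_ltr) pairs from the hypothetical
    editor model; the user never types the control codes directly.
    """
    parts = []
    for text, force_ltr in runs:
        parts.append(LRO + text + PDF if force_ltr else text)
    return "".join(parts)

out = serialize([("abc ", False), ("12", True)])
print([hex(ord(ch)) for ch in out])
```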

> ** FF is higher-level formatting, you'd have to interpret it separately.
> @@ In particular, you would definitely interpret it as a block separator.

That's one area where I'd love more guidance from Unicode. FF is, I think,
a reasonable requirement for plain-text files, so I would have liked
Unicode to tell me more about it, or provide a PAS -- page separator.
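
Lacking that guidance, the simplest reading I can think of is to treat FF
as a page break and handle it before any line processing. A minimal
Python sketch, where split_pages is a hypothetical name and "FF always
separates pages" is my assumption:

```python
FF = "\f"  # U+000C FORM FEED

def split_pages(text):
    """Split text into pages on FF; line breaks inside a page are kept."""
    return text.split(FF)

print(split_pages("page one\nmore\fpage two"))
```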

Pierre
lew@nortel.ca

P.S.1. I was shocked, when I visited the IUC10 Web site, to find HTML
pages in Unicode, but no plain-text files. Yes, let Unicode be able to
stand on its own (as fdc@watsun.cc.columbia.edu writes)!

P.S.2. Btw, one thing I love about "plain-text" files is that they have
the best chances of surviving. If I write stuff today that my 3-year-old
will want to read when he turns 33, my only choice is plain text.
To write for him in French, plain-text ASCII (with the Latin1
assumption) is just fine. But if I wanted to add some notes in Greek,
Russian or Yiddish, I would need more than just the ASCII conventions
and the Latin1 codepage.
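
A quick way to see the limit, sketched in Python with a hypothetical
helper fits_latin1; it relies only on the standard "latin-1" codec:

```python
def fits_latin1(text):
    """Return True if every character of text can be encoded in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

print(fits_latin1("Ça va très bien"))  # French sample
print(fits_latin1("καλημέρα"))         # Greek sample
```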

P.S.3. Someone in this thread stated that LF was a paragraph separator
in Unix. I see it as a line separator.


