Re: Plain text: Amendment 1

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Mon Jul 05 1999 - 18:48:46 EDT


> On Mon, Jul 05, 1999 at 03:16:01AM -0700, keld@dkuug.dk wrote:
> > 3) could be something like one out of 3:
> >
> > 1. CR
> > 2. LF
> > 3. CR LF
>
> To clarify: I think "line break" could follow the conventions
> currently in use on the Internet: Accept all of the three above forms,
> but only generate one form, preferably the CR LF sequence.
>
> It seems like the Internet is going to standardize on UTF-8,
> and as UTF-8 encodes C0 as a single octet, I think there would be
> much sense in choosing a C0 sequence for the "line break" function.
>
> I think the paragraph break could then be chosen as one of
> the C0 Information separators, possibly the Record Separator
> aka control-^ .
>
I think the problem with this idea is that if we look at a Unicode
text file and see CR and/or LF in it, we don't know if those
characters came from the private text format of a 7- or 8-bit file
that was converted to Unicode without any record-format conversion,
or if they are the "Unicode" CR and LF. Therefore this would only
move the problem of incompatible record formats from the old world
(of DOS, Windows, UNIX, Macintosh) to the new one.
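
To make the ambiguity concrete, here is a minimal sketch of the "accept
all three forms, generate one" convention from the quoted message. The
function name normalize_breaks and the sample strings are illustrative
assumptions, not anything taken from a standard. It shows that a reader
who sees only CR and LF has to guess at the convention, and that after
normalizing, nothing records which convention the data started in.

```python
import re

# CR LF is listed first so the pair counts as one break, not two.
_BREAK = re.compile(r"\r\n|\r|\n")

def normalize_breaks(text: str, out: str = "\r\n") -> str:
    """Accept CR, LF, or CR LF on input; emit a single form on output."""
    return _BREAK.sub(out, text)

dos = "one\r\ntwo\r\n"   # CR LF (DOS, Windows)
unix = "one\ntwo\n"      # LF (UNIX)
mac = "one\rtwo\r"       # CR (Macintosh)

# All three inputs collapse to the same output, so the original
# record format can no longer be told apart afterwards.
assert normalize_breaks(dos) == normalize_breaks(unix) == normalize_breaks(mac)
```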

It's better to have Unicode characters LS and PS (and I think also
Tab/Column-Separator and Page Separator) than to recycle the C0
controls. This ensures round-trip integrity without having to know
the history of the data ("it came originally from DOS so to convert
it from Unicode to UNIX we need to...")
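
A rough sketch of what that buys us, assuming each platform's converter
maps its own line break to LS (U+2028) on the way into Unicode and back
again on the way out. The NATIVE_BREAK table and the import_text and
export_text names are hypothetical, chosen only for illustration, and PS
(U+2029) is left out because the old formats have no agreed equivalent.

```python
LS = "\u2028"  # U+2028 LINE SEPARATOR

# Hypothetical table of platform line-break conventions.
NATIVE_BREAK = {"dos": "\r\n", "unix": "\n", "mac": "\r"}

def import_text(text: str, platform: str) -> str:
    """Replace the source platform's line break with LS."""
    return text.replace(NATIVE_BREAK[platform], LS)

def export_text(text: str, platform: str) -> str:
    """Replace LS with the target platform's line break."""
    return text.replace(LS, NATIVE_BREAK[platform])

# DOS text goes into Unicode once; the export step for UNIX needs to
# know only the target platform, not where the data came from.
canonical = import_text("one\r\ntwo\r\n", "dos")
assert canonical == "one\u2028two\u2028"
assert export_text(canonical, "unix") == "one\ntwo\n"
```

The export step never looks at CR or LF, so any C0 controls still present
in the data pass through untouched instead of being reinterpreted.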

- Frank


