Re: Plain text: Amendment 1

From: Mark Davis (mark@macchiato.com)
Date: Mon Jul 05 1999 - 23:34:09 EDT


A lot of the discussion of line termination relates to technical report #13.
Any suggestions for additional information for that report would be welcome.

(http://www.unicode.org/unicode/reports/tr13/)

Mark

Frank da Cruz wrote:

> > On Mon, Jul 05, 1999 at 03:16:01AM -0700, keld@dkuug.dk wrote:
> > > 3) could be something like one out of 3:
> > >
> > > 1. CR
> > > 2. LF
> > > 3. CR LF
> >
> > To clarify: I think "line break" could follow the conventions
> > currently in use on the Internet: Accept all of the three above forms,
> > but only generate one form, preferably the CR LF sequence.
> >
> > It seems like the Internet is going to standardize on UTF-8,
> > and as UTF-8 encodes C0 as a single octet, I think there would be
> > much sense in choosing a C0 sequence for the "line break" function.
> >
> > I think the paragraph break could then be chosen as one of
> > the C0 Information separators, possibly the Record Separator
> > aka control-^ .
> >
> I think the problem with this idea is that if we look at a Unicode
> text file and see CR and/or LF in it, we don't know if those
> characters came from the private text format of a 7- or 8-bit file
> that was converted to Unicode without any record-format conversion,
> or if they are the "Unicode" CR and LF. Therefore this would only
> move the problem of incompatible record formats from the old world
> (of DOS, Windows, UNIX, Macintosh) to the new one.
>
> It's better to have Unicode characters LS and PS (and I think also
> Tab/Column-Separator and Page Separator) than to recycle the C0
> controls. This ensures round-trip integrity without having to know
> the history of the data ("it came originally from DOS so to convert
> it from Unicode to UNIX we need to...")
>
> - Frank
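
For illustration only, here is a minimal Python sketch of the two
approaches discussed in the quoted text: accepting CR, LF, or CR LF on
input while generating a single form, and mapping legacy line breaks to
the Unicode LINE SEPARATOR (U+2028). The function names and the choice
to map every break to LS are assumptions made for this example; they do
not come from TR13 or from either poster.

```python
import re

# CR LF is listed first so that the pair counts as one line break,
# not as a CR followed by a separate LF.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR (listed for reference, unused below)


def normalize_line_breaks(text, newline="\r\n"):
    """Accept CR, LF, or CR LF on input; emit only `newline` on output."""
    return _LINE_BREAK.sub(newline, text)


def to_line_separators(text):
    """Replace every legacy line break with LS (hypothetical policy)."""
    return _LINE_BREAK.sub(LS, text)


if __name__ == "__main__":
    mixed = "one\rtwo\nthree\r\nfour"
    print(repr(normalize_line_breaks(mixed)))
    print(repr(to_line_separators(mixed)))
```

Note that once the breaks have been rewritten, the original platform
convention can no longer be recovered from the text alone, which is the
round-trip concern raised in the quoted message.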


