LS and PS (was: Re: Whitespace characters in Unicode)

From: Doug Ewell <doug_at_ewellic.org>
Date: Sat, 6 Aug 2016 12:30:31 -0600

Markus Scherer wrote:

> In hindsight, LS and PS are probably mistakes. When we came up
> with Pattern_White_Space, we still liked the idea of unambiguous
> end-of-line controls, but in practice it looks like no one really uses
> them. Anyone who cares uses markup or rich-text formats. (Markup was
> not common when Unicode was "born".)

I've often felt that the rise of UTF-8 spelled the end for LS and PS.

Unicode was originally a completely new text format, exactly 16 bits per
character. Conversion to ASCII and other byte-based encodings was an
explicit process. Dedicated characters for LS and PS were a
simplification, removing the notorious confusion over CR versus LF
versus CRLF.

UTF-8 brought ASCII backward compatibility to Unicode, removing early
objections that "Unicode will double my text size" but requiring
continued use of ASCII controls to maintain that compatibility.
Implementers saw the existing CR/LF/CRLF muddle as a problem already
solved, and LS and PS as new complications with no historical
justification.

Additionally, in UTF-8, LS and PS each take three bytes, one more than
CR plus LF combined, so the "increased text size" argument also
discouraged use of the new controls.
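
The byte counts are easy to check. The following is only a minimal
Python sketch, not anything taken from the standard; the SEPARATORS
table and the encoded_sizes helper are names made up for this
illustration. It encodes each line-ending convention in UTF-8 and in
UTF-16 (big-endian, no BOM) and reports the lengths.

```python
# Line-ending conventions under discussion. The dictionary keys are
# just labels chosen for this sketch.
SEPARATORS = {
    "LF": "\n",
    "CR": "\r",
    "CRLF": "\r\n",
    "LS": "\u2028",  # LINE SEPARATOR
    "PS": "\u2029",  # PARAGRAPH SEPARATOR
}


def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
    }


for name, text in SEPARATORS.items():
    sizes = encoded_sizes(text)
    print(f"{name:4} utf-8={sizes['utf-8']} utf-16={sizes['utf-16']}")
```

In a 16-bit encoding LS or PS costs two bytes against four for CRLF,
so the dedicated controls were the compact choice there. In UTF-8 the
comparison flips: three bytes for LS or PS against two for CRLF and
one for a bare LF or CR.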

--
Doug Ewell | Thornton, CO, US | ewellic.org
Received on Sat Aug 06 2016 - 13:31:15 CDT
