This document describes guidelines for how to handle different characters used to represent new lines on different platforms.
Status of this document
This document has been considered and approved by the Unicode Technical Committee for publication as a Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative or as normative specification. Please mail corrigenda and other comments to email@example.com.
Newlines are represented on different platforms by carriage return (CR), line feed (LF), CRLF, or new line (NL). Unfortunately, not only are newlines represented by different characters on different platforms, they also have ambiguous behavior even on the same platform. Especially with the advent of the web, where text on a single machine can arise from many sources, this causes a significant problem.
Unfortunately, these characters are often transcoded directly into the corresponding Unicode codes when a character set is transcoded; this means that even programs handling pure Unicode have to deal with the problems.
The following describes a set of recommendations for handling these characters so as to minimize the effects on users.
The following table provides hexadecimal values for the acronyms used in the text. The Unicode Standard does not formally assign control characters, instead in provides the 65 code values for use as in the 7 and 8-bit standards. See The Unicode Standard, Version 2.0, Section 2.6 Controls and Control Sequences.
For clarity, when referring to the function that a particular character has, we will use lowercase (e.g., paragraph separator); when referring to the specific characters that represent those functions, we will use titlecase or an acronym (e.g., Paragraph Separator or PS).]
The term NLF stands for different characters, depending on the platform, that is, any of CR, LF, CRLF, or NL.
A paragraph separator is used to indicate a separation between paragraphs, while a line separator indicates where a line break alone should occur, typically within a paragraph. For example:
This is a paragraph with a line separator at this point,
causing the word "causing" to appear on a different line, but not causing the typical paragraph indentation, sentence-breaking, line spacing, or change in flush (right, center or left paragraphs).
For comparison, line separators basically correspond to HTML <BR>, and paragraph separators to <P>. In word processors, paragraph separators are usually entered using a keyboard RETURN or ENTER; line separators are usually entered using a modified RETURN or ENTER, such as SHIFT-ENTER.
A record separator is used to separate records. For example, when exchanging tabular data, a common format is to tab-separate the cells, and use a CRLF at the end of a line of cells. This function is not precisely the same as line separation, but the same characters are often used.
Traditionally, NLF started out as a line separator (and sometimes record separator). It is still used as a line separator in simple text editors such as program editors. As platforms and programs started to handle word processing with automatic line-wrap, these characters were reinterpreted to stand for paragraph separators. For example, even such simple programs as the Notepad programs supplied with the Macintosh or Windows system software interpret their platform's NLF as a paragraph separator.
Once NLF was reinterpreted to stand for a paragraph separator, in some cases some other control character was impressed into service as a line separator. For example, vertical tabulation VT is used in Microsoft Word. However, the choice of character for line separator is even less standardized than the choice of character for NLF.
Yet, there is a lot of legacy text on older systems that treats NLF as a line separator, including internet email gateways, so you can't just simply treat NLF as a paragraph separator.
The Unicode Standard defines two unambiguous separator characters, Paragraph Separator (PS = 202916) and Line Separator (LS = 202816). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous. Otherwise, the following specifies how to cope with an NLF when converting from other character sets to Unicode, when interpreting characters in Unicode, and when converting to other character sets.
|Even if you know which character(s) represents NLF on your particular platform, on input and in interpretation, treat CR, LF, CRLF, and NL the same. Only on output do you need to distinguish between them.|
FF is commonly used as a page separator, and it should be interpreted that way in text. In most parsing, or in readline, this amounts to interpreting it in the same way as a LS. When displaying on the screen, it causes the text after the separator to be forced to the next page. It should be independent of paragraph separation: a paragraph can start on one page and continue on the next page.
Copyright © 1998-1998 Unicode, Inc.. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
Unicode Home Page: http://www.unicode.org
Unicode Technical Reports: http://www.unicode.org/unicode/techreports.html