Unicode Technical Report #13
Unicode Newline Guidelines

Revision 3
Authors Mark Davis
Date 1998-04-01
This Version http://www.unicode.org/unicode/reports/tr13-3
Previous Version n/a
Latest Version http://www.unicode.org/unicode/reports/tr13

Summary

This document describes guidelines for how to handle different characters used to represent new lines on different platforms.

Status of this document

This document has been considered and approved by the Unicode Technical Committee for publication as a Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative or as normative specification. Please mail corrigenda and other comments to errata@unicode.org.


Introduction

Newlines are represented on different platforms by carriage return (CR), line feed (LF), CRLF, or new line (NL). Unfortunately, not only are newlines represented by different characters on different platforms, they also have ambiguous behavior even on the same platform. Especially with the advent of the web, where text on a single machine can arise from many sources, this causes a significant problem.

Unfortunately, these characters are often mapped directly to the corresponding Unicode code values when text is transcoded from another character set; this means that even programs handling pure Unicode have to deal with the problems.

Scope

The following describes a set of recommendations for handling these characters so as to minimize the effects on users.

Definitions

The following table provides hexadecimal values for the acronyms used in the text. The Unicode Standard does not formally assign control characters; instead, it provides the 65 code values for use as in the 7- and 8-bit standards. See The Unicode Standard, Version 2.0, Section 2.6 Controls and Control Sequences.

Abbreviations

         ASCII    EBCDIC   Unicode
CR       0D       0D       000D
LF       0A       25       000A
CRLF     0D,0A    0D,25    000D,000A
NL*      n/a      15       0015
VT       0B       0B       000B
FF       0C       0C       000C
LS       n/a      n/a      2028
PS       n/a      n/a      2029

For clarity, when referring to the function that a particular character has, we will use lowercase (e.g., paragraph separator); when referring to the specific characters that represent those functions, we will use titlecase or an acronym (e.g., Paragraph Separator or PS).

The term NLF stands for the platform-dependent newline character or sequence, that is, any of CR, LF, CRLF, or NL.


Background

A paragraph separator is used to indicate a separation between paragraphs, while a line separator indicates where a line break alone should occur, typically within a paragraph. For example:

This is a paragraph with a line separator at this point,
causing the word "causing" to appear on a different line, but not causing the typical paragraph indentation, sentence-breaking, line spacing, or change in flush (right, center or left paragraphs).

For comparison, line separators basically correspond to HTML <BR>, and paragraph separators to <P>. In word processors, paragraph separators are usually entered using a keyboard RETURN or ENTER; line separators are usually entered using a modified RETURN or ENTER, such as SHIFT-ENTER.

A record separator is used to separate records. For example, when exchanging tabular data, a common format is to tab-separate the cells, and use a CRLF at the end of a line of cells. This function is not precisely the same as line separation, but the same characters are often used.

Traditionally, NLF started out as a line separator (and sometimes record separator). It is still used as a line separator in simple text editors such as program editors. As platforms and programs started to handle word processing with automatic line-wrap, these characters were reinterpreted to stand for paragraph separators. For example, even such simple programs as the Notepad programs supplied with the Macintosh or Windows system software interpret their platform's NLF as a paragraph separator.

Once NLF was reinterpreted to stand for a paragraph separator, in some cases some other control character was impressed into service as a line separator. For example, vertical tabulation VT is used in Microsoft Word. However, the choice of character for line separator is even less standardized than the choice of character for NLF.

Yet a great deal of legacy text on older systems, including text that passes through internet email gateways, still treats NLF as a line separator, so you cannot simply treat every NLF as a paragraph separator.


Recommendations

The Unicode Standard defines two unambiguous separator characters, Paragraph Separator (PS = U+2029) and Line Separator (LS = U+2028). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous. Otherwise, the following specifies how to cope with an NLF when converting from other character sets to Unicode, when interpreting characters in Unicode, and when converting to other character sets.

Note: Even if you know which character(s) represents NLF on your particular platform, on input and in interpretation, treat CR, LF, CRLF, and NL the same. Only on output do you need to distinguish between them.

Converting from other character code sets

  1. If you do know the exact usage of any NLF, then convert it to LS or PS.
  2. If you don't know the exact usage of any NLF, remap it to your platform NLF. (This doesn't really help you in interpreting Unicode text unless you are the only source of that text, since someone else may have left in LF, CR, CRLF, or NL.)
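
The two rules above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class NlfImport, the enum Usage, and the method convert are hypothetical names, the input is assumed to be already decoded into a Java String that still contains the legacy NLF characters, and NL is taken as 0015 from the table above (many converters map EBCDIC NL to a different code value, so adjust the constant to your transcoder).

```java
final class NlfImport {
    /** What the caller knows about how the source text uses its NLF. */
    enum Usage { LINE, PARAGRAPH, UNKNOWN }

    static final char LS = '\u2028';
    static final char PS = '\u2029';
    // Assumption: value taken from the table in this report (NL = 0015).
    static final char NL = '\u0015';

    /**
     * Replaces every NLF (CR, LF, CRLF, or NL) in text.
     * Known usage maps to LS or PS; unknown usage maps to platformNlf.
     */
    static String convert(String text, Usage usage, String platformNlf) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r' || c == '\n' || c == NL) {
                // CRLF counts as a single NLF, so skip the LF after a CR.
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                switch (usage) {
                    case LINE:
                        out.append(LS);
                        break;
                    case PARAGRAPH:
                        out.append(PS);
                        break;
                    default:
                        out.append(platformNlf);
                        break;
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, NlfImport.convert(text, NlfImport.Usage.UNKNOWN, System.getProperty("line.separator")) applies rule 2, leaving the choice between line and paragraph separation to whoever interprets the text later.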

Interpreting characters in text

  1. Always interpret PS as paragraph separator and LS as line separator.
  2. In word processing, interpret any NLF the same as PS.
  3. In text processing (e.g. program editors), interpret any NLF the same as LS.
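  4. In parsing, choose the safest interpretation. For example, if you are dealing with sentence-break heuristics, it is safer to interpret any NLF as an LS. If an NLF meant as a PS is read as an LS, only those paragraphs that do not already end in sentence-final punctuation are affected. If an NLF meant as an LS is read as a PS, every line break inside a sentence becomes a spurious sentence break.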
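
As a rough sketch of the interpretation rules above, the following Java code reports the function of each separator it finds in a string. The names SeparatorInterpreter, Context, Role, nlfLength, and separators are hypothetical, NL is again assumed to be 0015 as in the table, and the PARSING context simply applies the "safer" LS reading from rule 4.

```java
import java.util.ArrayList;
import java.util.List;

final class SeparatorInterpreter {
    enum Context { WORD_PROCESSING, TEXT_PROCESSING, PARSING }
    enum Role { LINE_SEPARATOR, PARAGRAPH_SEPARATOR }

    static final char LS = '\u2028';
    static final char PS = '\u2029';
    // Assumption: value taken from the table in this report (NL = 0015).
    static final char NL = '\u0015';

    /** Length of the NLF starting at index i (2 for CRLF), or 0 if none. */
    static int nlfLength(CharSequence s, int i) {
        char c = s.charAt(i);
        if (c == '\r') {
            return (i + 1 < s.length() && s.charAt(i + 1) == '\n') ? 2 : 1;
        }
        return (c == '\n' || c == NL) ? 1 : 0;
    }

    /** Returns the role of every separator in text, in order of occurrence. */
    static List<Role> separators(String text, Context context) {
        // Rules 2 to 4: only word processing reads an NLF as a paragraph break.
        Role nlfRole = (context == Context.WORD_PROCESSING)
                ? Role.PARAGRAPH_SEPARATOR
                : Role.LINE_SEPARATOR;
        List<Role> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int n = nlfLength(text, i);
            if (c == PS) {
                found.add(Role.PARAGRAPH_SEPARATOR); // rule 1
            } else if (c == LS) {
                found.add(Role.LINE_SEPARATOR);      // rule 1
            } else if (n > 0) {
                found.add(nlfRole);
            }
            i += Math.max(n, 1);
        }
        return found;
    }
}
```

Note that CR, LF, CRLF, and NL all go through the same branch, so the result does not depend on which platform produced the text.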

Converting to other character code sets

  1. If you know the intended target, map NLF, LS, and PS appropriately, depending on the target conventions. For example, for Microsoft Word you would map LS to VT, and PS and any NLF to CRLF.
  2. If you don't know the intended target, map NLF, LS, and PS to the platform newline convention (CR, LF, CRLF, or NL). In Java, for example, this is done by mapping to a string nlf, defined as:
    String nlf = System.getProperty("line.separator");
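
Both output rules can be expressed with one mapping routine, as in the sketch below. The names NlfExport, convert, toPlatform, and toWordConventions are hypothetical; toWordConventions encodes only the mapping named in rule 1 (LS to VT, PS and NLF to CRLF) and is not taken from any Microsoft Word documentation. NL is assumed to be 0015 as in the table above.

```java
final class NlfExport {
    static final char LS = '\u2028';
    static final char PS = '\u2029';
    // Assumption: value taken from the table in this report (NL = 0015).
    static final char NL = '\u0015';
    static final String VT = "\u000B";
    static final String CRLF = "\r\n";

    /** Replaces each NLF, LS, and PS in text with the given output strings. */
    static String convert(String text, String nlfOut, String lsOut, String psOut) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // CRLF counts as a single NLF, so skip the LF after a CR.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(nlfOut);
            } else if (c == '\n' || c == NL) {
                out.append(nlfOut);
            } else if (c == LS) {
                out.append(lsOut);
            } else if (c == PS) {
                out.append(psOut);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Rule 2: unknown target, so everything becomes the platform newline. */
    static String toPlatform(String text) {
        String nlf = System.getProperty("line.separator");
        return convert(text, nlf, nlf, nlf);
    }

    /** Rule 1, using the example target conventions given in the text. */
    static String toWordConventions(String text) {
        return convert(text, CRLF, VT, CRLF);
    }
}
```

Mapping to the platform newline loses the distinction between line and paragraph separators, which is why rule 1 is preferable whenever the target is known.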

Input and Output

  1. A readline function should stop at NLF, LS, FF, or PS. In the typical implementation it does not include the NLF, LS, PS, or FF that caused it to stop. Note that since the separator is lost, the use of readline is limited to text processing, where there is no difference among the flavors of separators.
  2. A writeline (or newline) function should convert NLF, LS, and PS according to the conventions above, under "Converting to other character code sets".
  3. In C, gets is defined to terminate at a newline and replaces the newline with '\0', while fgets is defined to terminate at a newline and includes the newline in the array it copies the data into. C implementations interpret '\n' either as LF or as the underlying platform newline NLF depending on where it occurs. EBCDIC C compilers substitute the relevant codes, based on the EBCDIC execution set.
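
A readline function that follows rule 1 might look like the following Java sketch. The class UnicodeLineReader and its methods are hypothetical names, NL is assumed to be 0015 as in the table above, and the lines helper exists only to make the example easy to try on a string.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class UnicodeLineReader {
    private final PushbackReader in;

    UnicodeLineReader(Reader reader) {
        // One character of pushback is enough to look past a CR for an LF.
        this.in = new PushbackReader(reader, 1);
    }

    /** True for CR, LF, NL (assumed 0015), FF, LS, and PS. */
    static boolean isTerminator(int c) {
        return c == '\r' || c == '\n' || c == 0x15
                || c == '\f' || c == 0x2028 || c == 0x2029;
    }

    /** Returns the next line without its terminator, or null at end of input. */
    String readLine() throws IOException {
        int c = in.read();
        if (c == -1) {
            return null;
        }
        StringBuilder line = new StringBuilder();
        while (c != -1 && !isTerminator(c)) {
            line.append((char) c);
            c = in.read();
        }
        if (c == '\r') {
            // Treat CRLF as one terminator; push back anything else.
            int next = in.read();
            if (next != '\n' && next != -1) {
                in.unread(next);
            }
        }
        return line.toString();
    }

    /** Convenience helper: reads all lines of an in-memory string. */
    static List<String> lines(String text) {
        UnicodeLineReader reader = new UnicodeLineReader(new StringReader(text));
        List<String> result = new ArrayList<>();
        try {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                result.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Because the terminator is discarded, a caller cannot tell afterwards whether a line ended in an NLF, LS, PS, or FF, which is the limitation noted in rule 1.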

Page Separator

FF is commonly used as a page separator, and it should be interpreted that way in text. In most parsing, or in readline, this amounts to interpreting it in the same way as a LS. When displaying on the screen, it causes the text after the separator to be forced to the next page. It should be independent of paragraph separation: a paragraph can start on one page and continue on the next page.
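
The independence of page and paragraph separation can be illustrated with a short Java sketch. The names PageSplitter, pages, and paragraphCount are hypothetical, and the paragraph count assumes that PS separates paragraphs rather than terminating them.

```java
import java.util.ArrayList;
import java.util.List;

final class PageSplitter {
    static final char FF = '\f';
    static final char PS = '\u2029';

    /** Splits text into pages at each FF; the FF itself is dropped. */
    static List<String> pages(String text) {
        List<String> pages = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == FF) {
                pages.add(text.substring(start, i));
                start = i + 1;
            }
        }
        pages.add(text.substring(start));
        return pages;
    }

    /** Counts paragraphs by PS only, so an FF never ends a paragraph. */
    static int paragraphCount(String text) {
        int count = 1;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == PS) {
                count++;
            }
        }
        return count;
    }
}
```

For a string holding one short paragraph, a PS, and a second paragraph with an FF in the middle, pages returns two pages while paragraphCount returns 2: the second paragraph starts on the first page and continues on the next.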


Copyright

Copyright © 1998-1998 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/techreports.html