DRAFT Unicode Technical Report #14

Line Breaking Properties

Revision 0.3
Authors Asmus Freytag
Date May 19, 1998
This Version http://www.unicode.org/unicode/reports/dtr14-03.html
Previous Version -none-
Latest Version http://www.unicode.org/unicode/reports/dtr14.html

Summary

This report presents the specification of line breaking properties for Unicode characters.

Status of this document

This draft is published for public review. Previous versions of this document have been considered by the Unicode Technical Committee, and it has had preliminary approval as a Draft Unicode Technical Report. The Unicode Technical Committee may approve, reject, or further amend this document before it becomes an approved Unicode Technical Report. This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to the author.

Line Breaking Property

Overview and Scope

The Unicode Standard, Version 2.0, has tended to treat the line-breaking behavior of characters as self-evident. This technical report intends to discover best practice and capture it via formally assigned line breaking properties. This version of the report assigns line-breaking properties to those characters that have a specific function in the process of line breaking. Default line-breaking properties for other classes of characters are the subject of future revisions, extensions or of additional technical reports.

Definitions

All terms not defined here shall be as defined in the Unicode Standard.

Line fitting - the process of determining how much text will fit on a line of text, given the available space between the margins and the actual display width of the text.

Overfull - a line that contains so much text that it does not fit in the space allotted, or fits only after compression of the text.

Underfull - a line that contains so little text that it ends too far from the margin, or one that would require unacceptable amounts of expansion.

Line Break - the position in the text where one line ends and the next one starts.

Line Break Opportunity - a place where a line is allowed to end. Whether a given position in the text is a valid line break opportunity depends on the line breaking rules in force, as well as on context.

Line Breaking - the process of selecting that part of a text that can be displayed on a line. In other words, selecting among several line breaking opportunities such that the resulting line is neither overfull nor underfull (unless the user requested an explicit line break).

Line Breaking Property - A character property with the following, mutually exclusive values:

Explicitly Breaking - characters with this property explicitly cause a line break.

Inseparable - characters with this property prevent a line break between pairs of such characters.

Non-breaking - characters with this property prevent line breaks before or after.

Attached - characters with this property prevent a line break between the character and the preceding character.

Contingent Break Opportunity - characters with this property provide a line break opportunity contingent on additional information.

Break Opportunity (Before/After) - characters with this property generally provide a line break opportunity before or after the character respectively.

Default - the property for all other characters.
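The property values above can be modeled as a simple enumeration. The following is only an illustrative sketch: the names LineBreakProperty, ASSIGNED and line_break_property are hypothetical, and the table holds just a handful of the characters discussed later in this report, not a complete assignment.

```python
from enum import Enum


class LineBreakProperty(Enum):
    """The mutually exclusive property values defined in this report."""
    EXPLICITLY_BREAKING = "explicitly breaking"
    INSEPARABLE = "inseparable"
    NON_BREAKING = "non-breaking"
    ATTACHED = "attached"
    CONTINGENT = "contingent break opportunity"
    BREAK_BEFORE = "break opportunity before"
    BREAK_AFTER = "break opportunity after"
    DEFAULT = "default"


# Hypothetical, deliberately partial table (one sample character per value).
ASSIGNED = {
    "\u2028": LineBreakProperty.EXPLICITLY_BREAKING,  # LINE SEPARATOR
    "\u2026": LineBreakProperty.INSEPARABLE,          # HORIZONTAL ELLIPSIS
    "\u00A0": LineBreakProperty.NON_BREAKING,         # NO-BREAK SPACE
    "\uFFFC": LineBreakProperty.CONTINGENT,           # OBJECT REPLACEMENT CHARACTER
    "\u02C8": LineBreakProperty.BREAK_BEFORE,         # MODIFIER LETTER VERTICAL LINE
    "\u0020": LineBreakProperty.BREAK_AFTER,          # SPACE
}


def line_break_property(ch):
    """Return the property of one character; unlisted characters are DEFAULT."""
    return ASSIGNED.get(ch, LineBreakProperty.DEFAULT)
```

Because the values are mutually exclusive, a single lookup per character is sufficient in this model.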

 

Description

Lines are broken as a result of explicit line breaking characters, or as a result of a formatting algorithm selecting, among the available line breaking opportunities, the one that best results in a ‘full’ but not ‘overfull’ line.

Three styles of context analysis determine line-breaking opportunities:
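The selection among break opportunities can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the algorithm of any particular implementation: the name fit_lines is hypothetical, every character is assumed to be one unit wide, the opportunities are given as offsets at which a line may end, and the special treatment of trailing spaces is ignored.

```python
def fit_lines(text, opportunities, max_width):
    """Split text greedily at the allowed offsets.

    opportunities: offsets into text at which a line is allowed to end
    (the next line then starts at that offset).
    max_width: available width, in the assumed one-unit-per-character model.
    """
    stops = sorted(set(opportunities) | {len(text)})
    lines = []
    start = 0
    while start < len(text):
        candidates = [p for p in stops if p > start]
        fitting = [p for p in candidates if p - start <= max_width]
        # Prefer the fullest line that is not overfull. If even the nearest
        # opportunity is too far away, accept an overfull line.
        end = fitting[-1] if fitting else candidates[0]
        lines.append(text[start:end])
        start = end
    return lines
```

For example, fit_lines("abcdefghij", [3, 5, 9], 4) yields the lines "abc", "de", "fghi" and "j". A production formatter would typically also weigh underfull lines, which this sketch does not attempt.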

  1. space-based
  2. anywhere, unless prohibited
  3. morphological analysis

The first is commonly used for scripts employing the space character. The second is used with East Asian ideographic scripts. The third is used for scripts such as Thai, which do not use spaces, but which restrict word-breaks to syllable boundaries, the determination of which requires knowledge of the language comparable to that required by a hyphenation algorithm.

NOTE: Korean may alternatively use space-based context analysis (style 1) instead of style 2.

Space-based line breaking is often augmented by hyphenation. Hyphenation provides additional line breaking opportunities within a word. Some Unicode characters have explicit line breaking properties assigned to them. These can be used for the first and second styles of context analysis for line break opportunities. For multilingual text, styles one and two can be unified into a single set of specifications.

NOTE: Interpretation of line breaking properties is strictly independent of formatting bi-directional text.

Specification

The following sections list Unicode characters grouped by their line breaking property and provide additional description of their line breaking behavior.

Explicitly breaking characters

Explicit breaks act independently of the surrounding characters.

PAGE SEPARATOR (FF) — U+000C

Form Feed separates a page. The text on the new page starts at the beginning of the line. No paragraph formatting is applied.

LINE SEPARATOR (LS) — U+2028

The text after the Line Separator starts at the beginning of the line. No paragraph formatting is applied.

This is similar to HTML <BR>.

PARAGRAPH SEPARATOR (PS) — U+2029

The text of the new paragraph starts at the beginning of the line. Paragraph formatting is applied. This is similar to HTML <P>.

"NEW LINE FUNCTION (NLF)"

New line functions provide additional explicit breaks. They are not individual characters, but are expressed as sequences of the Unicode equivalents of NL, LF, and CR. Which particular sequences form an NLF depends on the implementation and other circumstances, as described in Unicode Technical Report 13, Unicode Newline Policy.
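As a sketch of how explicit breaks might be located, the following splits a text at FF, LS, PS and a new line function. The set of NLF sequences is an assumption made for this example only (CR LF, CR, LF, and U+0085 as the equivalent of NL); the actual set is implementation dependent. The name split_at_explicit_breaks is hypothetical.

```python
import re

# Assumption: CR LF, CR, LF and U+0085 are treated as new line functions here.
# U+000C, U+2028 and U+2029 are the explicitly breaking characters listed above.
EXPLICIT_BREAK = re.compile("\r\n|[\r\n\u0085\u000C\u2028\u2029]")


def split_at_explicit_breaks(text):
    """Return (segment, break_sequence) pairs in text order.

    The break sequence of the final pair is '' when the text does not
    end with an explicit break.
    """
    result = []
    pos = 0
    for match in EXPLICIT_BREAK.finditer(text):
        result.append((text[pos:match.start()], match.group()))
        pos = match.end()
    if pos < len(text):
        result.append((text[pos:], ""))
    return result
```

Keeping the break sequence with each segment lets a caller decide whether paragraph formatting applies (PS) or not (LS, FF).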

Inseparable characters

These characters are intended to be used in consecutive sequence. They therefore absolutely prevent a line break between two consecutive characters of this class.

ONE DOT LEADER — U+2024

TWO DOT LEADER — U+2025

HORIZONTAL ELLIPSIS — U+2026

Horizontal ellipsis can be used as a three dot leader.

EM DASH — U+2014

This character is used to set off parenthetical text, normally without spaces. Line breaks can occur before and after an em dash, but not between two em dashes. Pairs of em dashes are often used instead of a quotation dash.
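The rule for inseparable characters reduces to a test on a pair of adjacent characters. A minimal sketch, with the hypothetical names INSEPARABLE and break_prohibited_between, and assuming that any two characters of the class (not only identical ones) are kept together:

```python
# The four inseparable characters listed in this section.
INSEPARABLE = {
    "\u2024",  # ONE DOT LEADER
    "\u2025",  # TWO DOT LEADER
    "\u2026",  # HORIZONTAL ELLIPSIS
    "\u2014",  # EM DASH
}


def break_prohibited_between(before, after):
    """True if no line break may occur between the two adjacent characters."""
    return before in INSEPARABLE and after in INSEPARABLE
```

A break before or after a single em dash remains possible in this model, since only one side of the pair is then in the class.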

 

Non-breaking or "glue" characters

The action of these characters is to glue together the left and right neighboring characters such that they are kept on the same line.

ZERO WIDTH NO-BREAK SPACE (ZWNBSP) — U+FEFF

Since this character is not visible, it is the preferred choice for keeping together characters that would otherwise be split across a line break under style 2 context analysis. In particular, surrounding SPACE with ZWNBSP prevents it from acting as a line break opportunity.

NO-BREAK SPACE (NBSP) — U+00A0

This is the preferred character to use where two words should be visually separated but kept on the same line, as in the case of a title and a name: "Dr.<NBSP>Joseph Becker".

FIGURE SPACE — U+2007

This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.

NON-BREAKING HYPHEN (NBHY) — U+2011

This is the preferred character to use where words must be hyphenated but may not be broken at the hyphen.
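The effect of glue characters on space-based break opportunities can be sketched as follows. The names GLUE and space_break_offsets are hypothetical, and the function considers only SPACE as a source of break opportunities, which is a simplification made for this example.

```python
# The glue characters listed in this section.
GLUE = {
    "\uFEFF",  # ZERO WIDTH NO-BREAK SPACE
    "\u00A0",  # NO-BREAK SPACE
    "\u2007",  # FIGURE SPACE
    "\u2011",  # NON-BREAKING HYPHEN
}


def space_break_offsets(text):
    """Offsets at which a line may end after a run of SPACE characters.

    A break after a space is suppressed when the following character is
    a glue character, and no break is reported inside a run of spaces
    or at the very end of the text.
    """
    offsets = []
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        following = text[i + 1] if i + 1 < len(text) else ""
        if following == "" or following == " " or following in GLUE:
            continue
        offsets.append(i + 1)
    return offsets
```

With the title-and-name example, "Dr.", NBSP and "Joseph" stay together and the only opportunity is after the SPACE before "Becker".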

 

Contingent break opportunity characters

Contingent Break Opportunity Before and After

OBJECT REPLACEMENT CHARACTER — U+FFFC

By default there is a break opportunity both before and after the object. Object-specific line break behavior is implemented in the object itself, and may override the default to rule out either or both of the break opportunities.

Break opportunity after characters

Breaking Spaces

SPACE (SP) — U+0020

The space characters are explicit break opportunities. The last character measured for fit is the one before the space character; any number of space characters are kept together invisibly on the previous line, and the first non-space character starts the next line.

It is sometimes convenient to use SP, but not the other breaking spaces, to override the context-based behavior of other characters under the "anywhere, unless prohibited" style of line breaking (context analysis style 2).

EN QUAD — U+2000

EM QUAD — U+2001

EN SPACE — U+2002

EM SPACE — U+2003

THREE-PER-EM SPACE — U+2004

FOUR-PER-EM SPACE — U+2005

SIX-PER-EM SPACE — U+2006

PUNCTUATION SPACE — U+2008

THIN SPACE — U+2009

HAIR SPACE — U+200A

The characters in the preceding list each have a specific width, but otherwise behave as breaking spaces.
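The measuring rule for breaking spaces (spaces at the end of a line are kept on that line but not measured) can be sketched as below. The names BREAKING_SPACES, measured_part and measured_width are hypothetical, and the width model of one unit per character is an assumption for illustration; real widths depend on the font.

```python
# SPACE plus the fixed-width breaking spaces listed above.
# FIGURE SPACE (U+2007) is deliberately absent: this report treats it as glue.
BREAKING_SPACES = (
    "\u0020\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2008\u2009\u200A"
)


def measured_part(line):
    """Return the part of a line that is measured for fit.

    Trailing breaking spaces stay on the line but are not measured,
    so they are stripped before any width is computed.
    """
    return line.rstrip(BREAKING_SPACES)


def measured_width(line):
    """Width of the measured part, assuming one unit per character."""
    return len(measured_part(line))
```

Spaces inside the line still count toward the width; only the trailing run is excluded.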

ZERO WIDTH SPACE (ZWSP) — U+200B

This character does not have width. It is used in a style 2 context analysis to provide additional (invisible) break opportunities.

IDEOGRAPHIC SPACE — U+3000

This character has the width of an ideograph but, like ZWSP, is fully subject to the style 2 context analysis.


Tabs

Except for the effect of the location of the tabstops, the tab character acts similarly to a space for the purpose of line breaking.

TAB — U+0009

 

Breaking Hyphens

Breaking hyphens establish explicit break opportunities immediately after each occurrence.

There are three types of hyphens: explicit hyphens, conditional hyphens, and dictionary-inserted hyphens (as a result of a hyphenation process). There is no character code for the third kind of hyphen; therefore, if the distinction is needed, the dictionary-inserted hyphens must be represented out of band, or with a privately assigned control code.

HYPHEN — U+2010

ARMENIAN HYPHEN — U+058A

Hyphens are graphic characters with width. Since, unlike spaces, they print, they are included in the measured part of the preceding line.

HYPHEN-MINUS — U+002D

Some additional context analysis is required to distinguish the use of this character as a hyphen from its use as a minus sign (or as an indicator of a numerical range). If used as a hyphen, it acts like HYPHEN.

NOTE: In some practice, runs of HYPHEN-MINUS are used to stand in for longer dashes or horizontal rules. If it is desired to treat them like the characters or layout elements they stand for, and actual character code conversion is not performed, line breaking will need to support these special cases explicitly.

SOFT HYPHEN (SHY) — U+00AD

SHY is rendered invisibly and has no width, EXCEPT at a line break. Some languages require a change in spelling surrounding an optional hyphen. The action of a hyphenation algorithm is equivalent to the insertion of a SHY. However, when a word contains an explicit SHY it is customarily treated as overriding the action of the hyphenator for that word.
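The rendering of SHY can be sketched as follows. This is a simplified illustration: the name render_line is hypothetical, the visible hyphen is assumed to be U+2010 HYPHEN, and the spelling changes that some languages require around an optional hyphen are not modeled.

```python
SHY = "\u00AD"     # SOFT HYPHEN
HYPHEN = "\u2010"  # assumed glyph for a soft hyphen shown at a line break


def render_line(text, start, end):
    """Render text[start:end] as one displayed line.

    `end` is the offset at which the line breaks. A SHY inside the line
    is invisible; a SHY immediately before the break shows as a hyphen.
    """
    line = text[start:end]
    broke_at_shy = line.endswith(SHY)
    visible = line.replace(SHY, "")
    return visible + HYPHEN if broke_at_shy else visible
```

When the whole word fits on one line, every SHY disappears and the word is displayed unhyphenated.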

 

Attached characters

Combining characters

Combining character sequences are treated as units for the purposes of line breaking. The line-breaking behavior of the sequence is that of the base character.

NOTE: If SPACE is used to show combining characters in isolation and the line is broken after the space character, the next line would start with the combining characters. In this case they are rendered as if they followed a space. As a result, it is always possible to maintain the correct rendering for combining character sequences and still process space characters in an optimized way.
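Treating combining character sequences as units can be sketched with Python's unicodedata module. This is an approximation made for illustration: a character is taken to be combining when its general category is Mn, Mc or Me, and the name combining_sequences is hypothetical.

```python
import unicodedata


def combining_sequences(text):
    """Group text into units of a base character plus its combining marks.

    A combining mark with no preceding character forms a unit on its own.
    """
    units = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        if units and is_mark:
            units[-1] += ch  # attach to the preceding unit, never break before
        else:
            units.append(ch)
    return units
```

Line break opportunities are then evaluated between units, using the behavior of each unit's base character. A SPACE followed by a combining mark forms one unit here, which keeps the isolated mark with its space.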

 

Default: All other characters

The line-breaking behavior of all other characters is context dependent and subject to script or language-based conventions and, sometimes, additional choices by the user. Specific default property assignments for these characters are the subject of future editions of this or other technical reports.

 

Additional information

Dictionary usage

Dictionaries follow strict standards that guide their use of characters to indicate features of the terms listed. Some of these conventions mark places that can also serve as line breaking opportunities; because they interact with line breaking, they are described here.

<note for reviewers: the input set for this section was limited. Additional information, with examples, welcome. >

Break opportunities after characters

HYPHENATION POINT — U+2027

Hyphenation point is primarily used to visibly indicate syllabification of words. Syllable breaks are potential line breaking opportunities in the middle of words. The hyphenation point is mainly used in dictionaries and similar works. When an actual line breaking opportunity falls inside a word containing hyphenation point characters, the hyphenation point is rendered as a regular hyphen at the end of the line.

ACUTE ACCENT — U+00B4

In dictionaries, stressed syllables are indicated with a spacing acute accent instead of the hyphenation point. In this case the accent would move to the next line, and the preceding line would end with a hyphen. [Confirm]

VERTICAL BAR — U+007C

In some dictionaries, a vertical bar is used instead of a hyphenation point. In this usage, U+0323 COMBINING DOT BELOW is used to mark stressed syllables, so all breaks are marked by the vertical bar. For an actual break opportunity, the vertical bar is rendered as a hyphen.
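The rendering of a syllable marker at a line break can be sketched as below for the hyphenation point and the vertical bar. The name render_dictionary_line is hypothetical, the visible hyphen is assumed to be U+2010 HYPHEN, and the acute accent convention is not covered because its behavior is still marked for confirmation.

```python
HYPHENATION_POINT = "\u2027"
HYPHEN = "\u2010"  # assumed glyph for the hyphen shown at the line end


def render_dictionary_line(text, start, end, marker=HYPHENATION_POINT):
    """Render text[start:end] as one displayed line.

    Unlike SHY, the marker stays visible inside a line. Only a marker
    immediately before the break is replaced by a hyphen.
    """
    line = text[start:end]
    if line.endswith(marker):
        return line[:-len(marker)] + HYPHEN
    return line
```

Passing marker="|" gives the same behavior for dictionaries that use the vertical bar.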

 

Break opportunities before characters

MODIFIER LETTER VERTICAL LINE — U+02C8

MODIFIER LETTER LOW VERTICAL LINE — U+02CC

These characters are used in dictionaries to indicate stress and secondary stress when IPA is used. Both are prefixes to the stressed syllable in IPA. Therefore, the only sensible way to break them is to keep them with the syllable. The line breaker should break *before* them.
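Breaking before the stress marks can be sketched as a scan for their positions. The names STRESS_MARKS and break_offsets_before_stress are hypothetical, and the example considers these two characters in isolation from all other line breaking rules.

```python
STRESS_MARKS = {
    "\u02C8",  # MODIFIER LETTER VERTICAL LINE (primary stress)
    "\u02CC",  # MODIFIER LETTER LOW VERTICAL LINE (secondary stress)
}


def break_offsets_before_stress(text):
    """Offsets at which a line may end just before a stress mark.

    The mark itself starts the next line, so it stays with the syllable
    it prefixes. A mark at offset 0 yields no opportunity.
    """
    return [i for i, ch in enumerate(text) if ch in STRESS_MARKS and i > 0]
```

Each returned offset is the index of the stress mark, so a line ending there leaves the mark at the start of the following line.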

NOTE: It is hard to find actual examples in most dictionaries, since the pronunciation fields usually occur right after the head word, and the columns are wide enough to prevent line breaks in the pronunciations.

 


Acknowledgements

The initial assignments of properties are based on input by Michel Suignard. Ken Whistler, Mark Davis, Rick McGowan and the other members of the editorial committee provided valuable feedback.

Changes from previous revisions:

First draft technical report version. Some formatting to fit the template. Many minor updates for clarity. Extended set of characters for dictionary usage.

Copyright

Copyright 1998-1998 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/reports/techreports.html