DRAFT Unicode Technical Report #14

Line Breaking Properties

Revision

4 (as presented at IUC 13)

Authors

Asmus Freytag

Date

July 30, 1998

This Version

http://www.unicode.org/unicode/reports/tr14-4

Previous Version

http://www.unicode.org/unicode/reports/dtr14-03.html

Latest Version

http://www.unicode.org/unicode/reports/tr14

Summary

This report presents the specification of line breaking properties for Unicode characters.

Status of this document

Previous versions of this document have been considered by the Unicode Technical Committee, and it has had preliminary approval as a Draft Unicode Technical Report. The Unicode Technical Committee may approve, reject, or further amend this document before it becomes an approved Unicode Technical Report. This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to the author.

Line Breaking Properties

1.0 Overview and Scope

The Unicode Standard, Version 2.0, has tended to treat the line-breaking behavior of characters as self-evident. This technical report intends to discover best practice and capture it via formally assigned line breaking properties. This version of the report assigns normative line-breaking properties to those characters that have a specific function in the process of line breaking. Default, informative line-breaking properties for all other classes of characters are supplied as well.

2.0 Definitions

All terms not defined here shall be as defined in the Unicode Standard.

Line fitting - the process of determining how much text will fit on a line of text, given the available space between the margins and the actual display width of the text.

Overfull - a line that contains so much text that it does not fit in the space allotted, or only after unacceptable compression of the text.

Underfull - a line that contains so little text that it ends too far from the margin, or one that would require unacceptable amounts of expansion.

Line Break - the position in the text where one line ends and the next one starts.

Line Break Opportunity - a place where a line is allowed to end. Whether a given position in the text is a valid line break opportunity depends on the line breaking rules in force, as well as on context.

Line Breaking - the process of selecting that part of a text that can be displayed on a line. In other words, selecting one among several line breaking opportunities such that the resulting line is optimal (unless the user requested an explicit line break). Usually 'optimal' simply means that the resulting line is neither overfull nor underfull. High end implementations (TEX is a well known example) may consider more than one line in making the line breaking decision.

Line Breaking Property - A character property with the following, mutually exclusive values:

Explicitly Breaking - characters with this property explicitly cause a line break.

Attached - characters with this property prevent a line break between the character and the preceding character.

Inseparable - characters with this property prevent a line break between pairs of such characters.

Non-breaking - characters with this property prevent line breaks before or after.

Contingent Break Opportunity - characters with this property provide a line break opportunity contingent on additional information.

Break Opportunity (Before/After) - characters with this property generally provide a line break opportunity before or after the character respectively.

Non Starter - characters with this property prevent a line break before, even across a space

Opening - characters with this property prevent a line break after, even across a space

Closing - characters with this property prevent a line break before, even across a space

Paired - characters with this property act like they are both opening and closing

Exclamation - characters with this property prevent a line break before

Atomic - atomic characters break before or after if paired with atomic

Numeric - numeric characters form numeric expressions for line breaking purposes

Ordinary - alphabetic characters

Complex context - characters with this property provide a line break opportunity contingent on additional, language specific context analysis

Prefix - characters with this property don't break in front of a numeric expression

Postfix - characters with this property don’t break following a numeric expression

Default - the property for all other characters

3.0 Description

Lines are broken as the result of one of two conditions: either an explicit line breaking character is present, or a formatting algorithm selects, from among the available line breaking opportunities, the one that results in the optimal layout of the text. Simple implementations consider one line at a time, trying to avoid an underfull or overfull line. Algorithms also exist that take into account the interaction of line breaking decisions for the whole paragraph. For the purpose of this document, what matters is not so much what defines the optimal amount of text on the line, but how line breaking opportunities are defined.
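As a rough illustration of the simple, one-line-at-a-time approach, the following Python sketch keeps the last break opportunity that still fits on the line. The function name `greedy_break`, the width of one unit per character, and the representation of opportunities as a sorted list of offsets are all assumptions made for this example; they are not part of this report.

```python
def greedy_break(text, opportunities, max_width):
    """Split text into lines, breaking only at the given offsets.

    opportunities: sorted offsets i meaning "a line may end before text[i]".
    Width is assumed to be one unit per character, and trailing spaces
    are not measured for fit.
    """
    lines = []
    start = 0
    candidates = list(opportunities) + [len(text)]
    while start < len(text):
        best = None
        for pos in candidates:
            if pos <= start:
                continue
            visible = text[start:pos].rstrip(" ")
            if len(visible) <= max_width:
                best = pos  # remember the last opportunity that fits
            else:
                break
        if best is None:
            # Nothing fits: accept an overfull line up to the next opportunity.
            best = next(p for p in candidates if p > start)
        lines.append(text[start:best])
        start = best
    return lines
```

For example, `greedy_break("aa bb cc dd", [3, 6, 9], 5)` yields two lines, with the space after "bb" kept on the first line.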

Three styles of context analysis determine line-breaking opportunities:

1. space-based

2. anywhere, unless prohibited

3. morphological analysis

The first is commonly used for scripts employing the space character. The second is used with East Asian ideographic scripts. The third is used for scripts such as Thai, which do not use spaces, but which restrict word-breaks to syllable boundaries, the determination of which requires knowledge of the language comparable to that required by a hyphenation algorithm.

NOTE: Korean may alternatively use space-based (style 1) context analysis instead of style 2.

Space-based line breaking is often augmented by hyphenation. Hyphenation provides additional line breaking opportunities within a word. Some Unicode characters have explicit line breaking properties assigned to them. These can be used for the first and second type context analysis for line break opportunities. For multilingual text, styles one and two can be unified into a single set of specifications.

NOTE: Interpretation of line breaking properties is strictly independent of formatting bi-directional text.

4.0 Conformance

The line breaking properties are informative, except for the following small subset

5.0 Specification

The following sections list Unicode characters grouped by their line breaking property and provide additional description of their line breaking behavior. Where line breaking properties are mutually exclusive of each other, the earlier one in the list applies. For example, an explicitly breaking character provides an unconditional line break even when following a 'no-break' character.

Each section is marked with an annotation for easy reference:

A - the property introduces a break opportunity after in all or some contexts

XA - the property prevents a break opportunity after in all or some contexts

B - the property introduces a break opportunity before in all or some contexts

XB - the property prevents a break opportunity before in all or some contexts

P - the property introduces a break opportunity for a pair of same characters

XP - the property prevents a break opportunity

5.1 Explicitly breaking characters (A)

Explicit breaks act independently of the surrounding characters.

PAGE SEPARATOR (FF) — U+000C

Form Feed separates a page. The text on the new page starts at the beginning of the line. No paragraph formatting is applied.

LINE SEPARATOR (LS) — U+2028

The text after the Line Separator starts at the beginning of the line. No paragraph formatting is applied.

This is similar to HTML <BR>.

PARAGRAPH SEPARATOR (PS) — U+2029

The text of the new paragraph starts at the beginning of the line. Paragraph formatting is applied. This is similar to HTML <P>.

"NEW LINE FUNCTION (NLF)"

New line functions provide additional explicit breaks. They are not individual characters, but are expressed as sequences of the Unicode equivalents of NL, LF, and CR. Which particular sequence or sequences form an NLF depends on the implementation and other circumstances, as described in Unicode Technical Report 13, Unicode Newline Policy.
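The following Python sketch shows one way to split text at explicit breaks. The set of sequences treated as an NLF here (CR LF as a single unit, then CR, LF, and NEL U+0085) is an assumption for illustration, since the actual set depends on the implementation; FF, LS, and PS are included because this section lists them as explicitly breaking. The name `split_at_explicit_breaks` is hypothetical.

```python
import re

# Assumed explicit breaks: CR LF as one unit, then CR, LF, NEL (U+0085),
# plus FF (U+000C), LS (U+2028) and PS (U+2029) from this section.
EXPLICIT_BREAK = re.compile("\r\n|[\r\n\x85\x0c\u2028\u2029]")

def split_at_explicit_breaks(text):
    """Return the pieces of text between explicit breaks (breaks removed)."""
    return EXPLICIT_BREAK.split(text)
```

Note that this sketch discards the distinction between line and paragraph separators; a formatter that applies paragraph formatting would need to keep it.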

5.2 Attached characters (XB)

Combining characters

Combining character sequences are treated as units for the purposes of line breaking. The line-breaking behavior of the sequence is that of the base character.

NOTE: If SPACE is used to show combining characters in isolation and the line is broken after the space character, the next line would start with the combining characters. In this case they are rendered as if they followed a space. As a result, it is always possible to maintain the correct rendering for combining character sequences and still process space characters in an optimized way.

5.3 Non-breaking or "glue" characters (XB/XA)

The action of these characters is to glue together both the left and right neighbor characters such that they are kept on the same line. If they follow a space character, they still allow a break.

ZERO WIDTH NO-BREAK SPACE (ZWNBSP) — U+FEFF

Since this character is not visible, it is the preferred choice for keeping characters together that would otherwise be split across the line break under a style 2 line break. In particular, surrounding SPACE with ZWNBSP prevents it from acting as a line break opportunity.

NO BREAK SPACE (NBSP) — U+00A0

This is the preferred character to use where two words should be visually separated but kept on the same line, as in the case of a title and a name "Dr.<NBSP>Joseph Becker".

FIGURE SPACE — U+2007

This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.

NON-BREAKING HYPHEN (NBHY) — U+2011

This is the preferred character to use where words must be hyphenated but may not be broken at the hyphen.

5.4 Contingent break opportunity characters (B/A)

Contingent Break Opportunity Before and After

OBJECT REPLACEMENT CHARACTER — U+FFFC

By default there is a break opportunity both before and after the object. Object-specific line break behavior is implemented in the object itself, and may override the default to rule out either or both of the break opportunities.

5.5 Inseparable characters (XP)

Leaders

These characters are intended to be used in consecutive sequence. They therefore absolutely prevent a line break between two consecutive characters of this class.

ONE DOT LEADER — U+2024

TWO DOT LEADER — U+2025

HORIZONTAL ELLIPSIS — U+2026

Horizontal ellipsis can be used as a three dot leader.

Em Dash

EM DASH — U+2014

This character is used to set off parenthetical text, normally without spaces. Line breaks can occur before and after an em dash, but not between two em dashes. Pairs of em dashes are often used instead of a quotation dash.

5.6 Break opportunity after characters (A)

Breaking Spaces

SPACE (SP) — U+0020

The space characters are explicit break opportunities, but spaces at the end of a line are not measured for fit. If there is a sequence of space characters, and breaking after any of the space characters would result in the same visible line, the line breaking position after the last space character in the sequence is the locally most optimal one. In other words, since the last character measured for fit is BEFORE the space character, any number of space characters are kept together invisibly on the previous line and the first non-space character starts the next line.
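A small Python sketch of this behavior follows. The helper names `break_after_space_run` and `measured_length` are hypothetical, and only U+0020 is treated as a space here; the other breaking spaces are left out for brevity.

```python
def break_after_space_run(text, i):
    """Given i pointing at a SPACE, return the offset just past the whole run."""
    while i < len(text) and text[i] == " ":
        i += 1
    return i

def measured_length(line):
    """Length used for line fitting: trailing spaces are not measured."""
    return len(line.rstrip(" "))
```

With the text "ab   cd", breaking at the run that starts at offset 2 gives the break position 5, so all three spaces stay on the first line while only "ab" is measured for fit.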

It is sometimes convenient to use SP, but not the other breaking spaces, to override the context-based behavior of other characters under the "anywhere, except where prohibited" style of line breaking (context analysis style 2).

EN QUAD — U+2000
EM QUAD — U+2001
EN SPACE — U+2002
EM SPACE — U+2003
THREE-PER-EM SPACE — U+2004
FOUR-PER-EM SPACE — U+2005
SIX-PER-EM SPACE — U+2006
PUNCTUATION SPACE — U+2008
THIN SPACE — U+2009
HAIR SPACE — U+200A

The preceding list of characters all have a specific width, but otherwise behave as breaking spaces.

ZERO WIDTH SPACE (ZWSP) — U+200B

This character does not have width. It is used in a style 2 context analysis to provide additional (invisible) break opportunities.

IDEOGRAPHIC SPACE — U+3000

This character has the width of an ideograph but like ZWSP is fully subject to the style 2 context analysis.

Tabs

Except for the effect of the location of the tabstops, the tab character acts similarly to a space for the purpose of line breaking.

TAB — U+0009

Breaking Hyphens

Breaking hyphens establish explicit break opportunities immediately after each occurrence.

There are three types of hyphens: Explicit hyphens, conditional hyphens, and dictionary-inserted hyphens (as a result of a hyphenation process). There is no character code for the third kind of hyphen; therefore if it is desired to make the distinction, the dictionary-inserted hyphens must be represented out of band, or with a privately assigned control code.

HYPHEN — U+2010
ARMENIAN HYPHEN — U+058A

Hyphens are graphic characters with width. Since, unlike spaces, they print, they are included in the measured part of the preceding line.

HYPHEN-MINUS — U+002D

Some additional context analysis is required to distinguish usage of this character as a hyphen from the use as minus sign (or indicator of numerical range). If used as hyphen, it acts like HYPHEN.

NOTE: In some practice, runs of HYPHEN-MINUS are used to stand in for longer dashes or horizontal rules. If it is desired to treat them like the characters or layout elements they stand for, and actual character code conversion is not performed, line breaking will need to support these special cases explicitly.
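This report does not prescribe the context analysis needed for HYPHEN-MINUS. Purely as an assumed heuristic for illustration, the following Python sketch treats the character as a hyphen only when it sits directly between two letters, and as a minus sign or range indicator otherwise. The function name is hypothetical, and real text will contain cases this test gets wrong.

```python
def hyphen_minus_breaks_after(text, i):
    """Assumed heuristic: text[i] is U+002D HYPHEN-MINUS.

    Return True if it looks like a hyphen between letters (break
    opportunity after it), False if it looks like a minus sign or
    part of a numeric range.
    """
    prev_ch = text[i - 1] if i > 0 else ""
    next_ch = text[i + 1] if i + 1 < len(text) else ""
    return prev_ch.isalpha() and next_ch.isalpha()
```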

SOFT HYPHEN (SHY) — U+00AD

SHY is rendered invisibly and has no width, EXCEPT at a line break. Some languages require a change in spelling surrounding an optional hyphen. The action of a hyphenation algorithm is equivalent to the insertion of a SHY. However, when a word contains an explicit SHY it is customarily treated as overriding the action of the hyphenator for that word.
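The rendering rule for SHY can be sketched in Python as follows. The function name `render_line` is hypothetical, the visible hyphen is assumed to be U+002D, and the spelling changes that some languages require around an optional hyphen are not handled.

```python
SHY = "\u00ad"

def render_line(text, start, end):
    """Render text[start:end] as one line.

    A SOFT HYPHEN is dropped unless it is the last character of the
    line, in which case it is shown as a visible hyphen.
    """
    segment = text[start:end]
    ends_with_shy = segment.endswith(SHY)
    visible = segment.replace(SHY, "")
    return (visible + "-") if ends_with_shy else visible
```

For a word stored as "hy", SHY, "phen", SHY, "ation", rendering the whole word gives "hyphenation", while breaking after the first SHY gives "hy-" on one line and "phenation" on the next.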

5.7 Opening characters (XA)

The opening character of any set of paired punctuation must be kept with the following character

Characters of general category Ps in the Unicode Character Database.

5.8 Closing characters (XB)

The closing character of any set of paired punctuation must be kept with the preceding character

Characters of general category Pe in the Unicode Character Database.

5.18 Syntax characters (A)

URLs are now common enough in regular plain text that they must be taken into account when assigning general purpose line breaking properties.

SOLIDUS — U+002F

Slash (SOLIDUS) is allowed as an additional, limited break opportunity to improve the layout of web addresses.

5.9 Paired characters (XB/XA)

Some paired characters can be either opening or closing depending on usage. The default is to treat them as both opening and closing.

Characters of general category Pf or Pi in the Unicode Character Database.

5.10 Non-starters (XB)

Some characters cannot start a line, unless they are following a space

HIRAGANA AND KATAKANA SOUND MARKS
HIRAGANA AND KATAKANA SMALL CHARACTERS
KATAKANA MIDDLE DOTS
IDEOGRAPHIC ITERATION MARK — U+3005
DOUBLE EXCLAMATION MARK — U+203C
FRACTION SLASH — U+2044
WAVE DASH — U+301C

PERIODS

5.11 Exclamation / Interrogation (XB)

EXCLAMATION MARK — U+0021
QUESTION MARK — U+003F
FULLWIDTH EXCLAMATION MARK — U+FF01
FULLWIDTH QUESTION MARK — U+FF1F
SMALL QUESTION MARK — U+FE56
SMALL EXCLAMATION MARK — U+FE57

These behave like Closing characters, except in relation to postfix characters

5.13 Atomic characters (B/A)

Atomic characters do not require other characters to provide break opportunities; a break can ordinarily occur between any pair of them.

UNIFIED HAN
HANGUL SYLLABLES
HIRAGANA
KATAKANA

(The Yi script, once encoded, would get this property as well).

5.14 Prefix characters (XA)

Characters that usually precede a numerical expression may not be separated from the following numeric characters, even if a space character intervenes.

DOLLAR SIGN — U+0024
REVERSE SOLIDUS — U+005C
POUND SIGN — U+00A3
YEN SIGN — U+00A5
BAHT SIGN — U+0E3F
COLON SIGN — U+20A1
CRUZEIRO SIGN — U+20A2
LIRA SIGN — U+20A4
WON SIGN — U+20A9
NEW SHEQEL SIGN — U+20AA
DONG SIGN — U+20AB
EURO SIGN — U+20AC
NUMERO SIGN — U+2116
SMALL DOLLAR SIGN — U+FE69
FULLWIDTH DOLLAR SIGN — U+FF04
FULLWIDTH POUND SIGN — U+FFE1
FULLWIDTH YEN SIGN — U+FFE5
FULLWIDTH WON SIGN — U+FFE6
all other currency symbols, unless they are postfix

5.15 Postfix characters (XB)

Characters that usually follow a numerical expression may not be separated from the preceding numeric characters or preceding closing characters, even if a space character intervenes.

PERCENT SIGN — U+0025
CENT SIGN — U+00A2
DEGREE SIGN — U+00B0
PER MILL SIGN — U+2030
PER TEN THOUSAND SIGN — U+2031
PRIME — U+2032
DOUBLE PRIME — U+2033
TRIPLE PRIME — U+2034
REVERSED PRIME — U+2035
REVERSED DOUBLE PRIME — U+2036
REVERSED TRIPLE PRIME — U+2037
PESETA SIGN — U+20A7
DEGREE CELSIUS — U+2103
DEGREE FAHRENHEIT — U+2109
OHM SIGN — U+2126
SMALL PERCENT SIGN — U+FE6A
FULLWIDTH PERCENT SIGN — U+FF05
FULLWIDTH CENT SIGN — U+FFE0
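The effect of prefix and postfix characters across spaces can be sketched in Python as follows. The two sets contain only a small subset of the characters listed in sections 5.14 and 5.15, the function names are hypothetical, digits are detected with `str.isdigit` as a stand-in for the numeric class, and the attachment of postfix characters to preceding closing characters is omitted.

```python
PREFIX = {"$", "\u00a3", "\u00a5", "\u20ac"}   # small subset of 5.14
POSTFIX = {"%", "\u00a2", "\u00b0", "\u2030"}  # small subset of 5.15

def keeps_together(before, after):
    """True if no break is allowed between the two characters,
    even when spaces separate them in the text."""
    if before in PREFIX and after.isdigit():
        return True
    if before.isdigit() and after in POSTFIX:
        return True
    return False

def break_allowed_in_space_run(text, i):
    """text[i] is a SPACE: may a line end after the run of spaces around it?"""
    left = i
    while left > 0 and text[left] == " ":
        left -= 1
    right = i
    while right < len(text) and text[right] == " ":
        right += 1
    if right == len(text) or text[left] == " ":
        return True  # the run touches the start or end of the text
    return not keeps_together(text[left], text[right])
```

Under these assumptions the space in "$ 100" and the space in "100 %" do not provide break opportunities, while the space in "cost 100" does.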

5.16 Numeric characters (XP)

Numeric characters behave like ordinary characters in the context of ordinary characters, but they activate the prefix and postfix behavior of prefix and postfix characters.

Numeric separator characters

COMMA

FULL STOP

5.17 Complex-context dependent characters (P)

Runs of these characters require morphological analysis to determine break opportunities. This is similar to e.g. a hyphenation algorithm. However, hyphenation mainly improves the layout, especially of narrow columns, and is therefore optional. For the characters that have this property, no linebreaks will be found otherwise, therefore complex context analysis is mandatory.

THAI
LAO

5.18 Ordinary characters (XP)

Ordinary characters require other characters to provide break opportunities; otherwise there is no break between pairs of ordinary characters. However, this is tailorable.

6.0 Additional information

Dictionary usage

Dictionaries follow strict standards that guide their use of characters to indicate features of the terms listed. Some of these conventions mark places that can also serve as line breaking opportunities and therefore interact with line breaking and are described here. If implemented, these characters would be inserted in the corresponding property above.

6.1 Non-breaking or "glue" characters (XA/XB)

Some dictionaries use a character that looks like a vertical series of four dots to indicate places where there is a syllable, but no break. This character has not been encoded in Unicode.

6.2 Break opportunities after characters (A)

HYPHENATION POINT — U+2027

Hyphenation point is primarily used to visibly indicate syllabification of words. Syllable breaks are potential line breaking opportunities in the middle of words. The hyphenation point is mainly used in dictionaries and similar works. When an actual line breaking opportunity falls inside a word containing hyphenation point characters, the hyphenation point is rendered as a regular hyphen at the end of the line.

ACUTE ACCENT — U+00B4

In dictionaries, stressed syllables are indicated with a spacing acute accent instead of the hyphenation point. In this case the accent would move to the next line, and the preceding line would end with a hyphen. [Confirm]

VERTICAL BAR — U+007C

In some dictionaries, a vertical bar is used instead of a hyphenation point. In this usage, U+0323 COMBINING DOT BELOW is used to mark stressed syllables, so all breaks are marked by the vertical bar. For an actual break opportunity, the vertical bar is rendered as a hyphen.

6.3 Break opportunities before characters (B)

MODIFIER LETTER VERTICAL LINE — U+02C8
MODIFIER LETTER LOW VERTICAL LINE — U+02CC

These characters are used in dictionaries to indicate stress and secondary stress when IPA is used. Both are prefixes to the stressed syllable in IPA. Therefore, the only sensible way to break them is to keep them with the syllable. The line breaker should break *before* them.

NOTE: It is hard to find actual examples in most dictionaries, since the pronunciation fields usually occur right after the head word, and the columns are wide enough to prevent line breaks in the pronunciations.


7.0 Implementation notes

The Unicode Standard, Version 2.0, describes a particular method for boundary detection in Chapter 5. It is based on a set of hierarchical rules and character classifications. That algorithm would be well suited for implementation of some of the advanced heuristics.

A simpler algorithm can be devised that uses a two dimensional table to resolve break opportunities between pairs of characters.

7.1 Rule based algorithm

Recall the line boundary rules from page 5-23 of the Unicode Standard.

(1) Ά ‡

(2) Sp Nsm* ‡ ¬(Sp | Nsm | Nb)

(3) Ideo Nsm* ‡ ¬Close

(4) ¬Open Nsm* ‡ Ideo

Here Sp is the space character, Ideo approximates characters with atomic line break property, Nsm are non-spacing marks, Nb are non-breaking characters, and Open and Close are the sets of characters with the similarly named properties introduced here. As it stands, the set of rules is not sufficient to cover the full set of line breaking properties, but it should be possible to extend it, using the syntax from page 5-19 of the Unicode Standard.

7.2 Pair table based algorithm

A two dimensional table can be used to resolve break opportunities between pairs of characters. The rows of the table are labeled by the possible values of the line breaking property of the leading character in the pair; the columns are labeled by the line breaking property of the following character in the pair. Each intersection is labeled with the resulting line breaking opportunity.

The Japanese standard JIS X 4051-1995 provides an example of such a table-based definition. However, it uses line breaking classes whose membership is not solely determined by line breaking property (as in this report), but in some cases by heuristic analysis or markup of the text.
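A minimal Python sketch of a pair table lookup follows. The four classes (OP opening, CL closing, AT atomic, AL ordinary), the table values, and the toy `classify` helper are simplifications assumed for illustration; they are not the property assignments made by this report, and spaces are not handled.

```python
# PAIR_TABLE[leading][following] is True where a break may occur
# between the two characters. Values are illustrative only.
PAIR_TABLE = {
    "OP": {"OP": False, "CL": False, "AT": False, "AL": False},
    "CL": {"OP": False, "CL": False, "AT": True,  "AL": False},
    "AT": {"OP": True,  "CL": False, "AT": True,  "AL": True},
    "AL": {"OP": False, "CL": False, "AT": True,  "AL": False},
}

def classify(ch):
    """Toy classifier; a real one would consult per-character property data."""
    if ch in "([{\u300c":
        return "OP"
    if ch in ")]}\u300d":
        return "CL"
    if "\u4e00" <= ch <= "\u9fff":  # a range of unified Han ideographs
        return "AT"
    return "AL"

def break_opportunities(text):
    """Offsets i such that a line may end between text[i-1] and text[i]."""
    classes = [classify(ch) for ch in text]
    return [i for i in range(1, len(text))
            if PAIR_TABLE[classes[i - 1]][classes[i]]]
```

Because each decision is a single table lookup on two class values, the cost per character pair is constant, which is the main attraction of this approach over evaluating a list of rules.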

7.3 Minimal table

If two rows of the table have identical values and the corresponding columns also have identical values, the two line breaking classes can be coalesced. The JIS standard uses 20 classes of which only 14 appear to be unique.
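The coalescing test can be sketched in Python as follows. The table is assumed to be a dictionary of dictionaries indexed as `table[leading][following]`, and the function name `coalesce_classes` is hypothetical.

```python
def coalesce_classes(table):
    """Group classes whose rows and columns in the pair table are identical.

    table[a][b] is the action for leading class a and following class b.
    Returns a list of groups (lists of class names) in first-seen order.
    """
    names = list(table)
    groups = {}
    for name in names:
        row = tuple(table[name][other] for other in names)
        column = tuple(table[other][name] for other in names)
        groups.setdefault((row, column), []).append(name)
    return list(groups.values())
```

Each returned group can be replaced by a single class, which shrinks both dimensions of the table.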

7.4 Extended context

By broadening the definition of pair from AB to ASp*B where A and B are characters and Sp* is a run of space characters, the same table can be used to handle cases where SPACE cannot provide a line breaking opportunity.
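One way to realize the broadened pair definition is sketched below in Python. The three action names ("direct", "indirect", "never") are assumptions introduced for this example, as are the function name and the convention that the caller supplies the classifier and the table; only U+0020 is treated as Sp.

```python
def break_opportunities_with_spaces(text, classify, table):
    """Resolve breaks for pairs A Sp* B using a single table.

    table[a][b] holds one of three assumed actions:
      "direct"   - break allowed between A and B, with or without spaces
      "indirect" - break allowed only if at least one space intervenes
      "never"    - no break, even across spaces
    Returns the offsets at which a line may end (after any run of spaces).
    """
    result = []
    n = len(text)
    i = 0
    while i < n and text[i] == " ":  # skip leading spaces
        i += 1
    while i < n:
        j = i + 1
        while j < n and text[j] == " ":  # skip the run Sp*
            j += 1
        if j == n:
            break
        action = table[classify(text[i])][classify(text[j])]
        has_space = j > i + 1
        if action == "direct" or (action == "indirect" and has_space):
            result.append(j)
        i = j
    return result
```

With a table in which a prefix class followed by a numeric class is marked "never" and everything else "indirect", the text "pay $ 5 now" breaks before "$" and before "now", but not between "$" and "5".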

7.5 Customization

A real world line breaking algorithm must be tailorable to some degree. There are three principal ways of tailoring a table based algorithm:

1. Use a different line breaking table

2. Change the line breaking class assignment for some characters

3. Change the interpretation of the line breaking actions

The first method is the most obvious, but can be expensive to maintain, since there is no re-use of the unchanged behavior.

The second is useful for cases where the line breaking properties of one class of characters are occasionally lumped together with the properties of another class to achieve a less restrictive line breaking behavior.

The third method is particularly useful if the behavior can be expressed by a change at a limited number of pair intersections. These intersections can be labeled with special values that cause different actions for different customizations.

7.6 Examples of customization:

1. Make alphabetic characters act as atomic

2. Make atomic characters act as alphabetic

3. Force a keep on Kana syllables, i.e. kyu, spelled KI yu would be kept together even though KI and yu are normally atomic
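The second tailoring method, changing class assignments, can be sketched in Python as a wrapper around an existing classifier. The function name `make_classifier`, its parameters, and the class names used in the usage note are hypothetical.

```python
def make_classifier(base_classify, overrides=None, remap=None):
    """Wrap a base classifier with tailoring.

    overrides: per-character class replacements.
    remap: class-to-class replacements, e.g. {"AL": "AT"} to make
           alphabetic characters act as atomic.
    """
    overrides = overrides or {}
    remap = remap or {}

    def classify(ch):
        cls = overrides.get(ch, base_classify(ch))
        return remap.get(cls, cls)

    return classify
```

For instance, `make_classifier(base, remap={"AL": "AT"})` corresponds to the first example above, and `remap={"AT": "AL"}` to the second, without touching the pair table itself.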

7.7 Simplification

Non-spacing marks are often not handled explicitly. If they are applied to a space character, naive algorithms will move them to the next line where they would possibly overhang the beginning margin. However, displaying non-spacing marks in isolation is rare in general purpose text. Applying them to characters with atomic line breaking property is equally rare, therefore, in the common case, simplified algorithms will still break most lines correctly.

Another simple technique is to simply scan past any combining marks, while remembering the property of the last non-combining character.
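The scanning technique can be sketched in Python as follows. Combining marks are detected here by a general category beginning with "M", which is an assumption of this example; the function name and the shape of the result are hypothetical.

```python
import unicodedata

def effective_classes(text, classify):
    """Pair each character with a class and an "attached" flag.

    Combining marks inherit the class of the last non-combining
    character and are flagged True (no break before them).
    """
    result = []
    last = None
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and last is not None:
            result.append((last, True))
        else:
            last = classify(ch)
            result.append((last, False))
    return result
```

A pair table lookup can then skip every entry flagged as attached and use the remembered class on both sides of the sequence.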


8.0 Further Information

Worldwide Typography and How to Apply JIS X 4051-1995 to Unicode, Michel Suignard, Proceedings of the Twelfth International Unicode/ISO 10646 Conference, Tokyo, Japan, 1998

Report from the Trenches: Microsoft Publisher goes Unicode, Cy Cedar, David Veintimilla, Michel Suignard and Asmus Freytag, Proceedings of the Eleventh International Unicode Conference, San Jose, CA 1997

The Unicode Standard, Version 2.0, Chapter 5: Implementation Guidelines, Addison Wesley, 1996


9.0 Acknowledgements

The initial assignments of properties are based on input by Michel Suignard. Mark Davis provided algorithmic verification. Ken Whistler, Rick McGowan and other members of the editorial committee provided valuable feedback.

10.0 Changes from previous revision:

Clarified that the focus in this report is on line breaking opportunities and non-opportunities, not on the aesthetic decisions of a formatter that might make tradeoffs when selecting among breaking opportunities.

Clarified the implied precedence of applying the line breaking properties.

Clarified that the 'spaces break after the last one' behavior is implicit from the way spaces are measured and the intent of avoiding spaces at the beginning of the following line.

11.0 Copyright

Copyright © 1998-1998 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/reports