Line Breaking


Q: What is line breaking?

A: Computers need to have automated ways to determine where to break text into lines, so that text can automatically be wrapped into paragraphs. Note what happens if you change the width of this window in your browser. Parts of lines jump up or down into preceding and succeeding lines to keep the overall text within displayable margins. This happens as the result of an automatic process (an algorithm) that decides where lines should and should not break.
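As a rough illustration of such a process, here is a minimal sketch of a greedy wrapper that breaks only at spaces. It is not the Unicode Line Breaking Algorithm; the function name `greedy_wrap` and the sample text are invented for this example, and the width is measured in characters rather than in rendered pixels.

```python
def greedy_wrap(text, width):
    """Break text into lines of at most `width` characters, at spaces only.

    Assumes no single word is longer than `width`; a longer word would
    simply end up on an overlong line of its own.
    """
    lines = []
    current = ""
    for word in text.split():
        if not current:
            # First word on a line always goes in.
            current = word
        elif len(current) + 1 + len(word) <= width:
            # The word (plus one separating space) still fits.
            current += " " + word
        else:
            # The word does not fit: close this line and start a new one.
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


sample = "Parts of lines jump up or down to keep the text within margins."
narrow = greedy_wrap(sample, 20)  # like a narrow browser window
wide = greedy_wrap(sample, 40)    # like a wider browser window

for line in narrow:
    print(line)
```

Calling the function again with a different width mimics resizing the window: the same words are redistributed over more or fewer lines.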

Q: Does the Unicode Standard have a specification for how to do line breaking for Unicode text?

A: Yes. Unicode Standard Annex #14, Unicode Line Breaking Algorithm, specifies an algorithm for line breaking that is generalized to handle all Unicode characters. A related data file, LineBreak.txt in the Unicode Character Database, provides the character properties needed by that algorithm.
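The sketch below shows one way such a data file could be read. It assumes the semicolon-separated layout used by LineBreak.txt (a code point or `first..last` range, then the line breaking class, with `#` starting a comment). The three sample lines are an illustrative excerpt with comments written for this example, not a copy of the real file, and the helper names `parse_line_break_data` and `line_break_class` are hypothetical.

```python
def parse_line_break_data(lines):
    """Parse 'XXXX;CLASS' or 'XXXX..YYYY;CLASS' lines into (first, last, class)."""
    ranges = []
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not data:
            continue
        code_points, lb_class = (field.strip() for field in data.split(";"))
        if ".." in code_points:
            first, last = (int(cp, 16) for cp in code_points.split(".."))
        else:
            first = last = int(code_points, 16)
        ranges.append((first, last, lb_class))
    return ranges


def line_break_class(ranges, char, default="XX"):
    """Look up the class of one character by linear search.

    Simplification: anything not listed gets "XX" (unknown). The real
    file defines more detailed defaults for some ranges.
    """
    cp = ord(char)
    for first, last, lb_class in ranges:
        if first <= cp <= last:
            return lb_class
    return default


# Illustrative excerpt only; comments here are not those of the real file.
SAMPLE = [
    "# sample data",
    "000A;LF # line feed",
    "0020;SP # space",
    "0041..005A;AL # Latin capital letters A to Z",
]

ranges = parse_line_break_data(SAMPLE)
print(line_break_class(ranges, "A"), line_break_class(ranges, " "))
```

A production implementation would load the full file and use a faster lookup structure than a linear scan, such as a sorted range table or a trie.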

Q: To be compliant with Unicode, do I have to do everything in UAX #14?

A: No. There are many different ways to break lines of text, and the Unicode Standard does not intend to unnecessarily restrict how implementations do this. [AF]

Q: Is the Unicode Line Breaking Algorithm suitable for all scripts and languages?

A: The algorithm is a carefully designed default that works well in many ordinary situations. However, more complex tasks such as hyphenation are outside its scope. So are Southeast Asian scripts that do not use spaces between words; these need an add-on module that uses dictionaries and works somewhat like a hyphenator. In addition, many typesetting styles need specific tailoring to fully match the conventions that users expect. [AF]
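To give a feel for what a dictionary-based add-on does, here is a minimal sketch that finds break opportunities in unspaced text by greedy longest match. This is an assumption about one simple technique, not the method prescribed by UAX #14 or used by any particular implementation. The function name `find_break_opportunities` is hypothetical, and the toy dictionary uses English words run together purely for readability.

```python
def find_break_opportunities(text, dictionary):
    """Return offsets in `text` where a line break would be allowed.

    Greedy longest match: at each position, take the longest dictionary
    word that starts there. Assumes `dictionary` is a non-empty set of
    non-empty strings.
    """
    max_len = max(len(word) for word in dictionary)
    breaks = []
    pos = 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            if text[pos:pos + length] in dictionary:
                break
        else:
            # No dictionary word starts here: step over one character.
            # (A simplification; real segmenters handle unknown text better.)
            length = 1
        pos += length
        if pos < len(text):
            breaks.append(pos)  # a break is allowed after the matched word
    return breaks


toy_dictionary = {"line", "break", "breaking", "rules"}
offsets = find_break_opportunities("linebreakingrules", toy_dictionary)
print(offsets)  # [4, 12]
```

Greedy matching prefers "breaking" over "break" here, but it can choose badly when a long match leaves an unmatchable remainder, which is one reason real modules use more elaborate strategies.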

Q: So can I tailor everything?

A: No. Some characters are encoded solely or primarily for their line breaking behavior, so their interpretation must be consistent with their semantics as defined by Unicode. For that reason, a subset of the rules is specified as non-tailorable. For more information, see Section 4 of UAX #14. [AF]