Knuth–Plass line-breaking algorithm

The Knuth–Plass algorithm is a line-breaking algorithm designed for use in Donald Knuth's typesetting program TeX. It integrates the problems of text justification and hyphenation into a single algorithm by using a discrete dynamic programming method to minimize a loss function that attempts to quantify the aesthetic qualities desired in the finished output.^[1]^[2]^[3]

The algorithm works by dividing the text into a stream of three kinds of objects: boxes, which are non-resizable chunks of content, glue, which are flexible, resizeable elements, and penalties, which represent places where breaking is undesirable (or, if negative, desirable).^[2] The loss function, known as "badness", is defined in terms of the deformation of the glue elements, and any extra penalties incurred through line breaking.^[1]

Making hyphenation decisions follows naturally from the algorithm, but the choice of possible hyphenation points within words, and optionally their preference weighting, must be performed first, and that information inserted into the text stream in advance. Knuth and Plass' original algorithm does not include page breaking, but may be modified to interface with a page-breaking algorithm.^[3]

Computational complexity

A naive brute-force exhaustive search for the minimum badness by trying every possible combination of breakpoints would take $O(2^{n})$ time, and is impractical. The classic Knuth-Plass approach to solving the minimization problem is a worst-case $O(n^{2})$ algorithm but usually runs much faster in close to linear time.^[4]

Solving for the Knuth-Plass optimum can be shown to be a special case of the convex least-weight subsequence problem, which can be solved in $O(n)$ time.^[5]

Simplified example

For the input text

AAA BB CC DDDDD

with line width 6, a greedy algorithm that puts as many words on a line as possible while preserving order before moving to the next line, would produce:

------    Line width: 6
AAA BB    Remaining space: 0
CC        Remaining space: 4
DDDDD     Remaining space: 1

The sum of squared space left over by this method is $0^{2}+4^{2}+1^{2}=17$ . However, the optimal solution achieves the smaller sum $3^{2}+1^{2}+1^{2}=11$ :

------    Line width: 6
AAA       Remaining space: 3
BB CC     Remaining space: 1
DDDDD     Remaining space: 1

The difference here is that the first line is broken before BB instead of after it, yielding a better right margin and a lower cost 11.

By using a dynamic programming algorithm to choose the positions at which to break the line, instead of choosing breaks greedily, the solution with minimum raggedness may be found in time $O(n^{2})$ , where $n$ is the number of words in the input text.

Typically, the cost function for this technique should be modified so that it does not count the space left on the final line of a paragraph; this modification allows a paragraph to end in the middle of a line without penalty. The same technique can also be extended to take into account other factors such as the number of lines or costs for hyphenating long words.^[2]

Faster but more complicated linear time algorithms based on the SMAWK algorithm are also known for the minimum raggedness problem, and for some other cost functions that have similar properties.^[6]^[7]

References

^ ^a ^b "The Knuth/Plass line-breaking Algorithm". defoe.sourceforge.net. The Folio Project. Retrieved 2024-03-30.
^ ^a ^b ^c Knuth, Donald E.; Plass, Michael F. (1981), "Breaking paragraphs into lines", Software: Practice and Experience, 11 (11): 1119–1184, doi:10.1002/spe.4380111102, S2CID 206508107.
^ ^a ^b Jonathan, Fine (2000). "Line breaking and page breaking" (PDF). TUGboat.
^ "knuth-plass-thoughts/plass.md at master · jaroslov/knuth-plass-thoughts". GitHub. Retrieved 2024-03-30.
^ Wilber, Robert (1988-09-01). "The concave least-weight subsequence problem revisited". Journal of Algorithms. 9 (3): 418–425. doi:10.1016/0196-6774(88)90032-6. ISSN 0196-6774.
^ Wilber, Robert (1988), "The concave least-weight subsequence problem revisited", Journal of Algorithms, 9 (3): 418–425, doi:10.1016/0196-6774(88)90032-6, MR 0955150.
^ Galil, Zvi; Park, Kunsoo (1990), "A linear-time algorithm for concave one-dimensional dynamic programming", Information Processing Letters, 33 (6): 309–311, doi:10.1016/0020-0190(90)90215-J, MR 1045521.

External links

Breaking Paragraphs into Lines, the original paper by Knuth and Plass

This algorithms or data structures-related article is a stub. You can help Wikipedia by expanding it.

This typography-related article is a stub. You can help Wikipedia by expanding it.

[folio-1] "The Knuth/Plass line-breaking Algorithm". defoe.sourceforge.net. The Folio Project. Retrieved 2024-03-30.

[knuth-plass-2] Knuth, Donald E.; Plass, Michael F. (1981), "Breaking paragraphs into lines", Software: Practice and Experience, 11 (11): 1119–1184, doi:10.1002/spe.4380111102, S2CID 206508107.

[fine-3] Jonathan, Fine (2000). "Line breaking and page breaking" (PDF). TUGboat.

[4] "knuth-plass-thoughts/plass.md at master · jaroslov/knuth-plass-thoughts". GitHub. Retrieved 2024-03-30.

[5] Wilber, Robert (1988-09-01). "The concave least-weight subsequence problem revisited". Journal of Algorithms. 9 (3): 418–425. doi:10.1016/0196-6774(88)90032-6. ISSN 0196-6774.

[6] Wilber, Robert (1988), "The concave least-weight subsequence problem revisited", Journal of Algorithms, 9 (3): 418–425, doi:10.1016/0196-6774(88)90032-6, MR 0955150.

[7] Galil, Zvi; Park, Kunsoo (1990), "A linear-time algorithm for concave one-dimensional dynamic programming", Information Processing Letters, 33 (6): 309–311, doi:10.1016/0020-0190(90)90215-J, MR 1045521.

[1]

[2]

[3]

[4]

[5]

[6]

[7]