Jump to content

Knuth–Plass line-breaking algorithm

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by The Anome (talk | contribs) at 14:16, 30 March 2024 (clarify). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The Knuth–Plass algorithm is a line-breaking algorithm designed for use in Donald Knuth's typesetting program TeX. It integrates the problems of text justification and hyphenation into a single algorithm by using a discrete dynamic programming method to minimize a loss function that attempts to describe the aesthetic qualities desired in the finished output.[1][2][3]

The algorithm works by dividing the text into a stream of three kinds of objects: boxes, which are non-resizable chunks of content, glue, which are flexible, resizeable elements, and penalties, which represent places where breaking is undersirable (or, if negative, desirable).[2]

Knuth and Plass' original algorithm does not include page breaking, but may be modified to interface with a page-breaking algorithm.[3]

The Knuth-Plass approach to line-splitting is a worst-case algorithm but often runs much faster. Solving for the Knuth-Plass optimum can be shown to be a special case of the convex least-weight subsequence problem, which can be solved in time.[4]

Simplified example

For the input text

AAA BB CC DDDDD

with line width 6, a greedy algorithm that puts as many words on a line as possible while preserving order before moving to the next line, would produce:

------    Line width: 6
AAA BB    Remaining space: 0
CC        Remaining space: 4
DDDDD     Remaining space: 1

The sum of squared space left over by this method is . However, the optimal solution achieves the smaller sum :

------    Line width: 6
AAA       Remaining space: 3
BB CC     Remaining space: 1
DDDDD     Remaining space: 1

The difference here is that the first line is broken before BB instead of after it, yielding a better right margin and a lower cost 11.

By using a dynamic programming algorithm to choose the positions at which to break the line, instead of choosing breaks greedily, the solution with minimum raggedness may be found in time , where is the number of words in the input text.

Typically, the cost function for this technique should be modified so that it does not count the space left on the final line of a paragraph; this modification allows a paragraph to end in the middle of a line without penalty. The same technique can also be extended to take into account other factors such as the number of lines or costs for hyphenating long words.[2]

Faster but more complicated linear time algorithms based on the SMAWK algorithm are also known for the minimum raggedness problem, and for some other cost functions that have similar properties.[5][6]

References

  1. ^ "The Knuth/Plass line-breaking Algorithm". defoe.sourceforge.net. The Folio Project. Retrieved 2024-03-30.
  2. ^ a b c Knuth, Donald E.; Plass, Michael F. (1981), "Breaking paragraphs into lines", Software: Practice and Experience, 11 (11): 1119–1184, doi:10.1002/spe.4380111102, S2CID 206508107.
  3. ^ a b Jonathan, Fine (2000). "Line breaking and page breaking" (PDF). TUGboat.
  4. ^ Wilber, Robert (1988-09-01). "The concave least-weight subsequence problem revisited". Journal of Algorithms. 9 (3): 418–425. doi:10.1016/0196-6774(88)90032-6. ISSN 0196-6774.
  5. ^ Wilber, Robert (1988), "The concave least-weight subsequence problem revisited", Journal of Algorithms, 9 (3): 418–425, doi:10.1016/0196-6774(88)90032-6, MR 0955150.
  6. ^ Galil, Zvi; Park, Kunsoo (1990), "A linear-time algorithm for concave one-dimensional dynamic programming", Information Processing Letters, 33 (6): 309–311, doi:10.1016/0020-0190(90)90215-J, MR 1045521.