TB 68 Fine
Jonathan Fine
203 Coldhams Lane
Cambridge, CB1 3HY
United Kingdom
[email protected]
http://www.active-tex.demon.co.uk
Abstract
In their seminal paper of 1981, Knuth and Plass described how to apply the
method of discrete dynamic programming to the problem of breaking a paragraph
into lines. This paper outlines how the same method can be applied to the
problem of page make up, or in other words breaking paragraphs into pages. One
of the key ideas is that there must be interaction between the line breaking and
page breaking routines. It is shown that TeX can, with one important limitation,
fully support such interaction.
This article also shows how TeX can, by using a custom paragraph shape
and a special horizontal list, suppress hyphenation of the last word on a page.
210 TUGboat, Volume 21 (2000), No. 3 — Proceedings of the 2000 Annual Meeting
Line breaking and page breaking
We use ASCII as a convenient shorthand for the input text file.

Tokens. Internally, TeX deals with tokens. Category codes control the
translation of input characters into tokens.

Traditionally, the ASCII character ‘a’ is given category code letter, which
means that when read it becomes the token ‘the letter a’. The same goes for the
other letters. Digits and punctuation are given category other, and so the
digit 7 becomes ‘the character 7’ when read.

Certain other symbols, such as {, } and \, have special category codes. It is
this that gives TeX its familiar ‘backslash and braces’ input syntax. However,
this syntax is not built into TeX the program.

Internally the only tokens that TeX has are character tokens (of various
categories), and control sequences. (A control symbol is a control sequence
whose name is a single character, usually a non-letter.) The traditional
category codes cause the ‘eyes of TeX’ to convert the sequence of characters
\wibble to the control sequence whose name is, well, wibble. This is done by
TeX the program, as part of the process that turns ASCII characters into
tokens.

In Active TeX, every input ASCII character is an active character. An active
character is rather like a control sequence, in that it has a meaning, and this
meaning can be changed at any time. However, its ‘name’ is the active character
‘x’, or whatever it is. In plain TeX, the ‘~’ character is active.

Active TeX does not use the ‘eyes’ of TeX the program to form control
sequences. Instead, it uses macros and the \csname primitive to form control
sequences out of the active characters that it receives from the eyes of TeX.
This means that it never has to change category codes, in order to achieve
special effects, such as verbatim typesetting.

Macros. The internal tokens of TeX (or more exactly their meanings) can be
divided into two classes, namely the expandable and the unexpandable. Most
expandable tokens are macros, and most of the primitive commands of TeX are
unexpandable. However, some primitive commands, such as \ifx, the other
conditional commands and \csname, are expandable.

Unexpandable commands do something (in the stomach of TeX the program), while
expandable commands and macros control what it is that is done. In plain TeX
the ‘~’ character, which is active, is defined to be a macro that places a
penalty and some glue on the horizontal list. This produces an unbreakable
interword space.

Horizontal list. TeX would not be able to typeset without commands that placed
items on the horizontal list. The internal token ‘the letter a’ (obtained say
by reading an ‘a’ from the input ASCII file) will place a ‘character box’ that
is the character ‘a’ in the current font onto the current horizontal list.
(This is only in horizontal mode. In math mode it does something else.) The
internal token ‘the character 9’ behaves in the same way.

There are other items that can go on the horizontal list. For this article, we
need to know only about glue, penalties and discretionary hyphens. Glue is
potentially stretchable and shrinkable interword space, while penalties record
the undesirability of making a line break at this point.

Discretionary hyphens are hyphens that are optional. The line breaking
algorithm can break lines at discretionary hyphens. If the break is taken at a
discretionary hyphen, the hyphen appears, and otherwise nothing appears.
Discretionary hyphens can be placed onto the horizontal list either explicitly,
via the execution of a primitive command, or implicitly, as a result of the
hyphenation algorithm.

Lines. TeX’s line breaking algorithm turns a horizontal list into a sequence of
lines. It does this by choosing a sequence of break points in the horizontal
list. Most of the time, any glue and penalty items after a chosen break point
are discarded. This allows the interword glue to disappear at line breaks.

Normally, TeX breaks the paragraph into lines using the current value of the
\hsize. However, the \parshape parameter allows the width (and offset) of each
line to be specified individually.

Vertical list. After the paragraph has been broken into lines, TeX places the
lines onto the current vertical list. Often, this vertical list is the main
vertical list, also known as ‘the current page’. Each line of the paragraph is
a box (in fact a horizontal box). As well as boxes, a vertical list can contain
(vertical) glue and (vertical) penalties. A vertical list can also contain
other items, such as insertions, that do not concern us here.

The line breaking algorithm places (vertical) glue between the lines, so that
the baseline to baseline distance between the lines is uniform (unless the
lines contain exceptionally tall or deep set matter). It also inserts
(vertical) penalties between the lines, to aid in the page breaking process.
The \clubpenalty is the extra penalty for a page break immediately after the
first line of a paragraph. The
\widowpenalty is the extra penalty for a page break immediately before the last
line of a paragraph.

Pages. Whenever something is placed on the main vertical list, TeX the program
checks to see if it has accumulated enough material to break off the current
page. If it has, then TeX chooses the best of the available break points on the
main vertical list. It then calls a special token list, the \output routine, to
add page numbers and the like to the broken off portion, and to ship it out to
the dvi file. The \output routine is part of the macro package.

The dvi file. This is the output file produced by TeX the program. As well as
recording the placement of every character and every rule on the page, it can
contain what are known as \special commands. Programs that process dvi files
can read the specials, and use them as parameters to their actions. For
example, a special might request the placement of a graphic.

The look-ahead problem. TeX’s line breaking algorithm ‘looks ahead’ to the end
of the paragraph before it makes any decisions as to where the first (or any
other) line break occurs. Each line break is, so to speak, considered not by
itself but in the context of the other line breaks.

The page breaking algorithm does not perform such a look-ahead. Each page break
is considered in isolation, without regard for its consequences later in the
document.

At the end of a paragraph, the line-breaking algorithm is called, and it
produces lines of text. These lines are then placed on, say, the main vertical
list. If enough material has accumulated, the page-breaking algorithm cuts off
enough material for one page, and the output routine is called.

Thus, from the end of the paragraph to the calling of the output routine,
everything is under the control of TeX the program. During this time neither
the user nor the macro programmer has any opportunity to influence TeX’s
behaviour, other than through the values of parameters and the contents of the
horizontal and vertical lists.

TeX’s page-breaking algorithm clearly is deficient for complex work. One needs
to be able to look ahead, when there is floating matter to be placed. Multiple
column layout is particularly complicated. There are two aspects to the
problem. The first is that an improved algorithm requires more than the
information local to the current page. The second is what it does with this
information.

This article concentrates on making information available to an improved
page-breaking algorithm, but has little to say on the internals of such an
algorithm. As in the line-breaking algorithm, the page-breaking algorithm
selects one sequence of possibilities from the many presented to it.

The line-breaking algorithm has look-ahead. Its context is the current
paragraph. To avoid hyphenating the last word on the last line of a page, the
algorithm needs to know where that last line will fall (unless it suppresses
all hyphenation, and so is done with the problem). Therefore, the page-breaking
algorithm will have to feed information back to the line-breaking algorithm.

Once the location of the page break is known, this information can be fed to
the line-breaking algorithm (in the form of a custom paragraph shape). Provided
a suitable horizontal list is constructed, the algorithm will suppress
hyphenation at the required point. How this is done will be shown later in this
article.

Discrete dynamic programming

The purpose of this section is to describe those parts of TeX’s line-breaking
algorithm that are specially relevant to this article. This has two aspects.
The first is those features that are relevant to suppression of hyphenation of
the last word on some specified line of a paragraph. The second is those
features that help us to understand what can be done for global optimisation of
page breaks, and for establishing communication between the line-breaking and
page-breaking algorithms.

In our simplified model, a horizontal list contains character boxes, glue,
penalties and discretionary hyphens. Glue and penalties are what are known as
discardable items. They can disappear at a line break. The other items are
non-discardable. They will never disappear.

A legal breakpoint is any (finite) penalty, any discretionary hyphen, and any
glue item, provided the glue is immediately preceded by something that is
non-discardable. For any sequence of breakpoints, there is a quantity called
the total demerits, that depends on both the chosen breakpoints and on
parameters that can be set by the macro programmer.

For example, when breaking at a penalty, the amount of the penalty is part of
the sum that is the total demerits. Similarly, the \hyphenpenalty and
\exhyphenpenalty parameters are the contributions made by discretionary and
explicit hyphens respectively. If the line has been set loose or tight (shorter
or longer than its optimum width) then a badness for the line contributes to
the total demerits.
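
The way these contributions combine can be sketched in a few lines of Python. The sketch follows the formulas given in The TeXbook (Chapter 14): the badness of a line is roughly 100 times the cube of its glue adjustment ratio, capped at 10000, and the penalty enters the demerits of a line through its square. The function names and the restriction to glue with finite stretch and shrink are assumptions made here for illustration; they are not part of TeX or of any macro package. The extra demerits that TeX adds for consecutive hyphenated lines and for visually incompatible adjacent lines are left out.

```python
def badness(natural, target, stretch, shrink):
    """Approximate badness of setting a line of `natural` width to `target`.

    `stretch` and `shrink` are the total (finite) stretchability and
    shrinkability of the glue on the line, in the same units as the widths.
    """
    excess = target - natural
    if excess == 0:
        return 0
    if excess > 0:
        if stretch <= 0:
            return 10000  # nothing can stretch: as bad as TeX will report
        ratio = excess / stretch
    else:
        if -excess > shrink:
            return float("inf")  # overfull: glue cannot shrink that far
        ratio = -excess / shrink
    return min(round(100 * ratio ** 3), 10000)


def line_demerits(b, penalty, linepenalty=10):
    """Demerits of one line with badness `b`, ended by a break of cost `penalty`.

    `linepenalty` defaults to 10, the plain TeX value of \\linepenalty.
    """
    if penalty >= 10000:
        raise ValueError("a penalty of 10000 or more forbids the break")
    base = (linepenalty + b) ** 2
    if penalty >= 0:
        return base + penalty ** 2
    if penalty > -10000:
        return base - penalty ** 2  # a negative penalty is a reward
    return base  # forced break: the penalty itself is not counted


def total_demerits(lines, linepenalty=10):
    """Sum the demerits over (badness, penalty) pairs, one pair per line."""
    return sum(line_demerits(b, p, linepenalty) for b, p in lines)


# Three lines: a perfect one, a loose one ending at a hyphen, and a final line.
example = [(0, 0), (100, 50), (12, -10000)]
print(total_demerits(example))
```

Raising `linepenalty` makes every extra line more expensive, which is why a large value discourages the algorithm from loosening lines merely to collect the reward of a negative penalty.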
Of all the possible sequences of breakpoints for a given paragraph, TeX chooses
one that has the smallest possible value for the total demerits. It does not
choose the breakpoints line by line, or in other words locally. The breakpoints
are chosen with a view to the whole paragraph, or in other words globally.

The way in which it does this is interesting, because in general there are so
many possible sequences of breakpoints, that it is impossible for them to be
considered individually. The method used is known as discrete dynamic
programming. This method allows the last line of a paragraph to ‘communicate’
with the first. (It is communication not in the sense of sending a message, but
in the sense of being part of a common larger whole.)

To save time, TeX tries first to make the paragraph without using any
hyphenation. The parameters \pretolerance and \tolerance are limits on how bad
a line can be respectively before and after hyphenation. For simplicity, we
will assume that hyphenation is always tried, say because the pretolerance is
zero.

A sequence of breakpoints is said to be feasible if no line has badness
exceeding the tolerance. The line-breaking algorithm considers only feasible
sequences of break points. For formal reasons, the end of the paragraph is
considered to be a breakpoint. It is, after all, the end of a line.

The formula the algorithm uses to compute the total demerits has the following
useful property. Suppose an optimal sequence of breakpoints is selected, and
say lines 5 to 9 of the horizontal list are considered in isolation from the
remainder of the horizontal list. The optimal sequence of breakpoints for the
whole paragraph, when restricted to the isolated lines, is also optimal for the
line-breaking problem represented by the isolated problem. This is called the
property of locality. It is a property of the formula for total demerits.

Discrete dynamic programming, as applied to line-breaking, consists of the
following. Start at the beginning of the paragraph. Calculate the feasible
breakpoints for the end of the first line. From these breakpoints calculate the
feasible breakpoints for the end of the second line. We now prune the list of
feasible sequences of breakpoints. If two or more sequences end the second line
at the same point, keep only the best one. (If several are joint first, keep
only one.) For each of the remaining two-line breakpoint sequences, compute all
the feasible extensions to three-line sequences, and prune as before.

As this process continues, so the number of both feasible and locally optimal
sequences will in general grow. However, the growth will not be too rapid.
Consider the spread in the location of the breakpoint that is the end of, say,
the nth line. If the first n lines contain as little set matter as is possible,
then we get one location in the horizontal list. If they contain as much as is
possible, we get another. This is the spread. It is roughly linear in n. The
number of breakpoints in this spread is the number of locally optimal
breakpoints that the algorithm must carry along to the n + 1 stage.

This analysis limits the running time to the order of n^2. However, we can do
better. When the spread gets large, it will cover the length of a whole line,
and so some of the calculations for n + 2 will have been done as part of n + 1.
This also shows why using a custom paragraph shape is computationally
expensive. There is no longer such a sharing of computations between lines.

The line-breaking and the page-breaking algorithms have a certain amount in
common. This is how Don Knuth puts it in The TeXbook (page 100):

    TeX breaks lists of lines into pages by computing badness ratings and
    penalties, more or less as it does when breaking paragraphs into lines.
    But pages are made up one at a time and removed from TeX’s memory; there
    is no looking ahead to see how one page break will affect the next one.
    In other words, TeX uses a special method to find the optimum breakpoints
    for the lines in an entire paragraph, but it doesn’t attempt to find the
    optimum breakpoints for the pages in an entire document. The computer
    doesn’t have enough high-speed memory capacity to remember the contents
    of several pages, so TeX simply chooses each page break as best it can,
    by a process of “local” rather than “global” optimisation.

The situation is not impossible though. In Appendix D (page 400) Don Knuth
writes:

    An output routine can also write notes on a file, based on what occurs in
    a manuscript. A two-pass system can be devised where TeX simply gathers
    information during the first pass; the actual typesetting can be done
    during the second pass, using \read to recover information that was
    written during the first.

Provided sufficient information can be gathered in the first pass, it can then
be presented to TeX’s line-breaking algorithm, or some other program, so that
an optimal choice can be made from amongst
those which are feasible. The second pass can then do the actual typesetting.

Avoiding ‘last-word’ hyphenation

In this section we explain how a suitable horizontal list and paragraph shape
specification taken together will cause the line-breaking algorithm to suppress
hyphenation of the word at the end of some specified line.

The basic idea is quite simple. Hyphenation places matter on the next line.
Indeed, this is the very purpose of hyphenation. However, if the next line is
not long enough to hold even the smallest fragment of a word, then the word at
the end of the previous line will not be hyphenated. (It is possible for a very
long word to be hyphenated two or more times. Each hyphenation point is a
legitimate breakpoint.)

    The sixth line has zero width. This pre-
    vents hyphenation of the word at the e-
    nd of the fifth line. This is because whe-
    n a word is hyphenated, part of the wor-
    d is placed on the next line. (This is the

    very purpose of hyphenation.) A specia-
    l sequence of items of glue and penalti-
    es is placed between words. This allow-
    s the interword glue to span the zero-w-
    idth line.

    Figure 1: Example of suppressed hyphenation

We can achieve this effect on say the fifth line by making the width of the
sixth line equal to zero. This however creates a problem. If we use ordinary
glue between words, then between any two words there will be only one
breakpoint, namely the glue that was between the words. For some word to be
allowed to occur at the end of the fifth line, it must be followed by a special
piece of ‘glue’, that is capable of spanning the zero-width sixth line.

Recall that in our simplified model (which is all we need), breaks can occur at
penalties, at discretionary hyphens, and at glue that is preceded by something
that is not discardable. To go further, we need to understand exactly what
happens at a line break.

According to The TeXbook (page 97):

    When a line break actually does occur, TeX removes all discardable items
    that follow the break, until coming to something non-discardable, or
    until coming to another chosen breakpoint. For example, a sequence of
    glue and penalty items will vanish as a unit, if no boxes intervene,
    unless the optimum breakpoint sequence includes one or more of the
    penalties.

In other words, most of the time discardable items are discarded, but any
(finite) penalties are allowed to be part of the breakpoint sequence, if that
is what the algorithm decides to do. Put another way, when moving on to the
next feasible breakpoint, it has something of a free choice in the discarding
of discardables.

Therefore, each piece of ‘glue’ between words will have to contain two
legitimate break points, as well as an ordinary piece of interword glue. The
way to get this is to place two penalties of zero, followed by the ordinary
interword glue. (The penalty for breaking at glue preceded by a
non-discardable, such as a word, is zero. Thus, in ordinary cases we get the
same behaviour as before.)

Something similar arises in ordinary practice. Sometimes a line is deliberately
left short, say because the next word is too long to fit on the line, and it
cannot be hyphenated. The standard way to achieve this is to insert
\hfil \break in the line. The \break is just a shorthand for a penalty of
-10000, which forces a break, and the \hfil is glue that stretches to fill the
line. When the line has zero width, no glue is required to fill it.

In August 1999, the author posted to the newsgroup comp.text.tex example code
that suppressed hyphenation. A lively debate followed, but not until the author
came to write this article did he discover, to his shock and horror, that the
code he posted last summer did not work in many cases. In the first version of
this paper, his solution had an unnecessary but harmless zero-width piece of
glue between the two penalties. This was not noticed until after the paper had
been refereed. Clearly, some of us have something to learn about penalties and
glue.

The echowords environment

Figure 1 shows the result of applying the methods of the previous section. So
that there are many hyphens, a discretionary hyphen has been placed between
adjacent letters of a word. The spaces between words contribute, as described
in the previous section, two penalties of zero and an ordinary interword space.
Although it is clearly possible to construct such a horizontal list by hand,
doing so is laborious and prone to error.

Instead, the author has used Active TeX to simplify the form of the example’s
input. In fact the author wrote
    ]] %% now back to the usual catcodes

We will now explain what is going on. First, the .latex command. This command
opens a group, in which all ASCII characters are active. It looks for a line
that begins with the (active) characters \end{. When it finds this, it closes
the all-active group, and pretends that it had read the \end{ with LaTeX’s
normal category codes. Thus, the first and last of the input lines

    \begin{echowords}
    These words are echoed,
    one by one.
    \end{echowords}

are processed by the begin and end commands of the echowords environment. The
two lines in the middle are read and processed with all characters active, and
with the values set by the environment.

We simplify slightly. To avoid needlessly filling TeX’s macro processor (the
mouth) with a long list of tokens, the search for the \end{ is done on a line
by line basis. However, this makes no difference in practice.

The first three assignment commands tell Active TeX how to process letters
(upper- and lowercase) and digits. Symbols are rather different. Most if not
all of the time, all lowercase letters are dealt with by the same rules. The
same goes for uppercase letters, and for digits. It often happens, however,
that each symbol has a specific meaning. For example, in the atcode
environment, each symbol has a distinct meaning of its own.

For this reason, Active TeX uses the concept of symbol sets. Within its realm,
it ‘owns’ all the active symbols. (This is done in a way that does not
interfere with their use outside of its realm.) Instead of directly assigning a
value to a symbol, one selects a symbol set, perhaps of one’s own creation. The
owner of a symbol set is free to change the meaning of symbols in that set. For
as long as that symbol set is selected, for almost all practical purposes
changing the meaning of a symbol in a set is the same as changing the meaning
of the active symbol itself. However, when a different symbol set is selected,
the meaning of all the symbols changes to those of the newly selected set.

Parentheses, as above, are used to select a symbol set. The (default) symbol
set is part of the atcode package, and in it every symbol expands to the
control sequence .symbol, followed by the active symbol itself. Thus the line
of code:

    (default) ; let .symbol get.word ;

causes all symbols to call get.word. In short, all visible characters are to
call the get.word command we just defined.

The .latex command ‘owns’ the active end-of-line character. Only when one knows
for sure that it is safe to do so, should one change its meaning. It is used by
.latex to inspect the next line for the \end{ characters.

At the start and end of each non-blank input line, .latex generates .rs and .re
events. Blank input lines generate the .rs-re event. These events are control
sequences, whose values can be set by the macro programmer. Here we are setting
them to do nothing. (One can think of the visible characters as similarly being
events, but this time with parameters.)

We have now initialised all the ASCII characters except for space and tab. The
next line sets them both to relax. (The construction |ABC generates a character
whose category code is hexadecimal A, and whose character code is hexadecimal
BC. Thus, |D09 is active tab.)

The low-level events (reading a character from the input stream) have now been
dealt with. They create higher level events, namely the initialisation of the
parsing of a word, and the processing of the word once parsed. The
.string.visible macro is a low-level system macro that causes all visible
characters to behave as if they were characters of category code other. This
system macro by-passes the symbol set mechanism. It runs quicker, but must be
used with care.

We are almost done. There are some commands in get.word that need explanation.
The command .suspend.white.space causes the active form of the white space
characters (space, tab and end-of-line) to expand to .suspend followed by the
active white space character. This should only be done within a group, which is
closed by white space. The parsing of a word is exactly such a context.

Finally, the atlatex package contains a helper macro that is very useful for
closing a ‘flying xdef’. Here is its definition.

    def .end.xdef
    { iffalse { fi ; } ; endgroup } ;

To conclude, we reconsider the get.word macro.

    def get.word
    {
      begingroup ; aftergroup do.word ;
      init.get.word ;
      .suspend.white.space ;
      let .suspend .end.xdef ;
      xdef the.word { iffalse } fi ;
    }

It opens a group. After the group, we call do.word. The variable part of the
initialisation routine, namely init.get.word, is defined to make all visible
characters ‘other’. Thus, when the xdef which closes the macro executes, it
simply accumulates visible characters in the.word. (The iffalse is a hack that
allows a macro to ‘contain’ unbalanced braces.) The two suspend commands cause
white space to close the xdef, and thereby trigger the processing of the word.

The reader may find it instructive to run these macros with tracingall on, and
examine the resulting log file.

The hyphdemo environment

Our next example is more substantial. The suppression of hyphenation on the
last line of a page requires the construction of a fairly special horizontal
list. Here is the sequence of penalties and glue that is to be placed between
words. (We use hd as a two letter prefix for ‘hyphenation demonstration’.) But
first we enter the atcode environment.

    \csname @code\endcsname
    def hd:iwspace
    {
      unskip ;
      penalty "0 " ; penalty "0 " ;
      ~ ; // ordinary interword space
    }

The unskip is in case we get two spaces in a row. This is not rigorous, but in
the context it is good enough. Then we put down two penalties, which allows
hd:iwspace to span a blank line. Finally, we put down an ordinary piece of
interword glue. In Active TeX, ~ produces an ordinary space character.

So that we get lots of hyphens, we will place a discretionary hyphen between
adjacent letters in a word. To do this, we use a variant of the get.word
command. This macro applies string to the first character in the word. Each
subsequent character is then responsible for putting down a discretionary
hyphen before stringing itself. To avoid hyphenating just before punctuation at
the end of a word, symbols do not insert a discretionary hyphen.

    def hd:get.word { get.word ; string } ;
    def hd:init.get.word
    {
      def .lcletter { \- ; string } ;
      let .ucletter .lcletter ;
      let .digit .lcletter ;
      let .symbol string ;
    }

The hyphdemo environment takes a single parameter, namely the number of the
line, at the end of which hyphenation is to be suppressed. This parameter
controls the construction of a custom parshape, which will be coded later. If
the parameter is zero, no suppression is offered.

The parameters encourage hyphenation. The large value of the line penalty is to
stop the line-breaking from making the lines very loose, just so it can get the
reward (negative penalty) for the additional hyphen.

    newenvironment { "hyphdemo" } [1]
    {
      par ;
      begingroup ;
      hyphenpenalty "-100" ;
      doublehyphendemerits "0 " ;
      linepenalty "200 " ;
      leftskip "2pc " ;
      rightskip leftskip ;
      hd:set.parshape { #1 } ;

      let .lcletter hd:get.word ;
      let .ucletter .lcletter ;
      let .digit .lcletter ;
      (default) ; let .symbol .lcletter ;
      let ! hd:iwspace ;
      let |D09 ! ; let .re ! ;
      let init.get.word hd:init.get.word ;
      def .re-sp { par } ;
      def do.word { the.word } ;
      .latex ; // don’t forget this
    }
    { par ; endgroup }

The remainder of the definition of this environment sets up the conditions for
the parsing and processing of words, in much the same way as in the previous
section. Note that let do.word the.word would be very wrong. This would cause
the macro to continually process the value of the.word that was current at the
start of the environment.

The difficult part of setting the parameters is to feed the parameters to TeX’s
parshape primitive. It takes 2n + 1 parameters, where n is the number of lines,
whose width we are specifying. These are TeX number and dimension parameters,
and not macro or token parameters. We use aftergroup accumulation to build up
this list. Scratch counters are used to hold the values of parameters whose
values have to be calculated.

    def hd:set.parshape #1
    {
formula for total demerits. One consequence of this property is the following.
If an optimal sequence of breakpoints takes in the feasible breakpoints A and
B, then over the range A to B this sequence is also optimal for this local form
of the problem.

Now suppose A is ‘sufficiently distant’ from the start of the paragraph, and
that B is ‘sufficiently distant’ from A. Here, ‘sufficiently distant’ means
that there is a feasible (but not necessarily optimal) sequence of breakpoints
linking the two points. If enough set matter of random width intervenes between
the two points, the concept has its intuitive meaning. In these circumstances
the line-breaking algorithm will find an optimal sequence of breakpoints
between A and B. It will do this whether or not this is part of the finally
chosen optimal sequence.

Thus, the line-breaking algorithm finds not only the best breaking for the
whole paragraph, but also for a great many portions of the paragraph. In the
same way, any worthwhile discrete dynamic programming solution to the problem
of global optimisation will consider most or all possible feasible ways of
breaking the paragraphs that constitute the document. The strength of Knuth and
Plass’s algorithm is not that it runs quickly in the abstract, but that the
running time is roughly linear, rather than quadratic, in the size of the
problem. Because of linearity, in time hardware will be able to catch up, even
if the problems are large.

Global optimisation

This section describes briefly how a report on a paragraph, as in the previous
section, can be used as the input for a global optimisation process. For
simplicity, we assume that we are setting straight text on a grid, and that
hyphenation is to be suppressed on the last word of each page. We also assume
that no paragraph is longer than a page, or in other words, that it cannot span
two page breaks.

First, it is convenient to recast each paragraph’s report into the following
form. We give the possibilities in order first of the number of lines before
the potential page boundary, and then in order of the number after. Thus, a
fragment of a paragraph’s report might look like the following. (The values for
total demerits are fictional, and are chosen to make the rest of the exposition
clearer. The right hand column will be explained later.)

    4+5->4050 ; wibble 4050, 0 ;
    5+5->5050 ; wibble 5050, 1 ;

Given such a sequence of paragraph reports, and the requirement that there be,
say, exactly 12 lines on each page, there is an associated optimisation
problem. First, for each paragraph report choose one of its entries. Call this
a selection (of paragraphs). Write the selection in the form

    (5) + (3) + (4 + 7) + (5 + 2) + 10 + ...

and say that the selection is feasible if, when summing from left to right,
successive exact multiples of 12 are reached during the progress of the sum.
The above selection is feasible (as far as it goes): the running sum passes
through 12 after the 4, 24 after the second 5, and 36 after the 10. For every
feasible selection, define the grand total demerits to be the sum of demerits
associated with the terms of the form (a+b). Thus, the (4+7) term contributes
4070 to the grand total demerits.

The optimisation problem is to find a feasible selection that minimises the
grand total demerits. This is one way (there are many others) of defining a
global optimisation for the line and page breaks of a document. If such a
problem is to be solved using discrete dynamic programming, the global
optimisation data might take a more elaborate form, but the general structure
will be the same. (The interested reader might wish to look at how TeX’s
line-breaking algorithm supplies demerits for adjacent lines whose looseness is
visually incompatible. It is done by providing each partial problem with a
context.)

It is both interesting and fortunate that the global problem, as described
above, can be solved using TeX’s line-breaking algorithm. It is a matter of
‘putting the book on its side’, and thinking of each line as a ‘word’ in a
paragraph. The problem is to construct a suitable list of boxes, glue and
penalties. So that we can get nice diagrams, we will let one pica represent one
line.

Discardable items can vanish at line breaks, and with trickery this allows the
problem to be solved. Consider for example the sequence of horizontal list
items,

    penalty "4070 " ;
    kern "-2pc " ;
    noalign {} ;
    kern "2pc " ;

Kerns are discardable items. If the line break is taken at the penalty, the
first kern will be discarded. The noalign is non-discardable, and it pre-
4+6->4060 ; wibble 4060, 1 ; vents the second kern from being discarded. Thus,
4+7->4070 ; wibble 4070, 2 ; if the penalty is not a break-point, the kerns can-
; wobble ; cel, but if the penalty is a break point, it effectively
5+4->5040 ; wibble 5040, 0 ; inserts a kern of two pica.
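The feasibility rule and the grand total demerits defined earlier in this section can be sketched in a few lines of Python. This is only an illustration, not the construction the paper goes on to describe (which uses TEX's own line-breaking algorithm). The names `is_feasible`, `grand_total_demerits` and `best_selection` are hypothetical, each term is written as a triple (a, b, demerits) with b = 0 for an unsplit paragraph such as (5), and the sketch assumes that an (a+b) term must place its split exactly on a page boundary. The demerits value 5020 in the usage note is invented for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

PAGE = 12  # lines per page, as in the example in the text

# (a, b, demerits): a lines before the potential page boundary, b lines after.
# b == 0 stands for an unsplit paragraph, written "(a)" in the text.
Term = Tuple[int, int, int]


def is_feasible(selection: Sequence[Term], page: int = PAGE) -> bool:
    """True if the running sum lands exactly on each page boundary it meets.

    Assumption (not spelled out in the text): a split term (a+b) must put
    its split on a page boundary.
    """
    total = 0
    for a, b, _ in selection:
        for part, must_end_page in ((a, b > 0), (b, False)):
            room = page - total % page  # lines left on the current page
            if part > room:
                return False  # would step over a page boundary
            total += part
            if must_end_page and total % page != 0:
                return False  # split does not coincide with a boundary
    return True


def grand_total_demerits(selection: Sequence[Term]) -> int:
    """Sum the demerits of the split terms (a+b), as the text defines it."""
    return sum(d for _, b, d in selection if b > 0)


def best_selection(
    reports: Sequence[Sequence[Term]], page: int = PAGE
) -> Optional[Tuple[int, List[Term]]]:
    """Discrete dynamic programming over the paragraphs, in order.

    The state is the number of lines already used on the current page;
    for each state only the cheapest selection reaching it is kept.
    """
    best: Dict[int, Tuple[int, List[Term]]] = {0: (0, [])}
    for report in reports:
        nxt: Dict[int, Tuple[int, List[Term]]] = {}
        for used, (cost, chosen) in best.items():
            for a, b, d in report:
                if b == 0:
                    if used + a > page:
                        continue  # unsplit paragraph does not fit
                    new_used, new_cost = (used + a) % page, cost
                else:
                    if used + a != page or b > page:
                        continue  # split must fall on the boundary
                    new_used, new_cost = b % page, cost + d
                if new_used not in nxt or new_cost < nxt[new_used][0]:
                    nxt[new_used] = (new_cost, chosen + [(a, b, d)])
        best = nxt
    if not best:
        return None  # no feasible selection exists
    return min(best.values(), key=lambda entry: entry[0])
```

For instance, the selection written above becomes `[(5, 0, 0), (3, 0, 0), (4, 7, 4070), (5, 2, 5020), (10, 0, 0)]`; `is_feasible` accepts it, and the (4+7) term contributes 4070 to `grand_total_demerits`. Because `best_selection` keeps only one candidate per page position, its work grows linearly with the number of paragraphs, which is the property the text singles out in Knuth and Plass's algorithm.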
Denote such a sequence of horizontal list items by wibble 4070, 2; and let
wobble represent a kern of one pica. This procedure translates the
sequence of paragraph reports into a horizontal list. When the
line-breaking algorithm is applied to this list, with an hsize of twelve
picas, the result is a global optimisation of the line and page breaks. If
we are not typesetting on a grid, then some ‘interword glue’ (representing
interline glue) should be added to the above construction.

As mentioned earlier, the line-breaking algorithm introduces penalties
between the lines, in order to help TEX’s page-breaking algorithm. These
prevent, or discourage, page breaks just after the first line of a
paragraph, and just before the last line. In the algorithm described here,
the potential page break is part of the paragraph’s specification. The
club penalty thus becomes an extra demerit charged for paragraph
specifications of the form (1 + n), and similarly the widow penalty
applies to (n + 1).

Although there are some difficulties of a technical nature in implementing
such a solution, there is a more fundamental problem. In 1989 [6], when
Don Knuth released version 3 of TEX, he introduced several new primitives.
One of them, \badness, records the badness of the box that was most
recently constructed. Thus, this quantity is made available to the macro
programmer. Sadly, he did not at the same time introduce \totaldemerits,
and so there is no ready access to this quantity.

Summary and conclusions

When Don Knuth announced [8] in 1990 that his work on developing TEX had
come to an end, he pointed out that improved macro packages could be added
on the input side, and improved device drivers added on the output side.
This article shows that ten years after the event, there is still plenty
of room for improvement on the macro package side. (However, the lack of a
\totaldemerits command is unfortunate.)

The problem of suppressing hyphenation at the end of a page is relatively
simple, particularly if a macro package such as Active TEX is used to
construct the horizontal list. What has not been discussed is how to
rearrange the resulting sequence of lines, so that the blank ones amongst
them can be discarded. In the abstract this is not difficult, but in the
context of an existing macro package one may find assumptions being made
that are inconsistent with this goal.

The problem of page make-up is much harder, particularly where there are
multiple columns and floating material. TEX was not designed to do such
work, although it can readily typeset the paragraphs that will go into the
pages. As is shown in the previous section, it is possible to use the
line-breaking algorithm to solve simple page make-up problems. For more
complicated problems, an external program might be more suitable.

TEX is not good at complicated page make-up, but that is no reason to
‘improve’ it. Complicated page make-up was never a design goal of TEX.
Instead, TEX can be used to feed paragraphs and paragraph reports to an
external make-up program. Such a program can be thought of as an improved
device driver, in the same way as Active TEX is intended to be an improved
macro package.

Postscript

Prior to the TUG 2000 meeting I sent an earlier version of this paper to
Don Knuth, and invited his comments. He told me that I should cite and
read Michael Plass’s thesis [10]. The citing is done, and I hope soon to
read this work. He also said that he cannot add \totaldemerits, as that
would mean changing TEX, and suggested instead that I approach the authors
of extensions to TEX.

In his essay on the errors of TEX [7], Don Knuth wrote:

    Of course I don’t mean to imply that all problems of computational
    typography have been solved. Far from it! There are still countless
    important issues to be studied, relating especially to the many
    classes of documents that go far beyond what I ever intended TEX to
    handle.

I hope that this article shows that a few judicious extensions to TEX will
produce a new system that can handle well many new classes of documents,
and that even TEX can make a fair attempt at doing the job. What seems to
be required, above all, is an understanding of the problem, and the
development of suitable algorithms. From then on, the programming of the
extensions should be straightforward.

The article by Frank Mittelbach in these proceedings [9] addresses a
different aspect of the page make-up problem. His concern is with
placement of floats. Combining his work with mine, even at the level of
algorithms, is already a challenge. When it comes to implementation, the
widespread use of active space characters is likely to present LATEX with
many problems. Assumptions about category codes are built into its input
syntax.

So much for output. On the input side the papers by David Carlisle [1] and
by Pedro Palao Gostanza [3] in these proceedings have significant overlap
with
this paper. I am delighted that others are taking steps in the direction
of making all characters active. However, we now have three incompatible
systems of values for the meaning of active letters and digits.

Active TEX provides a powerful and effective programming environment,
especially for defining active characters. Without such a device, the
programmer has to resort to ad hoc tricks, time and time again. For
example, of the 1,936 lines of xmltex (v0.07), exactly 194 contain the
string catcode. By contrast, of the 6,361 lines of my sgmlbase package
(v0.00), exactly 23 contain the string catcode. Perhaps if Carlisle had
used Active TEX, his work would have been easier.

For this area to flourish, standards are required. Without standards,
incompatible versions of the basic macros will be re-invented. Application
programmers will then have to work harder, to cope with this unhelpful
diversity. There is also the danger of schisms within the community.

To understand this, imagine what life would be like if we used
incompatible mechanisms for register allocation (\newcount and the like).
In The TEXbook (page 346), Don Knuth addressed precisely this problem:

    Allocation of registers. The second major part of the plain.tex file
    provides a foundation on which systems of independently developed
    macros can coexist peacefully without interfering in their usage of
    registers.

We need the same for active characters. The packages atcode.sty and
atlatex.sty have been written to be a fixed point that opens this area to
the plain and LATEX macro programmer. They differ only in a small but
significant detail (colon instead of prefix is used to segment the name
space) from the version announced at TUG 1999.

I offer these packages to the community, and hope for the rapid and
widespread adoption of a standard for the use of active characters. I
would of course prefer that my own macros were the standard, but more
important both to me and to the community as a whole, I believe, is that a
standard acceptable to all is adopted.

References

[1] David P. Carlisle, xmltex: A non validating (and not 100% conforming)
    namespace aware XML parser implemented in TEX, these proceedings.
[2] Jonathan Fine, Active TEX and the DOT input syntax, TUGboat, 20
    (1999), 248–254.
[3] Pedro Palao Gostanza, Fast scanners and self-parsing in TEX, these
    proceedings.
[4] Donald E. Knuth, Michael F. Plass, Breaking paragraphs into lines,
    Software: Practice and Experience, 11 (1981), 1119–1184.
[5] Donald E. Knuth, The TEXbook, Addison-Wesley (1984).
[6] Donald E. Knuth, The new versions of TEX and METAFONT, TUGboat, 10 (3)
    (1989), 325–328.
[7] Donald E. Knuth, The Errors of TEX, Software: Practice and Experience,
    19 (1989), 605–685; reprinted with additions and corrections as
    Chapter 10 of Literate Programming.
[8] Donald E. Knuth, The future of TEX and METAFONT, TUGboat, 11 (4)
    (1990), 489.
[9] Frank Mittelbach, Formatting documents with floats, these proceedings.
[10] Michael F. Plass, Optimal Pagination Techniques for Automatic
    Typesetting Systems, Ph.D. thesis, Stanford University (1981).
    Published also as Xerox Palo Alto Research Center report ISL-81-1.