168 J. M. CHAMBERS
way. Both paradigms are valuable for serious programming with the language. But in both cases, understanding the relevant ideas in the context of R is needed to avoid confusion. The confusion sometimes arises, in both cases, from applying to R interpretations of the paradigms that apply to other languages but not to this one. Section 2 of the paper will review the ideas, generally and in their R versions, with the goal of clarifying the basics. Given the importance of R software to the community, creators of new R software should benefit from understanding these concepts.

We will also examine in Section 3 of the paper the evolution that led to these versions of functional programming and OOP. The prime motivation was not language design in the abstract but to provide the tools needed for research and data analysis by the user community at the time. R originally reproduced the functionality of the S language at Bell Labs, which itself had evolved through several stages beginning in the late 1970s and which was in turn based on earlier statistical software libraries, mainly in Fortran.

R added important new ideas and has continued to evolve, but the main contents inherited through S shaped the capabilities and the approach to statistical computing. In a surprising number of areas, what we think of as “the R way” of organizing the computations actually reflects software developed twenty years or more before R existed.

Having been involved in all the stages, I am naturally inclined to a historical perspective, but it is also the case that the history itself had substantial impact on the results. It may be comforting to view programming languages as abstract definitions, but in practice they evolve from the needs, interests and limitations of their creators and users.

2. FUNCTIONAL AND OBJECT-ORIENTED PROGRAMMING: THE MAIN IDEAS

Functional and object-oriented programming fit naturally into statistical applications and into R. The original motivating use case, fitting models to data, remains compelling. An expression such as

    irisFit <- lm(Sepal.Width ~ . - Sepal.Length, iris)

calls a function that creates an object representing the linear model specified by the first argument, applied to the data specified by the second argument. The computation is functional, well-defined by the arguments. It returns an object whose properties provide the information needed to study and work with the fitted model. Other functions and other objects can adapt to different models in a form that is convenient for both the user and the implementer.

Principles of functional programming guide us in writing reliable, reproducible functions for the different models. Object-oriented programming provides tools for defining the model objects clearly, and adapting to new ideas and new forms of models. Section 3.4 goes into details of the R implementations.

As they have been realized in R, both paradigms center on a few, intuitive concepts. The details are more complicated, as they usually are. In the case of functional programming, the realization in R is only partial, reflecting the language’s origins as well as practical considerations. In the case of OOP, there are now at least three realizations of the ideas in R, using two different paradigms. All three have significant applications and practical value.

Despite all these devilish details, the main ideas remain visible and useful, particularly when programming serious applications using the language.

2.1 Functional Programming

For our purposes, the main principles of functional programming can be summarized as follows:

1. Programming consists largely of defining functions.
2. A function definition in the language, like a function in mathematics, implies that a function call returns a unique value corresponding to each valid set of arguments, but only dependent on these arguments.
3. A function call has no side effects that could alter other computations.

The implication of the second point is that functions in the programming language are mappings from the allowed set of arguments to some range of output values. In particular, the returned value should not depend on other quantities that affect the “state” of the software when the function call is evaluated.

True functional languages conform to these ideas both by what they do provide, such as pattern expressions, and what they do not provide, such as procedural iteration or dynamic assignments. The classic tutorial example of the factorial function, for example, could be expressed in the Haskell language by the pattern:

    factorial x = if x > 0
                  then x * factorial (x-1) else 1,

plus some type information, such as that a value for x must be an integer scalar.
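For comparison, the same definition can be sketched in R. This is an illustration only, not code from the paper: the name `factorial_r` is hypothetical (chosen so that base R's `factorial()` is not masked), and the explicit check on `x` is an assumed stand-in for the type information that Haskell would declare.

```r
# Recursive factorial in a functional style: the value depends only on
# the argument x, and nothing outside the call is modified.
# `factorial_r` is a hypothetical name used for illustration.
factorial_r <- function(x) {
  # Assumed stand-in for a type declaration: one whole-number value.
  stopifnot(is.numeric(x), length(x) == 1, x == round(x))
  if (x > 0) x * factorial_r(x - 1) else 1
}

factorial_r(5)
```

Because the body contains no assignments outside the call and reads no external objects, repeated calls with the same argument return the same value, which is the property stated in the second and third principles above.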
OBJECT-ORIENTED AND FUNCTIONAL PROGRAMMING 169
Is R a functional programming language in this sense? No. The structure of the language does not enforce functionality; Section 2.3 examines that structure as it relates to functional programming and OOP. The evolution of R from earlier work in statistical computing also inevitably left portions of earlier pre-functional computations; Section 3 outlines the history. Random number generation, for example, is implemented in a distinctly “state-based” model in which an object in the global environment (.Random.seed) represents the current state of the generators. Purely functional languages have developed techniques for many of these computations, but rewriting R to eliminate its huge body of supporting software is not a practical prospect and would require replacing some very well-tested and well-analyzed computations (random number generation being a good example).

Functional programming remains an important paradigm for statistical computing in spite of these limitations. Statistical models for data, the motivating example for many features in S and R, illustrate the value of analyzing the software from a functional programming perspective. Software for fitting models to data remains one of the most active uses of R. The functional validity of such software is important both for theoretical justification and to defend the results in areas of controversy: Can we show that the fitted models are well-defined functions of the data, perhaps with other inputs to the model such as prior distributions considered as additional arguments? The structure of R as described in Section 2.3 can provide support for analyzing functional validity. Equally usefully, such analysis can also illuminate the limits of functional validity for particular software, such as that for model-fitting.

2.2 Object-Oriented Programming

The main ideas of object-oriented programming are also quite simple and intuitive:

1. Everything we compute with is an object, and objects should be structured to suit the goals of our computations.
2. For this, the key programming tool is a class definition saying that objects belonging to this class share structure defined by properties they all have, with the properties being themselves objects of some specified class.
3. A class can inherit from (contain) a simpler superclass, such that an object of this class is also an object of the superclass.
4. In order to compute with objects, we can define methods that are only used when objects are of certain classes.

Many programming languages reflect these ideas, either from their inception or by adding some or all of the ideas to an existing language.

Is R an OOP language? Not from its inception, but it has added important software reflecting the ideas. In fact, it has done so in at least three separate forms, giving rise to some confusion that this paper attempts to reduce.

Some of the confusion arises from not recognizing that the final item in the list above can be implemented in radically different ways, depending on the general paradigm of the programming language. A key distinction is whether the methods are to be embedded in some form of functional programming.

Traditionally, most languages adopting the OOP paradigm are not functional; either the language began with objects and classes as a central motivation (SIMULA, Java) or added the paradigm to an existing non-functional language (C++, Python). In such languages, methods were naturally associated with classes, essentially as callable properties of the objects. The language would then include syntax to call or invoke a method on a particular object, most often using the infix operator “.”. The class definition then encapsulates all the software for the class. Where methods are needed for other computations, such as special method names in Python or operator overloading in C++, these are provided by ad-hoc mechanisms in the language, but the method remains part of the class definition.

In a language that is functional or that aspires to behave functionally as S and R do, the natural role of methods corresponds to the intuitive meaning of “method”: a technique for computing the desired result of a function call. In functional OOP, the particular computational technique is chosen because one or more arguments are objects from recognized classes. Methods in this situation belong to functions, not to classes; the functions are generic. In the simplest and most common case, referred to as a standard generic function in R, the function defines the formal arguments but otherwise consists of nothing but a table of the corresponding methods plus a command to select the method in the table that matches the classes of the arguments. The selected method is a function; the call to the generic is then evaluated as a call to the selected method.
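The description of a standard generic function can be made concrete with a small sketch using the `methods` package. The classes `Shape` and `Circle` and the generic `describe` are hypothetical names invented for this illustration; only `setClass()`, `setGeneric()`, `setMethod()`, `standardGeneric()` and `new()` are actual R functions.

```r
library(methods)  # normally attached already; loaded explicitly here

# Hypothetical classes: a simple superclass and a class that contains it.
setClass("Shape", slots = c(name = "character"))
setClass("Circle", contains = "Shape", slots = c(radius = "numeric"))

# The generic defines the formal arguments; its body is only the call
# to standardGeneric(), which selects a method from the generic's table.
setGeneric("describe", function(object) standardGeneric("describe"))

# Methods are attached to the generic function, not to the classes.
setMethod("describe", "Shape", function(object) {
  paste("shape:", object@name)
})
setMethod("describe", "Circle", function(object) {
  paste("circle of radius", object@radius)
})

describe(new("Shape", name = "blob"))             # uses the Shape method
describe(new("Circle", name = "c1", radius = 2))  # uses the Circle method
```

Removing the `Circle` method would leave the second call valid: the object is also a `Shape`, so the inherited method would be selected instead.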
We will refer to this form of object-oriented programming as functional OOP as opposed to the encapsulated form in which methods are part of the class definition.

2.3 Their Relationship to R

To understand computations in R, two slogans are helpful:

• Everything that exists is an object.
• Everything that happens is a function call.

In contrast to languages such as Java and C++ where objects are distinct from more primitive data types, every reference in R is to an object, in particular, to a single internal structure type in the underlying C implementation. This applies to data in the usual sense and also to all parts of the language itself, such as function definitions and function calls. Computations that are more complex than a constant or a simple name are all treated as function calls by the R evaluator, with control structures and operators simply alternative syntax hiding the function call. [Details and examples are shown in (Chambers, 2008, pages 458–468).]

The two slogans, however, do not imply that computations in R must follow either functional or object-oriented programming in the senses outlined in the preceding sections. With respect to object-oriented programming, R has several implementations that have evolved as outlined in Section 3. These can be used by programmers to provide software following either of the OOP paradigms.

Functional programming’s relationship to R is less straightforward. The evaluation process in R does not enforce functional programming, but does encourage it to a degree. In particular, the evaluation process in R contributes to functional programming by largely avoiding side effects when function calls are evaluated, but some mechanisms in the language and especially in the underlying support code can behave in a non-functional way. To understand in a bit more detail, we need to examine this evaluation process.

Computations in R are carried out by the R evaluator by evaluating function call objects. These have an expression for the function definition (usually a reference to it by name) and zero or more expressions for the arguments to the call. The full details are somewhat beyond our scope here, but an essential question is how references to objects are handled. Any programming language must have references to data, which in R means references to objects. As discussed in Section 3, the evolution of such references is central to the evolution of programming languages, especially for statistics.

In R a reference to an object is the combination of a name and a context in which to look up that name; the contexts in R are themselves objects, of type “environment”. A reference is therefore the combination of a name and an environment. (We’ll look at an example shortly.)

Note that we are talking about references to objects; most objects in R are not themselves reference objects. Languages implementing OOP in the traditional, non-functional form essentially always include reference objects, in particular, what are termed mutable references. If a method alters an object, say, by assigning new values to some of its properties, all references to that object see the change, regardless of the context of the call to the method. Whether the reassignment of the property takes place where the object originated or down in some other method makes no difference; the object itself is the reference.

In contrast, the reference in R consists of a name and an environment: the environment in which the object referred to has been assigned with that name. Most R programming is based on a concept of local references; that is, reassigning part of an object referred to by name alters the object referred to by that name, but only in the local environment. If that local reference started out as a reference in some other environment, that other reference is still to the original object.

To understand the relation of local references to functional programming in R, an example and a few more details of function call evaluation are needed. R evaluates function calls as objects. For example, when the evaluator encounters the call

    lm(Sepal.Width ~ . - Sepal.Length, iris),

it uses the object representing the call to create an environment for the evaluation.

The call identifies the function, also an object of course, typically referring to it by name. In this case lm refers to an object in the stats package. That object has formal arguments [14 of them, in the case of lm()]. The evaluator initializes an environment for the call with objects corresponding to the formal arguments, as unevaluated expressions built from the two actual arguments and default expressions found in the function definition. For details see Section 4 of the language definition, R Core Team (2013) and Chapter 13 of Chambers (2008). As an aside, the common use of terms like “call by value” (and the contrasting “call by
reference”) for argument passing in R is invalid and misleading. Arguments are not “passed” in the usual sense.

Local references operate on all the objects in the environment to prevent side effects. The formal argument data to lm() matches the expression iris, which refers to an object in the datasets package. Expressions that extract information from data work on that object. But the local reference defined by data and the environment of the evaluation is distinct from the reference to iris in the package. If an assignment or replacement expression is encountered that would alter data, the evaluator will duplicate the object first to ensure locality of the reference.

The local reference paradigm is helpful in validating the functionality of an R function. Only the local assignments and replacements need to be examined; calls to other functions will not alter references in this environment, so long as those functions stick to local reference behavior. If a function f() calls a function g() and both functions stick to local reference assignments, then knowing that the value of a call to g() depends only on the arguments is all that is needed; how g() computes that value is irrelevant.

While local references help avoid side effects, they do not prevent computations from referring to objects or other data outside the functions being called, and therefore potentially returning a result that depends on a non-functional “state.” Whether a particular computation in R is strictly functional can only be determined by examining it in detail, including all the functions that call code in C or Fortran.

The rest of this section takes a slight detour to consider how one might do that examination.

Validating Functionality in R

In principle, the functional validity of particular computations could be analyzed and either certified or the limitations to functionality reported. Such functional validation would be useful in cases where either the theoretical validity or the implications of the result in an application are being questioned. Fitting models to data provides a natural example for both aspects. Given a function taking as arguments data and a model specification and returning a fitted model object, can one validate that the returned object is functionally defined by the arguments? If not, can the non-functionality be parametrized meaningfully, in which case one can construct a functional version of the computation by including such parameters as implicit arguments? R does not have organized support for such validity investigations, but developing tools for the purpose would be a worthwhile project.

Functional validation is a bottom-up construction. The bottom layer consists of any functions called that are not implemented in R, typically those that call routines in C++, C or Fortran. Included are the R primitives, routines from numerical libraries and a variety of other standard sources, plus any new code brought in to implement the computation in question. The functional validity of each of these is an empirical assertion. Some are clearly non-functional, such as the “<<-” operator and assign() function that do nonlocal assignments.

Many computations in R eventually call subprograms not originally written for R. Each of these must be examined for potential non-functional behavior, sometimes a daunting task. However, good practice in using well-tested, preferably open-source supporting software will often provide a plausible basis.

If R code includes an interface to code in C, Fortran or other languages whose functional validity cannot be established, nothing more can be said. Other than such code, functional validity is likely to fail for one of three reasons:

• dependence on nonlocal values;
• using low-level computations in R known to violate functionality;
• changing functions or other objects at run time.

A prime example of the first is the use of external data, such as the global options object, for convergence tolerances or other parameters for iterative numerical computations. An example of the second is the inclusion of pseudo-random values in the calculation. The third problem might be caused, for example, by using a function from the global environment.

The third danger is greatly reduced when the code resides in the namespace of a package with explicit import rules. Any reasonable approach to validating functionality would make this a requirement.

My feeling is that most examples of failures could be corrected to create functionally valid extensions of the computation in question. Tolerances are often organized through the R options() function, explicitly designed to avoid functional programming by allowing users to set state parameters that are then queried by the calculation. Once identified, such options could be converted to additional arguments to the function being validated. [A general mechanism would be a version of getOption() that required the option in question to be supplied as an argument.]
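The conversion of an option into an argument can be sketched as follows. This is a hypothetical illustration rather than code from any package: the option name `sketch.tol` and the functions `sqrt_newton_state()` and `sqrt_newton()` are invented for the example, and a simple Newton iteration for a square root stands in for a real iterative fitting computation.

```r
# Non-functional version: the result depends on a global option, that
# is, on state outside the arguments. "sketch.tol" is a hypothetical
# option name used only for this illustration.
sqrt_newton_state <- function(a) {
  stopifnot(a > 0)
  tol <- getOption("sketch.tol", default = 1e-8)
  x <- a
  while (abs(x * x - a) > tol) x <- (x + a / x) / 2
  x
}

# Functional version: the tolerance is an explicit argument, so the
# value is determined by the arguments alone.
sqrt_newton <- function(a, tol = 1e-8) {
  stopifnot(a > 0)
  x <- a
  while (abs(x * x - a) > tol) x <- (x + a / x) / 2
  x
}

old <- options(sketch.tol = 1)   # change the global state
loose <- sqrt_newton_state(2)    # stops after one step, at 1.5
options(old)                     # restore the previous state
tight <- sqrt_newton_state(2)    # same argument, different value

identical(loose, tight)              # FALSE: the value depends on state
identical(sqrt_newton(2, 1), loose)  # TRUE: the state is now an argument
```

The functional version reproduces either result once the tolerance is supplied, which is the sense in which the non-functionality has been parametrized.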
Pseudo-random values are used in a variety of procedures, including some optimization techniques where they are expected to provide more robust numerical behavior by jittering values during iteration. These can be made functionally valid by using well-defined generator software, such as that supplied in R itself, and by treating the initial state of the generator as another nonlocal value to be incorporated as an additional argument. One should always include an explicit initialization via set.seed() in any example expected to be reproducible, and that practice can be the basis for a functionally valid version of the computation.

Beyond these specific examples, numerical computations often depend on the underlying parameters of the floating-point computations, for example, to select convergence criteria for iteration. Fortunately, several decades of work by numerical analysts and hardware designers have greatly standardized the specification of the numerical engine in modern computers: just knowing 32-bit or 64-bit gets us a long way.

Developing a framework for validating functionality seems to me an interesting cooperative research direction that could be of value to the statistical community.

3. THE EVOLUTION OF FUNCTIONAL PROGRAMMING, OOP AND R

The computational paradigms for functional programming and for object-oriented programming have evolved from a sequence of changes in software, beginning with the earliest programmable computers. During the same period, software for statistics was also evolving, one thread of which led through early libraries to S and then to R.

There may be an appearance of earlier languages being replaced by later and presumably improved approaches. It is true that each major revision asserts improvements that will extend our abilities to express our ideas in software. However, none of the versions of S or R actually totally replaced earlier software paradigms.

The current software in, and interfaced from, R illustrates this evolution. R has developed important new techniques, but originated from the S language, reproducing nearly all of S as it was described at that time. S in turn went through several evolutionary changes and was itself based on extensive earlier software, particularly subroutine libraries for Fortran programming. Examining the history shows that a surprising portion of what we see now is structure inherited from the early stages.

The form in which functional programming and OOP were adopted was also influenced by the existing software. Examining the history will explain many of the choices made.

3.1 From Hardware to Data and Libraries

The earliest general-purpose computers were programmed in terms of the physical machine, its storage and the basic operations provided to move data around and perform arithmetic and other operations. The IBM 650 (Figure 1) was probably the first computer widely sold and used (and the machine on which I did my first programming, around 1960).

In this pre-silicon world, storage for data or programs resided on a rotating magnetic drum, holding 2000 decimal words. Data could be read or written only when the corresponding segment of the drum passed under the appropriate fixed head, so that physical positioning of data was a serious aspect of performance.

With this close view of the hardware, programming languages (assembly languages for the actual machine instructions) defined storage in terms of single physical units (words in the 650) and blocks of sequential storage.

This was not an environment to encourage abstraction of ideas about data. However, by 1960 the first generation of “high-level” languages had been introduced and would support profound changes. For statistical computing this meant primarily Fortran.

In terms of data storage, Fortran actually continued the basic notion of single items (scalars) and contiguous blocks (arrays). Two major changes, however, were made. First, the contents were described in terms of their content, the first data types including integer and floating point numbers. Second, the language encouraged operations that iterated over the contents of the arrays. By interpreting an array as a sequence of equal-length subarrays, this indexing extended to matrices and to multi-way tables.

Along with the new paradigm for data and facilities for iteration, the high-level languages encouraged software to be organized in subroutines, so that a computational method could be realized as one or several units of software. While the changes may seem modest from the current perspective, they in fact supported a major revolution in scientific computing generally and emphatically so in computing for statistics.

Algorithm series and other publications supported by professional societies began to accumulate refereed, trustworthy procedures for many key computations. The statistics research group at Bell Labs developed a large Fortran library that reflected our needs
and our philosophy of research and data analysis. The book “Computational Methods for Data Analysis”, Chambers (1977), did not present software but did reflect the tools that would later form the basis for S. After an introduction and discussion of program design, the remaining six chapters covered computations supported by the library:

3. Data Management and Manipulation (including sorting and table lookup).
4. Numerical Computations (approximations, Fourier transforms, integration).
5. Linear Models (numerical linear algebra, regression, multivariate methods).
6. Nonlinear Models (optimization, nonlinear least squares).
7. Simulation of Random Processes (random number generation and Monte Carlo).
8. Computational Graphics (plotting techniques, scatter plots, histograms and probability plots).

Each of these was supported in the pre-S era by subroutines that would then become the basis for corresponding functions in S.

FIG. 1. An IBM 650 computer, mid 1950s. Under the glass is the magnetic drum storage unit (memory), 2000 words for data and programs.

Much of the organization for basic tools in R has inherited, through S, the structure of the subroutine library. That includes the graphical computations, in particular, features essential to S and R: separation of graphic device specification from plotting; the plot, figure and margins structure; graphical parameter specification to control style. These were not created for S but taken over from previous Fortran software, described in Becker and Chambers (1977).

The Bell Labs software was in the background of Chambers (1977), but general readers were given instructions for obtaining similar software from publicly available sources for the methods described. The procedure would not always be simple, but the potential availability marked a big step forward. For the first time, statisticians could draw on an extensive range of relevant software to support their research, at least in principle. Various statistical software packages had existed for some time, but these were by and large oriented to routine analysis, to teaching or to specialized statistical techniques. Chambers (1977) and the software it reflected were aimed at research in statistics and challenging data analysis. For this purpose, a more general and open-ended approach was needed.

3.2 From Fortran to S

For those involved with statistical theory or applications, in academia or industry, there were two main
limitations to the software described so far: availabil- allowed the statistician to formulate ideas directly for
ity and the programming interface. The Appendix to computation. The second version of S was licensed
Chambers (1977) was a set of tables for each of the for general use and described in Becker and Chambers
chapters, with rows corresponding to computational (1984).
tools that were more or less available to readers. The In S, the linear regression computation became a
last column of the table listed sources for the corre- simpler expression, storage for data was provided au-
sponding software. The entries in that column were tomatically and the returned model was now an object,
not uniformly helpful; in the best situation, a gener- with components for the coefficients and residuals:
ally available program library could be ordered that
fit <- reg(X, y).
provided a number of the subroutines, but these were
not designed for statistical applications, most being di- At this stage, S had a functional appearance, not
rected at numerical methods typically motivated by ap- radically unlike R, but its paradigm was essentially
plications in physics. More than half of the entries read an extension of the Fortran view. Dynamically cre-
“Listing,” implying a laborious and error-prone man- ated, self-describing objects were assigned in a sin-
ual procedure for the user. [As an example, many “bug gle workspace, but the underlying computations were
reports” came to us as a result of confusing an “I” and those of the earlier subroutine library: The functions in
a “1” when typing in the stable distribution software, S, documented in Becker and Chambers (1984), were
Chambers, Mallows and Stuck (1976).] in fact interfaces to Fortran subroutines: reg() would
Substantial in-house libraries, such as the one at in fact be programmed by calling lsfit().
Bell Labs, gave users a fairly wide range of compu- Although there was a macro facility in the language,
tations, supported by improved numerical and other al- programming a function in this version of S meant
gorithms. However, to apply the computations specif- “extending S” as described in the book of that name,
ically to a particular dataset with particular results in Becker and Chambers (1985). The definition of the
mind required some substantial additional Fortran pro- new function was programmed in an “interface lan-
gramming. That programming had to be repeated and guage” built on Fortran and compiled from its Fortran
revised for each analysis or research question. translation. As the main programming mechanism this
In the 1970s the situation was therefore a combination of improved basic computational capabilities but with a high programming barrier for most statisticians. The classical linear regression in Fortran as shown in Becker and Chambers (1985), for example, was fairly straightforward:

    call lsfit(X, N, P, y, coef, resid)

This computes the fitted model and returns it as vectors of coefficients and residuals. The data as objects are restricted to arrays, a matrix X and vector y for the data and two arrays, coef and resid for the fitted model. The structure of the objects and their storage allocation remains the programmer’s responsibility. Linking the basic computation to the data in an actual analysis remained nontrivial and mistakes along the way were likely. And this is for the most standard of models. Even given an extensive library, the programming to apply the tools to most applications was a laborious, error-prone activity, usually assigned to dedicated programmers, research assistants or students. The statistician’s ideas went through nontrivial translation before they were expressed as computations.

The first two versions of S were designed to provide an “interactive environment” that included the computational areas described in Chambers (1977) and that ...

... was unsatisfactory, in the sense that extending the language had a substantial learning barrier beyond using the language. The ability to access other software via an inter-system interface remains a key feature of R, however, one still under active development.

Equally as important as the technical side was the beginning of a network of statisticians involved in creating and sharing software through the medium of the language. S was licensed from the early 1980s, available thanks to the newly distributed UNIX operating system, with inexpensive academic licenses to encourage adoption by university researchers, also following the example of UNIX. Open-source software was not an option, but the research community was increasingly involved and their interest stimulated further developments on our part, particularly from contacts with interested users belonging to a “beta testing” network.

Simultaneously, we were thinking about a new approach to the language itself, emphasizing the programming aspect of creating new software for statistical and other quantitative applications. Described initially in Chambers (1987) as a language separate from S, this research later merged with other changes to form the next version, labeled S3 and described in the “blue book,” Becker, Chambers and Wilks (1988).
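
As a point of comparison with the Fortran call to lsfit shown earlier in this section, the following is a minimal sketch using the function of the same name in current R (lsfit() in the stats package). The simulated data and the variable names are assumptions made purely for illustration; they do not come from the sources cited. The sketch is meant only to show that the fit is a single function call whose value is one object with named components, with no dimensions or preallocated output arrays passed by the caller.

```r
# Simulated data; all names and values here are illustrative assumptions.
set.seed(1)
X <- cbind(x1 = rnorm(20), x2 = rnorm(20))
y <- 1 + 2 * X[, "x1"] - X[, "x2"] + rnorm(20, sd = 0.1)

# One function call: the dimensions are taken from the objects themselves,
# and the result is returned as a value rather than written into arrays.
fit <- lsfit(X, y)

names(fit)             # the named components of the returned list
fit$coefficients       # intercept followed by one slope per column of X
NROW(fit$residuals)    # one residual per observation
```

The returned list plays the role of the coef and resid arrays in the Fortran call, but its structure and storage are managed by the language rather than by the programmer.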
The slogans in Section 2.3 were basic to this version of S: everything is an object (stated explicitly) and function calls do all the computation (implicit).

This was functional programming (more or less) and object-based but not object-oriented. Objects were given structure through attributes attached to vectors and through named components, but there were no classes or methods.

3.3 From Data to Classes and Methods

The languages that originated the concepts of classes, properties, inheritance and methods came out of several motivations. The first, Simula, was concerned with simulating systems. In retrospect, modeling by simulation and modeling by fitting to data have clear correspondences but with quite a different perspective. For an example, suppose we want to simulate a simple model for an evolving population of individuals. In R notation, but quite in the style of Simula, we define a class SimplePop. An object from this class is a specific realization of the model population with properties that define the probabilities of birth and death, and a vector of population size at each generation. An object from the population is created by calling the generator for the class:

    p <- SimplePop(birth = 0.08,
                   death = 0.1,
                   size = 100)

Rather than a single functional computation as in the case of linear regression, computations proceed by simulating the evolution of the population object p. The object itself evolves; in the terminology of OOP, it is a mutable reference.

A corresponding difference in the programming paradigms of S and the emerging OOP languages was that the latter did not take a functional view of computation. Instead, computations largely consisted of invoking a method on an object. In the SimplePop example, the fundamental computation is to simulate one generation of the evolution by invoking the evolve() method:

    p$evolve()

The value returned by this method is irrelevant. The method’s purpose is to change the object, in this case by simulating one further generation and appending the resulting value to a property in the object, namely, p$size. (See files “SimplePop.R” and “SimplePopExample.R” in the supplementary materials.)

Following the development of Simula in the late 1960s, a variety of languages adopted this paradigm. C++ added classes and methods to the C language; like C, it was initially used for a variety of programming tasks implementing UNIX and application software for UNIX. In contrast to the “add-on” nature of C++, the Smalltalk language was a very pure, simplified realization of the ideas in Simula. Its major, and revolutionary, application was to implement the graphical user interface created at Xerox PARC in the 1970s. Many other versions of encapsulated OOP followed, either added on to existing languages or incorporated into new languages from the start.

Dialects of the Lisp language and languages based on Lisp also incorporated OOP in various forms. During the 1980s, several research projects built statistical software on the basis of these languages, including some elegant and potentially widely applicable systems, notably LISP-STAT, Tierney (1990). As it turned out, however, the most widely used version of OOP for statistical applications would come from a somewhat casual approach in S.

3.4 Functional OOP in S and R

The chief motivation for introducing classes and functional methods to S was the initial application: fitting, examining and modifying diverse kinds of statistical models for data. This remains arguably the most compelling example for functional OOP in statistics. The “Statistical Models in S” project reported in Chambers and Hastie (1992), the “white book,” brought together ten authors presenting software for a variety of statistical models, from linear regression to tree-based models. The different models were presented as consistently as possible.

Each type of model had a definition as an object having the information, such as coefficients and other properties, required. The object was created by a corresponding function taking as arguments the data, model description and possibly other controlling parameters. A linear regression fit, for example, called the function lm():

    irisFit <- lm(Sepal.Width ~ . - Sepal.Length, iris)

and returned a corresponding linear regression object. Further computations on this object would examine the model, return information about it, or update the fit. The underlying computations still used basic software similar to that for lsfit() and reg(). However, the description of the model (a formula) and the
data (a data frame) were designed to apply to statistical models generally. For example, to fit a generalized linear model the user called glm() with formula and data arguments typically similar to those in a call to lm(). Other arguments would provide information suitable to the particular type of model (a link function, e.g.).

For the convenience of the user, further computations should have a uniform appearance. To print or plot the fitted model or to compute predictions or an updated model corresponding to new data, the user should call the same function [print(), plot(), predict() or update()] in the same way, regardless of the type of model. The owner of the software for a particular type of model, on the other hand, would like to write just that version of each function, without being responsible for the other versions.

Once stated, this is essentially a prescription for functional OOP: a class of objects for each kind of model, generic functions for the computations on the objects and methods for each function for each class. Where one class of models is an extension of another (analysis of variance as a subclass of linear models, e.g.), methods can be inherited when that makes sense.

An implementation of generic functions and methods was introduced as part of the statistical models project and described in the Appendix to the white book. The central mechanism was an explicit method dispatch. The function print(), for example, would evaluate the expression:

    UseMethod("print")

The evaluation of this call would examine the “class” attribute of the first formal argument to the function. If present, this would be a character vector. Eligible methods would be those matching one of the strings in the class vector; if none matched, a method matching the string “default” would be used. Inheritance was implemented by having more than one string in the class, with the first string being “the” class and the remainder corresponding to inherited behavior.

Chambers and Hastie (1992), in the discussion of classes and methods, noted that S differed from other OOP languages because of its functional programming style. In fact, this version of functional OOP finessed the resulting distinction from encapsulated OOP in two ways. First, the methods were dispatched according to a single argument, the first formal argument of the generic function in principle. As a result, the methods were unambiguously associated with a single class, as they would be in encapsulated OOP. Methods were actually dispatched on either argument to the usual binary operators, but a number of encapsulated OOP languages do the same, under the euphemism of operator overloading.

Second, the question of whether methods belonged to a class or a function was avoided by not having them belong to either. Methods were assigned as ordinary functions and identified by the pattern of their name: “function.class”. In any case, there were no class objects and generic functions were ordinary functions that invoked UseMethod() to select and call the appropriate method. Neither the function nor the class was able to own the methods.

Technically, the method dispatch in this version of OOP was instance-based, not class-based, since no rule enforced a consistent set of classes, that is, that all objects with a given first class string would have identical following strings for the superclasses. (R for some time had an S3 class in the base package with a main class string “POSIXt”, representing date/times, that could be followed in different objects by one of two strings that in fact represented specializations, i.e., subclasses, of “POSIXt”.)

The classes and methods implemented for statistical models constituted a bare-bones version of functional OOP, which is not to imply that this was a bad idea. Advantages include a relatively low learning barrier for programming and a thin implementation layer above the previously existing language, which in turn means less computational overhead in some circumstances. [Interestingly, the encapsulated OOP of Python has a similarly thin implementation, with classes containing methods but without defining the properties. A very analogous defense is made for that implementation, in Section 9 of the Python tutorial, Python (2013), e.g.]

A more formal version of functional OOP was developed at Bell Labs, introduced into S in the late 1990s and described in Chambers (1998). By this time, S-based software was exclusively licensed to the Insightful Corporation, which later purchased the rights to the S software, in 2004, and was itself subsequently purchased by Tibco.

The new paradigm differed from S3 classes and methods in three main ways:

1. Methods could be specified for an arbitrary subset of the formal arguments, and method dispatch would find the best match to the classes of the corresponding arguments in a call to the generic function.

2. Classes were defined explicitly with given properties (the slots) and optional superclasses for inheriting both properties and methods.
3. Generic functions, methods and class definitions were themselves objects of formally defined classes, giving the paradigm reflectivity.

The new paradigm was part of the version of S described in the 1998 book and generally referred to as S4. The S4 label is generally applied to this OOP paradigm, whether in S or R. S4 methods never had much chance of replacing S3 methods. In practice, many S4 generic functions were based on functions that already dispatched S3 methods. In this case, the S3 generic function became the default S4 method.

The work on S4 paralleled in time the arrival of R and its conversion into a broad-based joint project following the initial publication by Ihaka and Gentleman (1996). The implementation of R was designed to provide the functionality for S described in the blue book and white book, including S3 methods. Beginning in 2000, an implementation of the S4 version of OOP was added to R. The “Software for Data Analysis” book, Chambers (2008), includes a description of the R version.

Both versions of functional OOP will remain in R. Many prefer the simplicity of the old form, and in any case the very large body of existing code will not be discarded, and should not be. Some important extensions have been made, for example, by registering the S3 methods from a package. Major forward-looking projects have typically used the newer version, for example, the Bioconductor project for bioinformatics software, Gentleman et al. (2004), and the Rcpp interface to C++, Eddelbuettel and François (2011). Recent changes, such as making the S3 and S4 versions of inheritance as compatible as possible, have been aimed at helping the two forms to coexist productively.

Any programming paradigm with some degree of formality is likely to have a higher initial learning barrier and require some extra specification from the programmer. A comparison of encapsulated OOP programming with Python to that with Java is an interesting parallel to S3 and S4. In both examples, the less formal version is likely to be quicker to learn, while the more formal version provides more information about the resulting software. That information in turn can support some forms of validation for the resulting software, as well as tools to analyze and describe it. Python and Java being rather different languages in other respects as well, projects are not too likely to make a choice between them based solely on the formality of the object-oriented programming.

With R, a conscious choice is more likely. The arguments for a more formal approach apply particularly, in my opinion, to projects with one or more of the characteristics: a substantial amount of software is likely to be written; the application has a fairly wide scope in terms of either the data or the computing methods; or the validity and reliability of the resulting software is important.

Nothing prevents good software being written without formal tools in this case nor of bad software being written with them. However, there are several potential benefits that can be summarized in parallel with the main innovations noted above:

1. Allowing methods to depend on multiple arguments fits the functional paradigm in R, in which the arguments collectively define the domain of the function. Many functions in R are naturally applied to different classes of objects, not necessarily corresponding to the first argument, or only to one argument. For example, when binary operators such as arithmetic are defined for a new class, a clean design of methods for the operators often needs to distinguish three cases: the first operand only belonging to the new class, the second operand only or both operands.

2. A formal definition for a class allows programmers to rely on the properties of objects generated from the class. Otherwise, the nature of the objects can only be inferred, if at all, from analyzing all the software that creates or modifies an object of this class.

3. Having formal definitions for the generic functions, methods and class definitions themselves supports a growing set of tools for installing and using packages that include such functions, methods or classes.

The benefits of a general, reliable form of functional OOP extend to developments in the language itself. For example, reference classes were built on the S4 classes and methods, with no internal changes to the R evaluator required.

3.5 Reference Classes

Functional OOP remains an active area in R. In addition, reference classes, introduced to R in 2010 in version 2.12.0, provide an implementation of encapsulated OOP. Class definitions include the properties of the class with optional type declarations; properties may also be optionally declared read-only. Class definitions are themselves objects available at runtime. Methods are programmed as R functions, in which the object itself is implicitly available, not an explicit argument. Methods can access or assign properties in the
object by name. These characteristics make the implementation more Java-like, say, than Python- or C++-like.

The programmer defines a reference class in the R style, calling setRefClass() instead of setClass(). The call returns a generator for the class and saves the class definition object as a side effect, as does setClass() for S4 classes.

As a side comment, while R uses a model for most of its objects and computations that is fundamentally different from the object references in encapsulated OOP, a few key features made the implementation of reference classes in R possible and even relatively straightforward. Most importantly, the R data type “environment” provides a vehicle for object references and properties. Environments are universal in R and well supported by programming tools. In particular, the active binding mechanism, which allows access and assignment operations on objects in environments to be programmed in R, was valuable in the implementation.

Reference classes allow the use of encapsulated OOP for objects that suit that paradigm more naturally than they do functional OOP. As noted in Section 3.3, the essential distinction between functional and encapsulated OOP is whether an object is created, once, by a function call or is instead a mutable object that changes as methods are invoked.

Statistical computing has examples clearly suited to each of these paradigms. The linear model returned by lm() is not open to mutation. Change the numbers in the coefficients or residuals and you no longer have an object that should belong to that class. In contrast, a model simulating a dynamic process such as the SimplePop class in Section 3.3 exists precisely for the purpose of changing, with its evolution being the central point of interest. Other, less directly statistical computations in R also may correspond to mutable objects, for example, the frames or other objects in a graphical interface.

Not every case is clear cut. Sometimes, essentially the same class structure may be more appropriate for functional or encapsulated classes depending on the purpose of the computation. Data frames are a prime example. This essential object structure is viewed naturally as functional when it is part of a functional object related to the data frame. For example, a fitted model that wanted to be fully reproducible could return the data frame on which the fitting was based [e.g., lm() includes the model frame it constructs]. Such a data frame is clearly functional; again, change it and you invalidate the model. On the other hand, a data frame to be used in data cleaning and editing is an object that needs to be mutable.

Having both paradigms in a single language is unusual. Some functional-style languages have implemented functional OOP, notably Dylan, interesting for its parallels with OOP in R; see Shalit (1996), particularly the discussion of method dispatch. Other languages with a functional structure have nevertheless added what is essentially encapsulated OOP, for example, Odersky, Spoon and Venners (2010) for the case of Scala.

We hope that providing both paradigms in R encourages software design that is natural for the application. It does at the same time pose some subtleties. Reference classes and reference class objects are somewhat abnormal in R. One needs to understand the distinctions from standard R objects.

The key is the local reference mechanism noted in Section 2.3. The R evaluator enforces local reference by duplicating an object when a computation might alter a nonlocal reference. Certain object types are exceptions that are not duplicated. The important exception is type “environment”. Reference classes are implemented by extending this type. Encapsulated OOP in R uses no special form of the function call. Method invocation is just a call to the “$” operator, for which reference classes have an S4 method. Reference semantics are obtained by one basic fact: environments are never duplicated automatically. The S4 class mechanism in R nevertheless allows one to subclass the “environment” type in order to define reference class behavior.

The objects in the fields of a reference class object can be ordinary R objects. They behave just as usual and when used in function calls will have regular local reference behavior in that call. It is only when fields in the reference object itself are replaced that the encapsulated OOP is relevant.

Reference class objects are also good candidates for interfaces to other languages that implement the same OOP paradigm, such as Java, C++ or Python. The R object could be a proxy for an object in the other language with methods invoked in R but executed on the original object. The Rcpp interface to C++, Eddelbuettel and François (2011), has a mechanism for extending C++ classes in this way. C++ classes can only be inferred from the source, meaning that either the programmer must supply the interface information (as in the current implementation) or some processing of the source must be applied (currently used to export
functions from C++ but not classes). Java classes are accessible as objects, via “reflection” in Java terminology, so that in principle proxy classes in R should be possible. The rJavax package by Danenberg (2011) has an initial implementation. For Python, methods are available from the objects but properties are not formally defined. At the time of writing, basic interfaces to Python exist, for example, Grothendieck and Bellosta (2012), which could be extended to support class interfaces, with methods but not properties inferred from the Python class objects.

Further work on these and other inter-system interfaces would be a valuable contribution to the user community.

4. SUMMARY

R plays a major role in the communication and dissemination of new techniques for statistics and for results of statistical research more generally. In particular, the many packages written in R or using R as a base for interfacing to other software constitute an essential, rapidly growing resource. Therefore, the quality of such software and the ability of programmers to create and extend it are important.

The current R language and its supporting functionality are the result of many years of evolution, from early programming libraries through the S language to R, which itself has evolved and accumulated a variety of programming techniques. This evolution has been much influenced by the functional and object-oriented programming paradigms. New versions have continued to include supporting software and programming tools found useful at earlier stages along with improved capabilities.

The programming paradigms become especially relevant when the applications are complex or the quality of the resulting software is important. In particular, the versions of object-oriented programming in R can assist in dealing with complexity of the underlying data. As noted, R implements OOP in two forms, functional and encapsulated. These are complementary, with one or the other suitable for particular applications. The latter is essentially the form of OOP used in most other languages, but the former is distinctly different. Considerable confusion has arisen in discussions of OOP in R from not noting that distinction, which the present paper has tried to clarify.

More generally, understanding the role of object-oriented and functional programming in R may assist future contributing programmers in using related programming tools. The continuing rapid growth of R-based software and the expanding, challenging range of techniques it has to support make effective programming an important goal for the statistical community.

The importance of object-oriented programming is likely to increase as statistical software takes on new and challenging applications. In particular, the need to deal with increasingly large objects and distributed sources of data will bring in specialized classes of data and will need powerful computing tools. One important direction has been to transform selected software in R, particularly to speed up large-scale computations; see, for example, the companion paper Temple Lang (2014). Complementary to this is to interface to other languages and software when these provide better performance on “big data” and other computationally demanding applications. In particular, interfaces that match with object-oriented treatments for specialized forms of data can exploit the OOP facilities in R. The interface to C++, Eddelbuettel and François (2011), is an example. Further development of such interfaces will be of much benefit.

Functional programming is perhaps not such an obviously hot topic at the moment. However, the underlying philosophy that our software should be in the form of reliable, defensible units is very much part of R. Situations where the validity of statistical computations needs to be defended are likely to increase, given the growing need for statistical treatment of complex problems for science and society.

ACKNOWLEDGMENTS

Thanks to the Associate Editor and the referees for some helpful comments on presentation and content. Thanks especially to Vincent Carey for organizing and encouraging the set of talks and papers of which this is part.

REFERENCES

Becker, R. A. and Chambers, J. M. (1977). Gr-z: A system of graphical subroutines for data analysis. In Proc. Interface Symp. on Statistics and Computing 10 409–415.

Becker, R. A. and Chambers, J. M. (1984). S: An Interactive Environment for Data Analysis and Graphics. Wadsworth, Belmont, CA.

Becker, R. A. and Chambers, J. M. (1985). Extending the S System. Wadsworth, Belmont, CA.

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Chapman & Hall, Boca Raton, FL.

Chambers, J. M. (1977). Computational Methods for Data Analysis. Wiley, New York. MR0659716
Chambers, J. M. (1987). Interface for a quantitative programming environment. In Comp. Sci. and Stat., Proc. 19th Symp. on the Interface 280–286.

Chambers, J. M. (1998). Programming with Data: A Guide to the S Language. Springer, New York.

Chambers, J. M. (2008). Software for Data Analysis: Programming with R. Springer, New York.

Chambers, J. M. and Hastie, T., eds. (1992). Statistical Models in S. Chapman & Hall, Boca Raton, FL.

Chambers, J. M., Mallows, C. L. and Stuck, B. W. (1976). A method for simulating stable random variables. J. Amer. Statist. Assoc. 71 340–344. MR0415982

Danenberg, P. (2011). rJavax: rJava extensions. R package version 0.3. Available at http://CRAN.R-project.org/package=rJavax.

Eddelbuettel, D. and François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40 1–18.

Gentleman, R. C., Carey, V. J., Bates, D. M. et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5 R80.

Grothendieck, G. and Bellosta, C. J. G. (2012). rJython: R interface to Python via Jython. R package version 0.0-4. Available at http://CRAN.R-project.org/package=rJython.

Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics. J. Comput. Graph. Statist. 5 299–314.

Odersky, M., Spoon, L. and Venners, B. (2010). Programming in Scala, 2nd ed. Artima, Walnut Creek, CA.

Python (2013). The Python Tutorial. Python. Available at http://docs.python.org/tutorial.

R Core Team (2013). R Language Definition. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-13-5. Available at http://cran.r-project.org/doc/manuals/R-lang.html/.

Shalit, A. (1996). The Dylan Reference Manual. Addison-Wesley, Reading, MA.

Temple Lang, D. (2014). Enhancing R with advanced compilation tools and methods. Statist. Sci. 29 181–200.

Tierney, L. (1990). LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. Wiley, New York.